Multimodal Learning Relevance: 7/10

Quid est VERITAS? A Modular Framework for Archival Document Analysis

Leonardo Bassanini, Ludovico Biancardi, Alfio Ferrara, Andrea Gamberini, Sergio Picascia, Folco Vaglienti
arXiv: 2603.28108v1 Published: 2026-03-30 Updated: 2026-03-30

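AI Summary

The VERITAS framework reconceives document digitisation as an integrated workflow, improving transcription quality and supporting downstream applications.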

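Main Contributions

  • Proposes VERITAS, a modular framework for archival document analysis
  • Integrates transcription, layout analysis, and semantic enrichment
  • Validates the framework's effectiveness on a historical document dataset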

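Methodology

Uses a modular, model-agnostic architecture that digitises and analyses documents in four stages: preprocessing, extraction, refinement, and enrichment.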
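
To make the four-stage, model-agnostic design concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Page` class, the stage functions, the dictionary-based `schema`, and the `fake_transcribe` stand-in are all hypothetical names introduced for illustration. The idea shown is only that each stage is a plain function, so the transcription model can be swapped without touching the rest of the pipeline, and that the enrichment step is driven by a declarative schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Page:
    """Working state for one page as it moves through the stages (hypothetical)."""
    image_id: str
    text: str = ""
    fields: Dict[str, str] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


Stage = Callable[[Page], Page]


def preprocess(page: Page) -> Page:
    # Placeholder: a real stage might deskew or denoise the page image.
    page.log.append("preprocessing")
    return page


def make_extractor(transcribe: Callable[[str], str]) -> Stage:
    # Model-agnostic: any callable mapping an image id to text can be plugged in.
    def extract(page: Page) -> Page:
        page.text = transcribe(page.image_id)
        page.log.append("extraction")
        return page
    return extract


def refine(page: Page) -> Page:
    # Placeholder refinement: collapse runs of whitespace.
    page.text = " ".join(page.text.split())
    page.log.append("refinement")
    return page


def make_enricher(schema: Dict[str, Callable[[str], str]]) -> Stage:
    # Schema-driven: the schema names each output field and how to derive it.
    def enrich(page: Page) -> Page:
        for name, extractor in schema.items():
            page.fields[name] = extractor(page.text)
        page.log.append("enrichment")
        return page
    return enrich


def run_pipeline(page: Page, stages: List[Stage]) -> Page:
    # Apply the stages in order, passing the page from one to the next.
    for stage in stages:
        page = stage(page)
    return page


def fake_transcribe(image_id: str) -> str:
    # Stand-in for a real OCR or vision-language model call.
    return "  sample   transcribed  text  "


schema = {"first_word": lambda text: text.split()[0] if text else ""}
stages = [preprocess, make_extractor(fake_transcribe), refine, make_enricher(schema)]
result = run_pipeline(Page(image_id="page_0001"), stages)
```

Replacing `fake_transcribe` with a call to a different model changes only the argument to `make_extractor`, which is the sense in which this sketch is model-agnostic.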

Original Abstract

The digitisation of historical documents has traditionally been conceived as a process limited to character-level transcription, producing flat text that lacks the structural and semantic information necessary for substantive computational analysis. We present VERITAS (Vision-Enhanced Reading, Interpretation, and Transcription of Archival Sources), a modular, model-agnostic framework that reconceptualises digitisation as an integrated workflow encompassing transcription, layout analysis, and semantic enrichment. The pipeline is organised into four stages - Preprocessing, Extraction, Refinement, and Enrichment - and employs a schema-driven architecture that allows researchers to declaratively specify their extraction objectives. We evaluate VERITAS on the critical edition of Bernardino Corio's Storia di Milano, a Renaissance chronicle of over 1,600 pages. Results demonstrate that the pipeline achieves a 67.6% relative reduction in word error rate compared to a commercial OCR baseline, with a threefold reduction in end-to-end processing time when accounting for manual correction. We further illustrate the downstream utility of the pipeline's output by querying the transcribed corpus through a retrieval-augmented generation system, demonstrating its capacity to support historical inquiry.
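
The abstract reports a relative reduction in word error rate (WER) but does not say how WER was computed. The sketch below assumes the standard definition (word-level edit distance divided by the number of reference words) and defines relative reduction as the drop in WER divided by the baseline WER. The function names are illustrative and the code uses no figures from the paper.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error rate removed by the improved system."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline
```

Under this definition, a reported relative reduction of 67.6% means the pipeline's WER is 32.4% of the baseline's WER, whatever the absolute values are.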

Tags

Document Analysis, Digitisation, OCR, Historical Documents, Information Extraction

arXiv Categories

cs.DL cs.AI cs.IR