Multimodal Learning (Relevance: 9/10)

VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

Udi Barzelay, Ophir Azulai, Inbar Shapira, Idan Friedman, Foad Abo Dahood, Madison Lee, Abraham Daniels
arXiv: 2603.15118v1 Published: 2026-03-16 Updated: 2026-03-16

AI Summary

VAREX is a benchmark for evaluating multimodal models on extracting structured data from government forms.

Key Contributions

  • Proposes the VAREX benchmark for evaluating the structured data extraction capabilities of multimodal models
  • Uses a Reverse Annotation pipeline to generate deterministic ground truth
  • Provides four input modalities: plain text, layout-preserving text, document image, and combined text and image (a sketch of the layout-preserving rendering follows this list)
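
The abstract describes the layout-preserving modality as whitespace-aligned text that approximates column positions. Below is a minimal sketch of one way such a rendering could be produced from word bounding boxes; the `words` input format, the character-cell width, and the row-bucketing heuristic are all assumptions for illustration, not the authors' actual pipeline.

```python
# Sketch: render word boxes as whitespace-aligned text.
# Assumes `words` is a list of (text, x0, y0) tuples from any OCR or
# PDF parser, with coordinates in points. The cell/line sizes are
# illustrative assumptions, not VAREX's actual method.
from collections import defaultdict

def layout_text(words, cell_w=6.0, line_h=12.0):
    rows = defaultdict(list)
    for text, x0, y0 in words:
        rows[round(y0 / line_h)].append((x0, text))  # bucket words by row
    out = []
    for row in sorted(rows):
        buf = ""
        for x0, text in sorted(rows[row]):
            col = int(x0 / cell_w)                   # approximate column
            buf += " " * max(col - len(buf), 1 if buf else 0) + text
        out.append(buf)
    return "\n".join(out)

print(layout_text([("Name:", 20, 40), ("Jane Doe", 120, 40),
                   ("Date:", 20, 55), ("2026-03-16", 120, 55)]))
```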

Methodology

A Reverse Annotation pipeline programmatically fills PDF templates with synthetic values, producing ground truth that is deterministic by construction and validated through three-phase quality assurance.
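
The core idea is that the values written into the form are also the labels, so no human annotation is needed. A minimal sketch under stated assumptions: the pypdf calls below are real library API, but the template path, field names, and value generator are hypothetical, and the paper's three-phase quality assurance is not reproduced here.

```python
# Sketch of reverse annotation: fill a PDF form with synthetic values
# and keep the fill dictionary itself as deterministic ground truth.
# "template.pdf" and synth_value() are hypothetical placeholders.
import json
import random

from pypdf import PdfReader, PdfWriter

def synth_value(field_name: str) -> str:
    # Placeholder generator; VAREX's actual value synthesis is richer.
    return f"{field_name}-{random.randint(1000, 9999)}"

reader = PdfReader("template.pdf")
fields = reader.get_fields() or {}                 # AcroForm field names
ground_truth = {name: synth_value(name) for name in fields}

writer = PdfWriter()
writer.append(reader)                              # copy all pages
for page in writer.pages:
    writer.update_page_form_field_values(page, ground_truth)

with open("filled.pdf", "wb") as f:
    writer.write(f)
with open("ground_truth.json", "w") as f:
    json.dump(ground_truth, f, indent=2)           # labels come "for free"
```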

Original Abstract

We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth validated through three-phase quality assurance. The benchmark comprises 1,777 documents with 1,771 unique schemas across three structural categories, each provided in four input modalities: plain text, layout-preserving text (whitespace-aligned to approximate column positions), document image, or both text and image combined. Unlike existing benchmarks that evaluate from a single input representation, VAREX provides four controlled modalities per document, enabling systematic ablation of how input format affects extraction accuracy -- a capability absent from prior benchmarks. We evaluate 20 models from frontier proprietary models to small open models, with particular attention to models <=4B parameters suitable for cost-sensitive and latency-constrained deployment. Results reveal that (1) below 4B parameters, structured output compliance -- not extraction capability -- is a dominant bottleneck; in particular, schema echo (models producing schema-conforming structure instead of extracted values) depresses scores by 45-65 pp (percentage points) in affected models; (2) extraction-specific fine-tuning at 2B yields +81 pp gains, demonstrating that the instruction-following deficit is addressable without scale; (3) layout-preserving text provides the largest accuracy gain (+3-18 pp), exceeding pixel-level visual cues; and (4) the benchmark most effectively discriminates models in the 60-95% accuracy band. Dataset and evaluation code are publicly available.
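
The schema-echo failure mode reported in the abstract (a model returning the schema's structure filled with placeholders instead of extracted values) suggests a simple automated check. The sketch below is an assumed heuristic for flagging such outputs, not the benchmark's official scorer.

```python
# Heuristic schema-echo check: flag outputs whose leaf values are all
# empty or placeholder strings, i.e. the model echoed the schema's
# structure rather than extracting values. An assumed heuristic, not
# VAREX's actual scoring code.
PLACEHOLDERS = {"", "string", "null", "none", "<value>", "..."}

def leaves(node):
    if isinstance(node, dict):
        for v in node.values():
            yield from leaves(v)
    elif isinstance(node, list):
        for v in node:
            yield from leaves(v)
    else:
        yield node

def is_schema_echo(output: dict) -> bool:
    vals = [str(v).strip().lower() for v in leaves(output)]
    return bool(vals) and all(v in PLACEHOLDERS for v in vals)

assert is_schema_echo({"name": "string", "items": [{"qty": ""}]})
assert not is_schema_echo({"name": "Jane Doe", "items": [{"qty": "2"}]})
```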

Tags

Multimodal Learning · Information Extraction · Benchmarking · Structured Data

arXiv Categories

cs.CV