Multimodal Learning · Relevance: 9/10

SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

Ziyu Chen, Yilun Zhao, Chengye Wang, Rilyn Han, Manasi Patwardhan, Arman Cohan
arXiv: 2603.12249v1 · Published: 2026-03-12 · Updated: 2026-03-12

AI Summary

SciMDR introduces a synthesize-and-reground framework for building a large-scale scientific multimodal document reasoning dataset, improving model performance on scientific QA tasks.

Key Contributions

  • Propose the synthesize-and-reground framework
  • Construct SciMDR, a large-scale scientific multimodal document reasoning dataset
  • Build SciMDR-Eval, an expert-annotated evaluation benchmark

Methodology

A two-stage pipeline: (1) Claim-Centric QA Synthesis generates faithful, isolated QA pairs from focused document segments; (2) Document-Scale Regrounding re-embeds these pairs into full-length documents to add realistic complexity.
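The two-stage pipeline above can be sketched as follows. This is a minimal illustrative sketch only: the function names, data shapes, and the page-lookup heuristic are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of a synthesize-and-reground pipeline.
# All names and data structures below are illustrative assumptions.

def synthesize_qa(segment: dict) -> dict:
    """Stage 1 (Claim-Centric QA Synthesis): turn one focused segment
    (e.g. a figure or table plus its claim) into an isolated QA pair
    with an explicit reasoning chain."""
    return {
        "question": f"What does the paper claim about {segment['topic']}?",
        "answer": segment["claim"],
        "reasoning": [f"Locate {segment['source']}", "Read off the claim"],
        "source": segment["source"],
    }

def reground(qa_pairs: list[dict], document: list[str]) -> list[dict]:
    """Stage 2 (Document-Scale Regrounding): re-embed each isolated QA
    pair into the full document, so answering requires locating the
    supporting segment among all the other pages."""
    tasks = []
    for qa in qa_pairs:
        tasks.append({
            **qa,
            "context": document,  # full paper, not just the source snippet
            "target_page": document.index(qa["source"]),  # evidence location
        })
    return tasks

# Toy run over a two-"page" document.
doc = ["Figure 2", "Table 1"]
segments = [{"topic": "accuracy",
             "claim": "Accuracy improves by 5%",
             "source": "Table 1"}]
tasks = reground([synthesize_qa(s) for s in segments], doc)
print(tasks[0]["target_page"])  # → 1
```

The key design point this sketch illustrates is that faithfulness is handled in stage 1 (each QA pair is tied to one verified segment) while realism is handled in stage 2 (the task context is widened to the whole document).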

Original Abstract

Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.

Tags

Multimodal Learning · Document Reasoning · Scientific QA

arXiv Categories

cs.CL cs.AI cs.CV