Multimodal Learning Relevance: 9/10

SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia

Pengfei Yue, Xingran Zhao, Juntao Chen, Peng Hou, Wang Longchao, Jianghang Lin, Shengchuan Zhang, Anxiang Zeng, Liujuan Cao
arXiv: 2603.15409v1 Published: 2026-03-16 Updated: 2026-03-16

AI Summary

SEA-Vision is a comprehensive benchmark for multilingual document and scene text understanding in Southeast Asia.

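Main Contributions

  • Builds SEA-Vision, a document and scene text understanding benchmark covering 11 Southeast Asian languages
  • SEA-Vision comprises two tasks: Document Parsing and Text-Centric Visual Question Answering (TEC-VQA)
  • Proposes a hybrid annotation pipeline that combines automated filtering, MLLM-assisted labeling, and human verification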

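Methodology

Combines automated filtering, MLLM-assisted labeling, and human verification to build a multilingual, multi-task dataset, then evaluates leading multimodal models on it.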
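
The three stages named above (automated filtering, MLLM-assisted labeling, human verification) can be pictured as a simple staged pipeline. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the `Sample` fields, the `min_score` threshold, and the `mllm_label` and `human_verify` callables are all hypothetical placeholders that do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Sample:
    """One candidate image. All fields are hypothetical, for illustration only."""
    image_id: str
    language: str
    quality_score: float          # assumed automated score in [0, 1]
    label: Optional[str] = None   # draft annotation from the MLLM-assisted step
    verified: bool = False        # outcome of the human verification step


def run_hybrid_pipeline(
    samples: List[Sample],
    mllm_label: Callable[[Sample], str],
    human_verify: Callable[[Sample], bool],
    min_score: float = 0.5,       # assumed threshold, not taken from the paper
) -> List[Sample]:
    # Stage 1: automated filtering drops low-scoring candidates.
    kept = [s for s in samples if s.quality_score >= min_score]

    # Stage 2: a model proposes a draft label for each remaining sample.
    for s in kept:
        s.label = mllm_label(s)

    # Stage 3: a human reviewer accepts or rejects each draft label.
    for s in kept:
        s.verified = human_verify(s)

    # Only samples that pass verification enter the final dataset.
    return [s for s in kept if s.verified]
```

Ordering the stages this way means the cheapest check runs first, so the human reviewer only sees samples that already passed the automated filter and carry a draft label, which is one plausible reading of how such a pipeline reduces manual effort.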

Original Abstract

Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive global progress in document and scene text understanding.

Tags

Multilingual, Document Understanding, Scene Text, Visual Question Answering, Southeast Asia

arXiv Categories

cs.CL