IMTBench: A Multi-Scenario Cross-Modal Collaborative Evaluation Benchmark for In-Image Machine Translation
AI Summary
This work presents IMTBench, a multi-scenario cross-modal evaluation benchmark for in-image machine translation, designed to assess the performance of end-to-end in-image translation systems.
Key Contributions
- Constructs IMTBench, a multi-scenario in-image translation benchmark dataset of 2,500 samples
- Proposes multi-aspect evaluation metrics, including translation quality, background preservation, image quality, and a cross-modal alignment score
- Evaluates commercial systems and open-source models, revealing performance gaps in existing systems across scenarios and languages
Methodology
Builds a dataset from real-world scenarios and designs multi-dimensional evaluation metrics, covering both single-modal and cross-modal aspects, to comprehensively assess the performance of IIMT systems.
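The cross-modal metric described above compares the model's translated text against the text actually rendered in the output image. A minimal sketch of one way such a score could be computed is below, assuming the rendered text has already been extracted via OCR; the function name `cross_modal_alignment` and the character-level similarity measure are illustrative choices, not IMTBench's actual scoring formula.

```python
from difflib import SequenceMatcher

def cross_modal_alignment(model_text: str, rendered_text: str) -> float:
    """Illustrative alignment score: character-level similarity between
    the model's translated text and the text OCR'd from the output image.
    (Hypothetical stand-in; the benchmark's real metric may differ.)"""
    # Normalize whitespace so layout-driven line breaks are not penalized.
    a = " ".join(model_text.split())
    b = " ".join(rendered_text.split())
    if not a and not b:
        return 1.0  # both empty: trivially aligned
    return SequenceMatcher(None, a, b).ratio()

# A faithfully rendered translation scores 1.0; dropped or corrupted
# characters in the rendered image lower the score toward 0.
print(cross_modal_alignment("Bonjour le monde", "Bonjour le monde"))  # → 1.0
```

In practice, such a score would depend on OCR quality for the target script, which is one reason a dedicated cross-modal metric is non-trivial for low-resource languages.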
Original Abstract
End-to-end In-Image Machine Translation (IIMT) aims to convert text embedded within an image into a target language while preserving the original visual context, layout, and rendering style. However, existing IIMT benchmarks are largely synthetic and thus fail to reflect real-world complexity, while current evaluation protocols focus on single-modality metrics and overlook cross-modal faithfulness between rendered text and model outputs. To address these shortcomings, we present the In-Image Machine Translation Benchmark (IMTBench), a new benchmark of 2,500 image translation samples covering four practical scenarios and nine languages. IMTBench supports multi-aspect evaluation, including translation quality, background preservation, overall image quality, and a cross-modal alignment score that measures consistency between the translated text produced by the model and the text rendered in the translated image. We benchmark strong commercial cascade systems and both closed- and open-source unified multi-modal models, and observe large performance gaps across scenarios and languages, especially on natural scenes and resource-limited languages, highlighting substantial headroom for end-to-end image text translation. We hope IMTBench establishes a standardized benchmark to accelerate progress in this emerging task.