Multimodal Learning · Relevance: 9/10

Global-Local Dual Perception for MLLMs in High-Resolution Text-Rich Image Translation

Junxin Lu, Tengfei Song, Zhanglin Wu, Pengfei Li, Xiaowei Liang, Hui Yang, Kun Chen, Ning Xie, Yunfei Lu, Jing Zhao, Shiliang Sun, Daimeng Wei
arXiv: 2602.21956v1 · Published: 2026-02-25 · Updated: 2026-02-25

AI Summary

GLoTran improves the performance of MLLMs on high-resolution, text-rich image translation through global-local dual perception.

Key Contributions

  • Proposes the GLoTran framework, which enhances visual perception using a global image together with local text slices
  • Constructs GLoD, a large-scale dataset for high-resolution, text-rich image translation
  • Experiments show that GLoTran outperforms existing MLLMs in translation completeness and accuracy

Methodology

GLoTran trains an MLLM with an instruction-guided alignment strategy that combines a low-resolution global image with multi-scale local text-image slices.
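To make the dual-perception idea concrete, the following is a minimal sketch of how such an input could be assembled: one downsampled global view plus the same text region cropped at several context scales. All function names, the grid-based image representation, and the scale scheme are illustrative assumptions, not the paper's actual implementation (which would operate on real image tensors fed to a vision encoder).

```python
# Hypothetical sketch of a GLoTran-style dual-perception input.
# Images are modeled as 2D grids (lists of lists) for simplicity.

def downsample(image, factor):
    """Low-resolution global view: keep every `factor`-th row and column."""
    return [row[::factor] for row in image[::factor]]

def crop(image, top, left, height, width):
    """Extract one region-level slice (e.g., around a detected text box)."""
    return [row[left:left + width] for row in image[top:top + height]]

def multiscale_slices(image, box, scales=(1, 2)):
    """Crop the same text region at several context scales:
    scale 1 = tight box, scale 2 = box expanded with surrounding context."""
    top, left, h, w = box
    slices = []
    for s in scales:
        pad_h, pad_w = (s - 1) * h // 2, (s - 1) * w // 2
        t, l = max(0, top - pad_h), max(0, left - pad_w)
        slices.append(crop(image, t, l, h + 2 * pad_h, w + 2 * pad_w))
    return slices

def build_dual_perception_input(image, text_boxes, global_factor=4):
    """Combine one low-res global image with multi-scale local slices."""
    return {
        "global": downsample(image, global_factor),
        "local": [multiscale_slices(image, box) for box in text_boxes],
    }

# Example: a 16x16 "image" with one text region at (4, 4) of size 4x4.
image = [[r * 16 + c for c in range(16)] for r in range(16)]
inputs = build_dual_perception_input(image, [(4, 4, 4, 4)])
print(len(inputs["global"]))    # 4 rows after 4x downsampling
print(len(inputs["local"][0]))  # 2 scales for the one text box
```

The key design point is that both views describe the same image: the global view preserves scene-level layout for contextual consistency, while the multi-scale slices give the model high-resolution access to each text region.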

Original Abstract

Text Image Machine Translation (TIMT) aims to translate source-language text embedded in images into the target language, requiring synergistic integration of visual perception and linguistic understanding. Existing TIMT methods, whether cascaded pipelines or end-to-end multimodal large language models (MLLMs), struggle with high-resolution text-rich images due to cluttered layouts, diverse fonts, and non-textual distractions, resulting in text omission, semantic drift, and contextual inconsistency. To address these challenges, we propose GLoTran, a global-local dual visual perception framework for MLLM-based TIMT. GLoTran integrates a low-resolution global image with multi-scale region-level text image slices under an instruction-guided alignment strategy, conditioning MLLMs to maintain scene-level contextual consistency while faithfully capturing fine-grained textual details. Moreover, to realize this dual-perception paradigm, we construct GLoD, a large-scale text-rich TIMT dataset comprising 510K high-resolution global-local image-text pairs covering diverse real-world scenarios. Extensive experiments demonstrate that GLoTran substantially improves translation completeness and accuracy over state-of-the-art MLLMs, offering a new paradigm for fine-grained TIMT under high-resolution and text-rich conditions.

Tags

MLLM · Image Translation · Multimodal Learning · High-Resolution Images · Text Recognition

arXiv Categories

cs.CV