Multimodal Learning Relevance: 9/10

Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition

Jinlong Ma, Yu Zhang, Xuefeng Bai, Kehai Chen, Yuwei Wang, Zeming Liu, Jun Yu, Min Zhang
arXiv: 2602.04486v1 Published: 2026-02-04 Updated: 2026-02-04

AI Summary

This paper proposes a new Multimodal Large Language Model (MLLM) based approach that addresses the modality bias problem in Grounded Multimodal Named Entity Recognition (GMNER) and improves performance.

Key Contributions

  • Reveals the modality bias problem that MLLMs exhibit in GMNER, covering both visual bias and textual bias
  • Proposes Modality-aware Consistency Reasoning (MCR), comprising MRSI and CVO
  • Designs MRSI (Multi-style Reasoning Schema Injection), which transforms abstract constraints into executable reasoning chains
  • Uses CVO (Constraint-guided Verifiable Optimization) with GRPO so the model can dynamically adjust its reasoning trajectories

Methodology

Proposes MCR, which injects multi-style reasoning patterns via MRSI and optimizes reasoning trajectories via CVO combined with GRPO, thereby mitigating modality bias.
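The GRPO component used in CVO scores each sampled reasoning trajectory against the other trajectories in its sampling group rather than against a learned value function. A minimal sketch of the group-relative advantage computation at the heart of GRPO is shown below; the reward values and their interpretation (combining entity, category, and grounding correctness) are illustrative assumptions, not taken from the paper.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each trajectory's reward
    against its own sampling group, A_i = (r_i - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical example: 4 reasoning trajectories sampled for one
# image-text input, each scored by a verifiable reward (e.g. whether
# the predicted entity span, type, and visual region are all correct).
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
# Trajectories above the group mean get positive advantage and are
# reinforced; those below get negative advantage and are suppressed.
```

Because the baseline is the group mean itself, no critic network is needed; the policy is updated to favor trajectories whose verifiable rewards beat their peers.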

Original Abstract

Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. In this work, we explore the potential of Multimodal Large Language Models (MLLMs) to perform GMNER in an end-to-end manner, moving beyond their typical role as auxiliary tools within cascaded pipelines. Crucially, our investigation reveals a fundamental challenge: MLLMs exhibit $\textbf{modality bias}$, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts rather than rigorous cross-modal verification. To address this, we propose Modality-aware Consistency Reasoning ($\textbf{MCR}$), which enforces structured cross-modal reasoning through Multi-style Reasoning Schema Injection (MRSI) and Constraint-guided Verifiable Optimization (CVO). MRSI transforms abstract constraints into executable reasoning chains, while CVO empowers the model to dynamically align its reasoning trajectories with Group Relative Policy Optimization (GRPO). Experiments on GMNER and visual grounding tasks demonstrate that MCR effectively mitigates modality bias and achieves superior performance compared to existing baselines.

Tags

GMNER MLLM Multimodal Modality-Bias Consistency-Reasoning

arXiv Categories

cs.CL