Multimodal Learning Relevance: 9/10

iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding

HanZpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He
arXiv: 2603.02748v1 Published: 2026-03-03 Updated: 2026-03-03

AI Summary

iGVLM improves the performance of multimodal models on complex reasoning tasks through dynamic, instruction-guided vision encoding.

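Main Contributions

  • Proposes the iGVLM framework, which decouples a representation branch from a dynamic conditioning branch
  • Introduces MM4, a diagnostic probe for evaluating logical consistency under multi-query, multi-instruction settings
  • Validates the effectiveness of iGVLM across different language backbones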

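Methodology

Uses a dual-branch architecture for instruction-guided visual modulation: a frozen representation branch, plus a dynamic conditioning branch implemented with Adaptive Layer Normalization (AdaLN).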
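To make the AdaLN idea concrete, here is a minimal, dependency-free sketch of affine feature modulation conditioned on an instruction vector. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy dimensions, the `(1 + scale)` parameterization, and the zero-initialized projections are all hypothetical choices borrowed from common AdaLN practice. The summary does not say how iGVLM parameterizes or initializes its conditioning branch, nor how the two branches are combined.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weight, bias, vec):
    """Plain matrix-vector product plus bias: weight is rows x len(vec)."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def adaln(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Adaptive LayerNorm: scale and shift are predicted from `cond`.

    Assumed form (hypothetical, not confirmed by the source):
        y = (1 + scale(cond)) * LayerNorm(x) + shift(cond)
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    normed = layer_norm(x)
    return [(1.0 + s) * n + t for n, s, t in zip(normed, scale, shift)]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Toy visual token and instruction embedding (made-up numbers).
token = [0.5, -1.0, 2.0, 0.0]
instruction = [1.0, -0.5, 0.25]
d, k = len(token), len(instruction)

# With zero projections the modulation is the identity on LayerNorm(x),
# so the output does not depend on the instruction at all.
neutral = adaln(token, instruction, zeros(d, k), [0.0] * d,
                zeros(d, k), [0.0] * d)

# Non-zero projections make the same token instruction-dependent.
w_scale = [[0.1] * k for _ in range(d)]
w_shift = [[0.2] * k for _ in range(d)]
modulated = adaln(token, instruction, w_scale, [0.0] * d,
                  w_shift, [0.0] * d)
other = adaln(token, [0.0, 1.0, 1.0], w_scale, [0.0] * d,
              w_shift, [0.0] * d)
```

In this sketch, `neutral` equals the plain layer-normalized token, while `modulated` and `other` differ because the instruction vectors differ. Starting from zero projections is one way such a branch could leave pre-trained features untouched at first and then gradually learn instruction-specific modulation; whether iGVLM does this is an assumption here.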

Original Abstract

Despite the success of Large Vision-Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are utilized in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this issue, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, we introduce MM4, a controlled diagnostic probe for quantifying logical consistency under multi-query, multi-instruction settings. Extensive results show that iGVLM consistently enhances instruction sensitivity across diverse language backbones, offering a plug-and-play paradigm for bridging passive perception and active reasoning.

Tags

Multimodal Learning, Vision-Language Models, Instruction Learning, Dynamic Vision Encoding

arXiv Categories

cs.CV cs.AI