Multimodal Learning · Relevance: 10/10

Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions

Lin Chen, Xiaoke Zhao, Kun Ding, Weiwei Feng, Changtao Miao, Zili Wang, Wenxuan Guo, Ying Wang, Kaiyuan Zheng, Bo Zhang, Zhe Li, Shiming Xiang
arXiv: 2602.09483v1 · Published: 2026-02-10 · Updated: 2026-02-10

AI Summary

This paper proposes Align-TI, a novel knowledge distillation framework for compressing multimodal large language models while improving their performance.

Key Contributions

  • Proposes the Align-TI framework, which performs knowledge distillation via token interactions
  • Introduces the IVA module to align the models' visual information extraction capability
  • Introduces the TPA module to align token-to-token transition probabilities
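
The IVA idea of aligning the student to the teacher on salient visual regions could be sketched as follows. This is a minimal hypothetical instantiation, not the paper's implementation: the function name `iva_loss`, the top-k salience selection, and the KL form are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def iva_loss(teacher_attn, student_attn, top_k=4):
    """Hypothetical IVA-style loss: align student attention with the
    teacher's on the teacher's most salient visual regions.

    teacher_attn, student_attn: raw attention scores of shape
    (num_instruction_tokens, num_visual_tokens).
    """
    t = softmax(teacher_attn)
    s = softmax(student_attn)
    # Salient visual regions: top-k visual tokens by total teacher
    # attention mass across all instruction tokens.
    salience = t.sum(axis=0)
    idx = np.argsort(salience)[-top_k:]
    # KL(teacher || student) restricted to the salient regions,
    # with both slices renormalized into distributions.
    t_sel = t[:, idx] / t[:, idx].sum(axis=1, keepdims=True)
    s_sel = s[:, idx] / s[:, idx].sum(axis=1, keepdims=True)
    kl = np.sum(t_sel * (np.log(t_sel + 1e-9) - np.log(s_sel + 1e-9)), axis=1)
    return float(np.mean(kl))
```

In practice the attention maps would come from cross-attention layers of the teacher and student MLLMs; here they are plain arrays so the loss shape is easy to inspect.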

Methodology

By aligning the teacher model's vision-instruction token interactions and token-to-token interactions, the student model learns to imitate the teacher's knowledge and capabilities.
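
The TPA side, aligning sequential token-to-token transition probabilities, admits a simple sketch. This is one plausible reading of the abstract, not the paper's actual loss: the function `tpa_loss`, the choice to read the transition probability of each realized step off the next-token distribution, and the log-space squared error are assumptions.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def tpa_loss(teacher_logits, student_logits, response_ids):
    """Hypothetical TPA-style loss: match the teacher's sequential
    token-to-token transition probabilities over the response.

    teacher_logits, student_logits: (seq_len, vocab_size) next-token
    logits; response_ids: (seq_len,) generated token ids.
    """
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    # Transition probability of each realized step t_i -> t_{i+1},
    # read off the next-token distribution at position i.
    steps = np.arange(len(response_ids) - 1)
    t_trans = t[steps, response_ids[1:]]
    s_trans = s[steps, response_ids[1:]]
    # Align the two transition sequences in log space.
    return float(np.mean((np.log(t_trans) - np.log(s_trans)) ** 2))
```

A training loop would combine a loss of this kind with the IVA-style alignment and the standard next-token KD objective; the relative weights are not given in this summary.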

Original Abstract

Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods primarily rely on static next-token alignment, neglecting the dynamic token interactions, which embed essential capabilities for multimodal understanding and generation. To this end, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components: IVA enables the student model to imitate the teacher's instruction-relevant visual information extraction capability by aligning on salient visual regions. TPA captures the teacher's dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI's superiority. Notably, our approach achieves $2.6\%$ relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by $7.0\%$, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs. Code is available at https://github.com/lchen1019/Align-TI.

Tags

Knowledge Distillation · Multimodal Learning · Large Language Models

arXiv Category

cs.CV