Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions
AI Summary
This paper proposes Align-TI, a novel knowledge-distillation framework for compressing multimodal large language models while improving performance.
Key Contributions
- Proposes the Align-TI framework, which performs knowledge distillation via token interactions
- Introduces the IVA module to align the models' visual-information extraction capability
- Introduces the TPA module to align token-to-token transition probabilities
Methodology
By aligning with the teacher model's vision-instruction token interactions and token-to-token interactions, the student model learns to imitate the teacher's knowledge and capabilities.
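The vision-instruction alignment (IVA) can be pictured as matching where teacher and student attend over visual patches. The paper's exact formulation is not given here; the following is a minimal sketch under the assumption that IVA compares instruction-to-vision attention maps as distributions over patches via KL divergence (the function names and shapes are illustrative, not the paper's API):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def iva_loss(teacher_attn, student_attn, eps=1e-8):
    """Hypothetical IVA sketch: align instruction->vision attention.

    teacher_attn, student_attn: [num_instruction_tokens, num_visual_patches]
    raw attention scores. Each row is normalized into a distribution over
    visual patches, then the student's distribution is pulled toward the
    teacher's salient regions with a per-token KL divergence.
    """
    p = softmax(teacher_attn)  # teacher's saliency over patches
    q = softmax(student_attn)  # student's saliency over the same patches
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return kl.mean()
```

In a real training loop these maps would come from the models' cross-attention layers (e.g. attention from instruction tokens to image tokens), and the loss would be added to the usual distillation objective.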
Original Abstract
Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods rely primarily on static next-token alignment, neglecting the dynamic token interactions that embed essential capabilities for multimodal understanding and generation. To this end, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components: IVA, which enables the student model to imitate the teacher's instruction-relevant visual-information extraction capability by aligning on salient visual regions, and TPA, which captures the teacher's dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI's superiority. Notably, our approach achieves a $2.6\%$ relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by $7.0\%$, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs. Code is available at https://github.com/lchen1019/Align-TI.
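One plausible reading of TPA's "token-to-token transition probabilities" is the probability each model assigns to every realized transition along the response, i.e. how strongly token $y_{t+1}$ follows from the prefix ending in $y_t$. The sketch below is an assumption-laden illustration of that idea (`tpa_loss` and its squared log-ratio objective are hypothetical, not the paper's definition): it extracts the per-step probability of the actual next token and matches the teacher's and student's transition sequences.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tpa_loss(teacher_logits, student_logits, response_ids, eps=1e-8):
    """Hypothetical TPA sketch: align realized token-to-token transitions.

    teacher_logits, student_logits: [seq_len, vocab_size], where row t is
    the distribution over the token at position t+1.
    response_ids: [seq_len] ids of the generated response tokens.
    The transition probability at step t is the model's probability of the
    actual next token; the loss penalizes log-scale mismatches between the
    teacher's and student's transition sequences.
    """
    steps = np.arange(len(response_ids) - 1)
    next_ids = response_ids[1:]
    p_teacher = softmax(teacher_logits)[steps, next_ids]
    p_student = softmax(student_logits)[steps, next_ids]
    return np.mean((np.log(p_student + eps) - np.log(p_teacher + eps)) ** 2)
```

Unlike vanilla next-token KD, which matches full vocabulary distributions position by position, this view emphasizes the sequential chain of transitions the teacher actually takes through the response.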