Multimodal Learning (Relevance: 9/10)

From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers

Ayan Sengupta, Shantanu Dixit, Md Shad Akhtar, Tanmoy Chakraborty
arXiv: 2603.10877v1 Published: 2026-03-11 Updated: 2026-03-11

AI Summary

The ARMADA framework effectively transfers knowledge from vision-language models to language-only models, without requiring expensive multimodal pre-training.

Main Contributions

  • Proposes ARMADA, a cross-modal knowledge distillation framework
  • Requires no multimodal pre-training and no modification of the teacher model
  • Validates effectiveness across multiple NLP tasks

Methodology

Proposes a novel alignment technique that distills knowledge from vision-language models into language models without altering the teacher model.
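The summary does not spell out ARMADA's alignment losses, but knowledge distillation frameworks of this kind typically build on soft-label matching between teacher and student output distributions. As a minimal, hypothetical sketch (function names and logits are illustrative, not from the paper), here is the classic temperature-scaled KL distillation objective that such methods extend:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; a higher temperature softens the distribution,
    # exposing more of the teacher's "dark knowledge" about non-top classes.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(p_teacher || p_student), scaled by T^2 as in standard
    soft-label knowledge distillation (Hinton et al., 2015)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt + 1e-12) - math.log(ps + 1e-12))
             for pt, ps in zip(p_t, p_s))
    return temperature ** 2 * kl

teacher = [2.0, 1.0, 0.1]   # hypothetical teacher logits for one prediction
print(kd_loss(teacher, teacher))          # identical logits -> ~0 loss
print(kd_loss([0.1, 1.0, 2.0], teacher))  # mismatched logits -> positive loss
```

Note that this generic objective needs the teacher's output distribution only, not its internals, which is consistent with the paper's claim of supporting black-box teachers; ARMADA's cross-modal alignment on top of this is not detailed in the summary.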

Original Abstract

Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller models, ensuring computational efficiency without significantly dropping performance. Traditional KD techniques assume homogeneity in modalities between the teacher (source) and the student (target) models. On the other hand, existing multimodal knowledge distillation methods require modality-specific pre-training of the teacher model, which is computationally infeasible in most cases. In this paper, we introduce ARMADA, an efficient cross-modal knowledge distillation framework designed to transfer knowledge from large vision-language models, including black-box models, to language-only models. Unlike existing KD techniques that rely on the internal structures of multimodal teachers or require computationally expensive pre-training, ARMADA leverages novel alignment techniques to distil knowledge without altering the teacher model, ensuring efficiency and scalability. We empirically validate ARMADA on twelve natural language understanding, eight complex generative reasoning and five instruction-tuning tasks, demonstrating consistent performance improvements in large models such as DeBERTa-v2-1.4B, OPT-1.3B, LLaMA-{3B, 7B, 8B}. ARMADA achieves up to 3.4% improvement on language understanding tasks and 2.6% boost in generative reasoning, all without requiring expensive multimodal pre-training or fine-tuning of the teacher model. Our findings challenge conventional knowledge distillation paradigms by demonstrating that even vision-language models, despite lacking direct textual understanding, can significantly enhance language models when distilled appropriately.

Tags

Knowledge Distillation · Cross-Modal Learning · Language Models · Vision-Language Models

arXiv Categories

cs.CL