LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation
AI Summary
LinguDistill recovers the linguistic ability that vision-language models lose during multimodal adaptation via knowledge distillation, without adding any extra modules.
Key Contributions
- Proposes LinguDistill, an adapter-free knowledge distillation method.
- Uses layer-wise KV-cache sharing to enable vision-conditioned supervision from the teacher model.
- Recovers linguistic ability while preserving visual grounding.
Methodology
The frozen original language model serves as the teacher. Layer-wise KV-cache sharing exposes the student's multimodal representations to the teacher, which then provides selective knowledge distillation supervision on language-intensive data.
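The mechanism above can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the single toy attention layer, the projection matrices (`Wk`, `Wv`, `Wq_t`, `W_head`), and the image/text token split are all illustrative assumptions. It shows the core idea only: the frozen teacher reuses the student's per-layer K/V cache (so it "sees" the multimodal context without architectural changes), and a KL distillation loss is applied selectively on the text positions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (toy, single head).
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
d, T_img, T_txt, V = 8, 4, 6, 16  # hidden dim, image/text token counts, toy vocab

# Student hidden states for one layer over a multimodal sequence:
# image tokens first, then text tokens.
h_student = rng.normal(size=(T_img + T_txt, d))

# Toy projection weights (assumed names, standing in for real model weights).
Wk, Wv, Wq_t = (rng.normal(size=(d, d)) for _ in range(3))

# Layer-wise KV-cache built from the *student's* multimodal representations.
# Sharing this cache with the frozen teacher is what makes the teacher's
# supervision vision-conditioned, without modifying either architecture.
k_cache, v_cache = h_student @ Wk, h_student @ Wv

# The frozen teacher forms its own queries (here: over text positions only)
# but attends into the shared student KV-cache.
q_teacher = h_student[T_img:] @ Wq_t
teacher_out = attention(q_teacher, k_cache, v_cache)

# Selective distillation: KL(teacher || student) over token distributions,
# applied only at language positions (the text tokens).
W_head = rng.normal(size=(d, V))
p_teacher = softmax(teacher_out @ W_head)
p_student = softmax(h_student[T_img:] @ W_head)
kl = (p_teacher * (np.log(p_teacher) - np.log(p_student))).sum(axis=-1)
loss = kl.mean()  # distillation loss, averaged over text tokens
```

In a real VLM the same pattern would repeat at every transformer layer (hence "layer-wise" cache sharing), and "selective" means the loss is applied only on language-intensive batches, leaving multimodal batches to the standard objective.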
Original Abstract
Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers $\sim$10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.