Multimodal Learning — Relevance: 9/10

DAIT: Distillation from Vision-Language Models to Lightweight Classifiers with Adaptive Intermediate Teacher Transfer

Zhengxu He, Jun Li, Zhijian Wu
arXiv: 2603.15166v1 · Published: 2026-03-16 · Updated: 2026-03-16

AI Summary

Proposes DAIT, which uses an intermediate teacher network to adaptively transfer VLM knowledge to lightweight classifiers, improving fine-grained image classification performance.

Key Contributions

  • Proposes the DAIT framework, addressing the alignment problem in distilling knowledge from VLMs to lightweight models
  • Introduces a trainable intermediate teacher network that extracts task-relevant, discriminative visual cues
  • Validates the effectiveness of DAIT on multiple fine-grained image classification datasets

Methodology

DAIT trains an intermediate teacher network to learn the VLM's representations and adaptively enhance discriminative visual information, then distills that knowledge into a lightweight model.
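The summary does not specify DAIT's exact loss, but the final distillation step from an intermediate teacher to a lightweight student typically uses a temperature-scaled KL-divergence objective on logits. The sketch below is a minimal, generic illustration of that standard KD loss in NumPy; the function names, the temperature value, and the use of logit-based KD here are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened class distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, T)                  # soft targets from the teacher
    log_q = np.log(softmax(student_logits, T) + 1e-12)
    log_p = np.log(p + 1e-12)
    return (T ** 2) * np.mean(np.sum(p * (log_p - log_q), axis=-1))

# When student and teacher logits agree, the distillation loss vanishes.
t = np.array([[2.0, 0.5, -1.0]])
s = np.array([[0.0, 1.0, 0.0]])
print(kd_loss(t, t))  # ~0.0
print(kd_loss(s, t))  # > 0, penalizes disagreement with the soft targets
```

In a setup like DAIT's, the teacher logits would come from the task-supervised intermediate teacher rather than the frozen VLM directly, which is what keeps the transferred signal compact and task-aligned.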

Original Abstract

Large-scale Vision-Language Models (VLMs) encode rich multimodal semantics that are highly beneficial for fine-grained visual categorization (FGVC). However, their prohibitive computational cost hinders practical deployment in resource-constrained environments. Although knowledge distillation helps transfer VLM capacity to lightweight classifiers, conventional distillation mechanisms, which transfer directly from a generic VLM to a compact student, often yield suboptimal results due to severe architectural misalignment and the introduction of task-irrelevant information. To alleviate this limitation, we propose Distillation with Adaptive Intermediate Teacher transfer (DAIT) in this study, facilitating adaptive knowledge transfer from VLMs to lightweight students. DAIT introduces a trainable intermediate teacher that learns to transfer frozen VLM representations under explicit supervision from the target fine-grained task. This intermediate teacher adaptively enhances discriminative visual cues, thereby producing compact and task-aligned knowledge that can be reliably distilled into lightweight models. Extensive evaluations on multiple FGVC benchmarks with diverse student architectures demonstrate that our method achieves respective performance gains of 12.63% and 8.34% on the FGVC-Aircraft and CUB-200-2011 datasets, establishing DAIT as a principled paradigm for transferring from general-purpose VLMs to deployable fine-grained recognition models.

Tags

Knowledge Distillation · Vision-Language Models · Fine-Grained Image Classification · Intermediate Teacher

arXiv Categories

cs.CV