Multimodal Learning — Relevance: 9/10

Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis

Junhyeok Choi, Sangwoo Mo, Minwoo Chae
arXiv: 2602.19756v1 Published: 2026-02-23 Updated: 2026-02-23

AI Summary

Proposes a learning-free multimodal dataset distillation framework based on prototype-guided data synthesis, improving cross-architecture generalization.

Key Contributions

  • Proposes a learning-free multimodal dataset distillation framework
  • Uses CLIP to extract aligned image-text embeddings and derive prototypes
  • Uses an unCLIP decoder to synthesize images, enabling efficient distillation

Methodology

Leveraging CLIP and unCLIP, the method distills a multimodal dataset by extracting prototypes from aligned image-text embeddings and synthesizing images from them, with no training or optimization required.
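The prototype-extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' code: `extract_prototypes` is a hypothetical helper, random vectors stand in for real CLIP embeddings, and the unCLIP decoding step is only indicated in a comment.

```python
import numpy as np

def extract_prototypes(embeddings, labels):
    """Average L2-normalized embeddings per class and re-normalize.

    In the paper's pipeline the embeddings would come from CLIP's
    image/text encoders; here random vectors stand in for them.
    """
    protos = {}
    for c in np.unique(labels):
        cls = embeddings[labels == c]
        # Project each embedding onto the unit sphere, as CLIP features
        # are typically compared after L2 normalization.
        cls = cls / np.linalg.norm(cls, axis=1, keepdims=True)
        mean = cls.mean(axis=0)
        protos[int(c)] = mean / np.linalg.norm(mean)  # class prototype
    return protos

# Stand-in for CLIP image embeddings: 6 samples, 2 classes, dim 512.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 512))
lab = np.array([0, 0, 0, 1, 1, 1])
prototypes = extract_prototypes(emb, lab)
# Each prototype embedding would then be fed to an unCLIP decoder
# to synthesize one representative image per class, forming the
# distilled dataset without any gradient-based optimization.
```

Because the prototypes live in CLIP's shared image-text space, the synthesized images stay aligned with their class text embeddings across downstream architectures.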

Original Abstract

Recent advances in multimodal learning have achieved remarkable success across diverse vision-language tasks. However, such progress heavily relies on large-scale image-text datasets, making training costly and inefficient. Prior efforts in dataset filtering and pruning attempt to mitigate this issue, but still require relatively large subsets to maintain performance and fail under very small subsets. Dataset distillation offers a promising alternative, yet existing multimodal dataset distillation methods require full-dataset training and joint optimization of image pixels and text features, making them architecture-dependent and limiting cross-architecture generalization. To overcome this, we propose a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization while enhancing generalization across architectures. Our method uses CLIP to extract aligned image-text embeddings, obtains prototypes, and employs an unCLIP decoder to synthesize images, enabling efficient and scalable multimodal dataset distillation. Extensive experiments demonstrate that our approach consistently outperforms optimization-based dataset distillation and subset selection methods, achieving state-of-the-art cross-architecture generalization.

Tags

Multimodal Learning · Dataset Distillation · CLIP · Prototype Learning · Data Synthesis

arXiv Category

cs.CV