Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment
AI Summary
Addresses the local misalignment problem of CLIP models in CDFSL by proposing cycle-consistency learning and a Semantic Anchor mechanism, improving local vision-language alignment and interpretability.
Key Contributions
- Identifies the local misalignment problem of CLIP models in CDFSL
- Proposes cycle-consistency learning, which uses self-supervised information for local vision-language alignment
- Proposes a Semantic Anchor mechanism that filters noise in the visual modality to improve alignment
Methodology
Fine-tunes CLIP within a CDFSL framework under a cycle-consistency constraint and a Semantic Anchor mechanism, improving local vision-language alignment.
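The summary does not spell out how the cycle-consistency constraint is implemented; a minimal NumPy sketch of the core idea follows, with hypothetical linear maps `W_v2t` / `W_t2v` standing in for whatever learned translators the paper actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: local visual (patch) features and text features
d_v, d_t, n_patches = 8, 6, 4

# Hypothetical linear "translators" between modalities (learned in practice)
W_v2t = 0.1 * rng.normal(size=(d_v, d_t))  # visual -> text
W_t2v = 0.1 * rng.normal(size=(d_t, d_v))  # text -> visual

def cycle_consistency_loss(v):
    """Mean squared distance between local visual features and their
    visual -> text -> visual round trip (one direction of the cycle;
    the paper applies the constraint in both directions)."""
    t = v @ W_v2t        # translate patch features into the text space
    v_back = t @ W_t2v   # translate them back into the visual space
    return float(np.mean((v - v_back) ** 2))

patches = rng.normal(size=(n_patches, d_v))  # toy local visual features
loss = cycle_consistency_loss(patches)
```

In practice the loss would be minimized jointly with the fine-tuning objective, so the translators learn a round trip that preserves the original features.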
Original Abstract
Cross-Domain Few-Shot Learning (CDFSL) adapts models trained with large-scale general data (source domain) to downstream target domains with only scarce training data, where research on vision-language models (e.g., CLIP) is still in its early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, even though they can roughly focus on important regions in source domains. Although prior works have demonstrated CLIP's shortcomings in capturing subtle local patterns, in this paper we find that the domain gap and scarce training data further exacerbate these shortcomings, far more than for holistic patterns, which we call the local misalignment problem in CLIP-based CDFSL. To address this problem, given the lack of supervision for aligning local visual features with text semantics, we turn to self-supervised information. Inspired by the translation task, we propose the CC-CDFSL method with cycle consistency, which translates local visual features into text features and then translates them back into visual features (and vice versa), constraining the original features to stay close to the translated-back features. To reduce the noise introduced by the richer information in the visual modality, we further propose a Semantic Anchor mechanism, which first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mappings. Extensive experiments on various benchmarks, backbones, and fine-tuning methods show that our method can (1) effectively improve local vision-language alignment, (2) enhance the interpretability of learned patterns and model decisions by visualizing patches, and (3) achieve state-of-the-art performance.
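The abstract does not specify how "augment" and "shrink" are realized in the Semantic Anchor mechanism. The toy sketch below is purely illustrative and assumes Gaussian jitter for the augmentation step and interpolation toward the feature mean for the shrinking step; the paper's actual operations may differ:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 5
feats = rng.normal(size=(n, d))  # toy image features

def augment(v, k=3, sigma=0.05):
    """Enlarge the visual corpus for the text-to-image mapping by
    adding k jittered copies of each feature (assumed augmentation)."""
    return np.concatenate([v + sigma * rng.normal(size=v.shape)
                           for _ in range(k)], axis=0)

def shrink(v, alpha=0.5):
    """Pull image features toward their mean (a stand-in 'anchor') to
    damp irrelevant variation before the image-to-text mapping."""
    anchor = v.mean(axis=0, keepdims=True)
    return alpha * v + (1 - alpha) * anchor

aug = augment(feats)   # 3x larger corpus for text -> image
sh = shrink(feats)     # reduced variance for image -> text
```

The two steps pull in opposite directions by design: augmentation broadens the target corpus for one mapping direction, while shrinking tightens the source features for the other.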