Multimodal Learning Relevance: 8/10

A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification

Qi You, Yitai Cheng, Zichao Zeng, James Haworth
arXiv: 2602.16590v1 Published: 2026-02-18 Updated: 2026-02-18

AI Summary

Proposes CLIP-MHAdapter, a lightweight attention-based adaptation method for CLIP, targeting street-view image attribute classification.

Key Contributions

  • Proposes the CLIP-MHAdapter model
  • Achieves new state-of-the-art results on the Global StreetScapes dataset
  • Low computational cost with high performance

Methodology

On top of CLIP, a bottleneck MLP equipped with multi-head self-attention is appended to model dependencies between patch tokens.
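The adapter idea above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the module name, hyperparameters (embedding width, bottleneck size, head count), and the residual placement are assumptions chosen to show the general pattern of attention over patch tokens followed by a bottleneck MLP.

```python
import torch
import torch.nn as nn

class MHAdapterSketch(nn.Module):
    """Illustrative sketch of a multi-head-attention adapter over CLIP
    patch tokens. Dimensions and structure are assumptions, not taken
    from the CLIP-MHAdapter paper."""

    def __init__(self, dim: int = 768, bottleneck: int = 64, heads: int = 8):
        super().__init__()
        # Self-attention over patch tokens to model inter-patch dependencies
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Bottleneck MLP: compress, nonlinearity, expand back to CLIP width
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim) from a frozen CLIP vision encoder
        attended, _ = self.attn(tokens, tokens, tokens)
        adapted = self.up(self.act(self.down(attended)))
        # Residual connection preserves the pre-trained representation
        return tokens + adapted

# Example: a 7x7 patch grid (49 tokens) at width 768
x = torch.randn(2, 49, 768)
y = MHAdapterSketch()(x)  # same shape as the input tokens
```

Only the adapter's parameters would be trained; the CLIP backbone stays frozen, which is what keeps the trainable-parameter count small.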

Original Abstract

Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.

Tags

Street-View Image Classification CLIP Attention Mechanism Lightweight Model

arXiv Categories

cs.CV cs.AI cs.LG