Multimodal Learning Relevance: 9/10

Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation

Reyhaneh Ahani Manghotay, Jie Liang
arXiv: 2604.01118v1 Published: 2026-04-01 Updated: 2026-04-01

AI Summary

MoA-DepthCLIP efficiently transfers CLIP knowledge to monocular depth estimation through a lightweight Mixture-of-Adapters module and selective fine-tuning.

Key Contributions

  • Proposes a lightweight Mixture-of-Adapters (MoA) module
  • Combines depth-bin classification with direct regression in a hybrid prediction architecture
  • Designs a composite loss function incorporating geometric constraints
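The hybrid prediction idea above — classifying over discretized depth bins while also regressing depth directly — can be sketched as follows. This is a minimal NumPy illustration: the uniform bin layout, the fusion weight `alpha`, and all function names are assumptions for clarity, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_depth(bin_logits, reg_depth, d_min=0.1, d_max=10.0, alpha=0.5):
    """Fuse depth-bin classification with direct regression.

    bin_logits: (..., K) classification logits over K depth bins.
    reg_depth:  (...,)   directly regressed depth.
    alpha:      fusion weight (assumed; the paper's scheme may differ).
    """
    k = bin_logits.shape[-1]
    # Uniformly spaced bin centers in [d_min, d_max] (layout is an assumption).
    centers = np.linspace(d_min, d_max, k)
    probs = softmax(bin_logits)
    # Expected depth under the bin distribution: soft classification -> depth.
    cls_depth = (probs * centers).sum(axis=-1)
    return alpha * cls_depth + (1 - alpha) * reg_depth
```

With uniform logits, the classification branch falls back to the mean of the bin centers, so the blend degrades gracefully when the classifier is uncertain.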

Methodology

The MoA module is integrated into a pretrained ViT and guided by a global semantic context vector; depth estimation is then optimized through the hybrid prediction architecture and the composite loss function.
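A toy sketch of the gating described above: a global semantic context vector produces mixture weights over several bottleneck adapters, whose gated outputs are added residually to the ViT tokens. All dimensions, the ReLU bottleneck, and the softmax gating form are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MixtureOfAdapters:
    """Toy Mixture-of-Adapters: a semantic context vector gates several
    bottleneck adapters (down-project, nonlinearity, up-project)."""

    def __init__(self, dim=768, bottleneck=32, n_adapters=4):
        self.down = [rng.normal(0, 0.02, (dim, bottleneck)) for _ in range(n_adapters)]
        self.up = [rng.normal(0, 0.02, (bottleneck, dim)) for _ in range(n_adapters)]
        self.gate_w = rng.normal(0, 0.02, (dim, n_adapters))

    def __call__(self, tokens, context):
        # tokens: (T, dim) ViT tokens; context: (dim,) global semantic vector.
        gates = softmax(context @ self.gate_w)          # (n_adapters,), sums to 1
        out = tokens.copy()
        for g, d, u in zip(gates, self.down, self.up):
            h = np.maximum(tokens @ d, 0.0)             # bottleneck + ReLU
            out += g * (h @ u)                          # gated residual update
        return out
```

Because each adapter is a small bottleneck and the backbone stays frozen, the trainable parameter count stays tiny relative to full fine-tuning.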

Original Abstract

Leveraging the rich semantic features of vision-language models (VLMs) like CLIP for monocular depth estimation tasks is a promising direction, yet often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone combined with selective fine-tuning of the final layers. This design enables spatially-aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that synergizes depth bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results, significantly outperforming the DepthCLIP baseline by improving the $δ_1$ accuracy from 0.390 to 0.745 and reducing the RMSE from 1.176 to 0.520. These results are achieved while requiring substantially fewer trainable parameters, demonstrating that lightweight, prompt-guided MoA is a highly effective strategy for transferring VLM knowledge to fine-grained monocular depth estimation tasks.
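The two figures quoted above use the standard monocular depth metrics: $δ_1$ is the fraction of pixels whose prediction-to-ground-truth ratio (taken in the worse direction) is below 1.25, and RMSE is the root-mean-square depth error. A minimal NumPy version:

```python
import numpy as np

def delta1(pred, gt, thresh=1.25):
    # Fraction of pixels whose ratio error max(pred/gt, gt/pred) is within thresh.
    ratio = np.maximum(pred / gt, gt / pred)
    return float((ratio < thresh).mean())

def rmse(pred, gt):
    # Root-mean-square error over all pixels.
    return float(np.sqrt(np.mean((pred - gt) ** 2)))
```

Under these definitions, moving $δ_1$ from 0.390 to 0.745 means the share of pixels within 25% relative error nearly doubles.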

Tags

Monocular Depth Estimation CLIP Mixture-of-Adapters Vision-Language Models

arXiv Categories

cs.CV cs.AI cs.LG