DetPO: In-Context Learning with Multi-Modal LLMs for Few-Shot Object Detection
AI Summary
DetPO introduces a black-box prompt optimization method that improves MLLM performance on few-shot object detection.
Key Contributions
- Proposes DetPO, a gradient-free prompt optimization method.
- DetPO optimizes text prompts by maximizing detection accuracy while calibrating prediction confidence.
- Experiments show DetPO outperforms prior black-box methods on the Roboflow20-VL and LVIS datasets.
Methodology
DetPO is a black-box optimization method: it iteratively refines text prompts by maximizing detection accuracy on a small number of visual training examples, while also calibrating the model's prediction confidence.
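The iterative loop above can be sketched as a generic gradient-free hill climb: propose an edited prompt, score it with a black-box detection-accuracy metric on the few-shot examples, and keep the edit only if the score improves. This is a minimal illustration, not the paper's actual algorithm; `propose`, `score`, and the toy keyword-matching scorer below are all hypothetical stand-ins (in DetPO, scoring would call the MLLM detector and compute accuracy plus a calibration term).

```python
import random

def optimize_prompt(initial_prompt, propose, score, iterations=50, seed=0):
    """Gradient-free (black-box) hill climbing over text prompts.

    propose(prompt, rng) -> a mutated candidate prompt (black-box edit).
    score(prompt)        -> scalar quality, e.g. detection accuracy on
                            the few-shot visual training examples.
    """
    rng = random.Random(seed)
    best_prompt, best_score = initial_prompt, score(initial_prompt)
    for _ in range(iterations):
        candidate = propose(best_prompt, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # keep only improving edits
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score

# Toy stand-in for a detection scorer: reward prompts that mention
# target keywords (in practice this would be mAP from the MLLM).
TARGET = {"rust", "corrosion", "metal"}
VOCAB = ["rust", "corrosion", "metal", "blue", "sky"]

def toy_score(prompt):
    return len(set(prompt.split()) & TARGET) / len(TARGET)

def toy_propose(prompt, rng):
    return prompt + " " + rng.choice(VOCAB)

best_prompt, best_score = optimize_prompt("detect", toy_propose, toy_score)
```

Because candidates are accepted only when the score strictly improves, the loop is monotone: the final prompt never scores worse than the initial one, which makes it safe to run against a noisy or expensive black-box detector with a small iteration budget.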
Original Abstract
Multi-Modal LLMs (MLLMs) demonstrate strong visual grounding capabilities on popular object detection benchmarks like OdinW-13 and RefCOCO. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. While in-context prompting is a common strategy to improve performance across diverse tasks, we find that it often yields lower detection accuracy than prompting with class names alone. This suggests that current MLLMs cannot yet effectively leverage few-shot visual examples and rich textual descriptions for object detection. Since frontier MLLMs are typically only accessible via APIs, and state-of-the-art open-weights models are prohibitively expensive to fine-tune on consumer-grade hardware, we instead explore black-box prompt optimization for few-shot object detection. To this end, we propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. Our proposed approach yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box approaches by up to 9.7%. Our code is available at https://github.com/ggare-cmu/DetPO