Multimodal Learning 相关度: 9/10

Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection

Xu Zhang, Zhe Chen, Jing Zhang, Dacheng Tao

arXiv: 2603.24166v1 发布: 2026-03-25 更新: 2026-03-25

下载 PDF arXiv 页面

AI 摘要

提出HeROD框架，通过注入启发式推理先验，提升数据稀缺场景下指代表对象检测的效率。

主要贡献

提出De-ROD任务，用于评估低数据量下的ROD性能
提出HeROD框架，注入空间和语义推理先验
证明HeROD在数据稀缺场景下优于现有方法

方法论

在DETR架构中，通过启发式方法提取空间和语义信息，在提案排序、预测融合和匈牙利匹配等阶段注入先验知识。

原文摘要

Most referring object detection (ROD) models, especially the modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, would face severe label scarcity. In such regimes, end-to-end grounding detectors need to learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: Can explicit reasoning priors help models learn more efficiently when data is scarce? To explore this, we first introduce a Data-efficient Referring Object Detection (De-ROD) task, which is a benchmark protocol for measuring ROD performance in low-data and few-shot settings. We then propose the HeROD (Heuristic-inspired ROD), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, which are interpretable signals derived based on the referring phrase, into 3 stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors promise to improve label efficiency and convergence performance. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong grounding baselines in scarce-label regimes. More broadly, our results suggest that integrating simple, interpretable reasoning priors provides a practical and extensible path toward better data-efficient vision-language understanding.

arXiv 分类

cs.CV

AI 摘要

主要贡献

方法论

原文摘要

标签

arXiv 分类