Multimodal Learning · Relevance: 9/10

3D-DRES: Detailed 3D Referring Expression Segmentation

Qi Chen, Changli Wu, Jiayi Ji, Yiwei Ma, Liujuan Cao
arXiv: 2603.02896v1 Published: 2026-03-03 Updated: 2026-03-03

AI Summary

Proposes Detailed 3D Referring Expression Segmentation (3D-DRES), a new task, together with the DetailRefer dataset and the DetailBase baseline model.

Key Contributions

  • Proposed the 3D-DRES task, which maps each noun phrase to its corresponding 3D instance (see the annotation sketch after this list)
  • Constructed the DetailRefer dataset, comprising 54,432 descriptions
  • Designed DetailBase, a baseline model that supports segmentation at both the sentence and phrase levels
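The phrase-to-instance mapping at the core of 3D-DRES can be pictured as a simple annotation record. The sketch below is hypothetical: the field names and offsets are illustrative assumptions, not the paper's actual DetailRefer schema.

```python
from dataclasses import dataclass, field

@dataclass
class PhraseSpan:
    """A noun phrase inside the referring expression, mapped to a 3D instance."""
    text: str         # e.g. "the wooden chair"
    start: int        # character offset of the phrase within the description
    end: int
    instance_id: int  # ID of the 3D instance this phrase refers to

@dataclass
class DetailReferSample:
    """One hypothetical DetailRefer-style annotation record."""
    scene_id: str                   # which 3D scene the description refers to
    description: str                # full sentence-level referring expression
    target_instance_id: int         # target for sentence-level segmentation
    phrase_spans: list[PhraseSpan] = field(default_factory=list)

# A toy record: one sentence-level target plus two phrase-level mappings.
sample = DetailReferSample(
    scene_id="scene0000_00",
    description="the wooden chair next to the round table",
    target_instance_id=7,
    phrase_spans=[
        PhraseSpan("the wooden chair", 0, 16, instance_id=7),
        PhraseSpan("the round table", 25, 40, instance_id=12),
    ],
)
```

Note how the sentence-level annotation of traditional 3D-RES is preserved (target_instance_id), while the phrase spans add the finer-grained supervision that distinguishes 3D-DRES.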

Methodology

Constructed a dataset with explicit phrase-instance correspondences, designed DetailBase as a baseline model with dual-mode (sentence- and phrase-level) segmentation, and validated both through experiments.
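A minimal sketch of what a dual-mode segmentation head might look like, assuming per-point scene features and one pooled text embedding per query (a sentence or a phrase). Module names and shapes here are assumptions for illustration, not the paper's DetailBase implementation.

```python
import torch
import torch.nn as nn

class DualModeSegHead(nn.Module):
    """Hypothetical dual-mode head: the same mask predictor is driven either
    by a sentence embedding or by per-phrase embeddings, so a single model
    serves both granularities."""

    def __init__(self, point_dim: int = 256, text_dim: int = 256):
        super().__init__()
        self.point_proj = nn.Linear(point_dim, text_dim)

    def forward(self, point_feats: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
        """
        point_feats: (N, point_dim) per-point scene features
        query_emb:   (Q, text_dim); Q = 1 for a sentence-level query,
                     Q = number of phrases for phrase-level queries
        returns:     (Q, N) per-point mask logits, one mask per query
        """
        p = self.point_proj(point_feats)  # (N, text_dim)
        return query_emb @ p.t()          # (Q, N) similarity logits

# Sentence mode yields one mask; phrase mode yields one mask per phrase.
head = DualModeSegHead()
points = torch.randn(2048, 256)
print(head(points, torch.randn(1, 256)).shape)  # torch.Size([1, 2048])
print(head(points, torch.randn(3, 256)).shape)  # torch.Size([3, 2048])
```

The design point this illustrates is that dual-mode support need not duplicate the architecture: only the query embeddings change between modes.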

Original Abstract

Current 3D visual grounding tasks only process sentence-level detection or segmentation, which critically fails to leverage the rich compositional contextual reasoning within natural language expressions. To address this challenge, we introduce Detailed 3D Referring Expression Segmentation (3D-DRES), a new task that provides a phrase-to-3D-instance mapping, aiming to enhance fine-grained 3D vision-language understanding. To support 3D-DRES, we present DetailRefer, a new dataset comprising 54,432 descriptions spanning 11,054 distinct objects. Unlike previous datasets, DetailRefer implements a pioneering phrase-instance annotation paradigm where each referenced noun phrase is explicitly mapped to its corresponding 3D elements. Additionally, we introduce DetailBase, a purposefully streamlined yet effective baseline architecture that supports dual-mode segmentation at both sentence and phrase levels. Our experimental results demonstrate that models trained on DetailRefer not only excel at phrase-level segmentation but also show surprising improvements on traditional 3D-RES benchmarks.

Tags

3D Visual Grounding · Referring Expression Segmentation · Vision-Language Understanding

arXiv Category

cs.CV