Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
AI Summary
Proposes the MAPG framework, which decomposes complex instructions into subtasks to improve VLM performance on vision-language navigation under metric constraints.
Key Contributions
- Proposes MAPG, a framework that decomposes complex language instructions
- Designs the MAPG-Bench benchmark for evaluating metric-semantic goal grounding
- Validates MAPG's transfer capability on a physical robot
Methodology
MAPG decomposes a language query into subcomponents, grounds each component semantically with a VLM, and then probabilistically composes the grounded outputs into decisions that satisfy the metric constraints.
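The decompose–ground–compose flow can be illustrated with a toy sketch. Everything below is an assumption for illustration, not the paper's implementation: the VLM grounding call is replaced by a hand-coded Gaussian belief over a 2D grid, and the grid size, cell size, and composition kernel are invented. A query like "two meters to the right of the fridge" is split into (reference, relation, distance); the reference belief and the metric relation are composed probabilistically into a goal distribution.

```python
import numpy as np

GRID = 20   # hypothetical 20 x 20 occupancy grid
CELL = 0.5  # each cell is 0.5 m

def ground_reference(center):
    """Stand-in for a VLM grounding call: a Gaussian belief over
    the reference object's (e.g. the fridge's) grid location."""
    ys, xs = np.mgrid[0:GRID, 0:GRID]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    p = np.exp(-d2 / 4.0)
    return p / p.sum()

def compose_metric_relation(ref_map, direction, dist_m):
    """Score each cell by how well its offset from the (uncertain)
    reference matches 'dist_m metres in `direction`', marginalizing
    over the reference belief."""
    dist_cells = dist_m / CELL
    ys, xs = np.mgrid[0:GRID, 0:GRID]
    out = np.zeros((GRID, GRID))
    for ry in range(GRID):
        for rx in range(GRID):
            if ref_map[ry, rx] < 1e-6:
                continue  # skip negligible reference hypotheses
            dx, dy = xs - rx, ys - ry
            along = dx * direction[0] + dy * direction[1]
            perp = dx * direction[1] - dy * direction[0]
            score = np.exp(-((along - dist_cells) ** 2 + perp ** 2) / 2.0)
            out += ref_map[ry, rx] * score
    return out / out.sum()

ref = ground_reference(center=(5, 10))            # fridge belief
goal = compose_metric_relation(ref, (1, 0), 2.0)  # "2 m to the right"
best = tuple(int(i) for i in np.unravel_index(goal.argmax(), goal.shape))
print(best)  # → (10, 9): row y=10, column x=9, i.e. 4 cells right of the fridge
```

The composition marginalizes over where the reference object might be, so uncertainty in the semantic grounding propagates into the metric goal distribution rather than being collapsed to a point estimate.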
Original Abstract
Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision-language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.