Multimodal Learning Relevance: 9/10

3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting

Wancai Zheng, Hao Chen, Xianlong Lu, Linlin Ou, Xinyi Yu
arXiv: 2602.12159v1 Published: 2026-02-12 Updated: 2026-02-12

AI Summary

Proposes 3DGSNav, which leverages 3D Gaussian Splatting to enhance the spatial reasoning of vision-language models for object navigation.

Key Contributions

  • Embeds 3D Gaussian Splatting as persistent memory for VLMs
  • Designs structured visual prompts combined with CoT prompting
  • Verifies targets via real-time object detection and VLM-driven viewpoint switching
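The second contribution — pairing a structured visual prompt with Chain-of-Thought prompting — can be sketched as a simple prompt builder. The template wording and function name below are assumptions for illustration, not the authors' actual prompt:

```python
# Hypothetical sketch of combining a structured visual prompt with
# Chain-of-Thought (CoT) prompting. The template is an assumption,
# not the prompt used in the paper.

def build_cot_prompt(target: str, view_annotations: list[str]) -> str:
    """Compose a structured textual prompt over annotated rendered views."""
    # Enumerate each rendered frontier view with its annotation label,
    # so the VLM can refer to views by index.
    visual_context = "\n".join(
        f"View {i}: {label}" for i, label in enumerate(view_annotations)
    )
    return (
        f"Target object: {target}\n"
        f"Annotated views:\n{visual_context}\n"
        "Let's reason step by step about which view most likely "
        "leads toward the target, then answer with a single view index."
    )
```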

Methodology

Builds a 3DGS representation of the environment through active perception, performs trajectory-guided free-viewpoint rendering, and combines visual prompts with CoT prompting to improve VLM reasoning.
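The pipeline above can be condensed into a minimal decision-loop sketch: incrementally build scene memory, render frontier views, filter with a detector, and re-verify with a VLM. All class and function names here are hypothetical stubs, not the authors' API:

```python
# Hypothetical sketch of the 3DGSNav decision loop. Gaussian optimization,
# rendering, detection, and VLM calls are stubbed; only the control flow
# described in the summary is modeled.

from dataclasses import dataclass, field

@dataclass
class GaussianSplatMemory:
    """Persistent 3DGS scene memory, incrementally updated from observations."""
    frames: list = field(default_factory=list)

    def integrate(self, rgbd_frame: dict) -> None:
        # The real system would optimize Gaussians here; we just store frames.
        self.frames.append(rgbd_frame)

    def render_frontier_views(self, trajectory: list) -> list:
        # Trajectory-guided free-viewpoint rendering of frontier-aware
        # first-person views (stubbed: return the most recent frames).
        return self.frames[-len(trajectory):]

def detector_filter(views: list, target: str) -> list:
    # Real-time object detector filtering potential targets (stub).
    return [v for v in views if target in v.get("labels", [])]

def vlm_verify(view: dict, target: str) -> bool:
    # VLM-driven re-verification via active viewpoint switching (stub).
    return view.get("confidence", 0.0) > 0.5

def navigate(observations: list, trajectory: list, target: str):
    """Return the first re-verified view containing the target, or None."""
    memory = GaussianSplatMemory()
    for obs in observations:
        memory.integrate(obs)
    candidates = detector_filter(memory.render_frontier_views(trajectory), target)
    # Re-verify each detector candidate before committing to a goal.
    return next((v for v in candidates if vlm_verify(v, target)), None)
```

In the full system the detector runs online and viewpoint switching triggers fresh renders; the loop above only illustrates the detect-then-verify ordering.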

Original Abstract

Object navigation is a core capability of embodied intelligence, enabling an agent to locate target objects in unknown environments. Recent advances in vision-language models (VLMs) have facilitated zero-shot object navigation (ZSON). However, existing methods often rely on scene abstractions that convert environments into semantic maps or textual representations, causing high-level decision making to be constrained by the accuracy of low-level perception. In this work, we present 3DGSNav, a novel ZSON framework that embeds 3D Gaussian Splatting (3DGS) as persistent memory for VLMs to enhance spatial reasoning. Through active perception, 3DGSNav incrementally constructs a 3DGS representation of the environment, enabling trajectory-guided free-viewpoint rendering of frontier-aware first-person views. Moreover, we design structured visual prompts and integrate them with Chain-of-Thought (CoT) prompting to further improve VLM reasoning. During navigation, a real-time object detector filters potential targets, while VLM-driven active viewpoint switching performs target re-verification, ensuring efficient and reliable recognition. Extensive evaluations across multiple benchmarks and real-world experiments on a quadruped robot demonstrate that our method achieves robust and competitive performance against state-of-the-art approaches. Project page: https://aczheng-cai.github.io/3dgsnav.github.io/

Tags

3D Gaussian Splatting · Vision-Language Models · Object Navigation · Chain-of-Thought · Active Perception

arXiv Categories

cs.RO cs.AI