Multimodal Learning Relevance: 8/10

X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models

Yueen Ma, Irwin King
arXiv: 2603.09632v1 Published: 2026-03-10 Updated: 2026-03-10

AI Summary

The X-GS framework unifies 3DGS architectures and empowers multimodal models, enabling real-time, semantically enriched online SLAM.

Key Contributions

  • Proposed the X-GS framework, which unifies a broad range of 3DGS techniques.
  • Designed X-GS-Perceiver, which efficiently co-optimizes geometry and poses while distilling semantic features from vision foundation models.
  • Leveraged X-GS-Thinker to feed semantic 3D Gaussians into vision-language models for downstream tasks.

Methodology

Real-time performance is achieved through an online Vector Quantization module, GPU-accelerated grid sampling, and a highly parallelized pipeline design; the resulting semantic Gaussians are then integrated into vision-language models.
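The online Vector Quantization step could work roughly as follows: high-dimensional semantic features distilled from a vision foundation model are mapped to compact codebook indices, so each 3D Gaussian stores a small integer code instead of a full feature vector. This is a minimal NumPy sketch under assumed design choices; the codebook size, distance metric, and EMA-style update rule are illustrative assumptions, not details from the paper.

```python
import numpy as np

class OnlineVQ:
    """Hypothetical online vector quantizer: assigns each incoming
    semantic feature to its nearest codebook entry, then nudges the
    matched entries toward the assigned features (online update)."""

    def __init__(self, dim, num_codes, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        # Randomly initialized codebook of `num_codes` prototype vectors.
        self.codebook = rng.normal(size=(num_codes, dim)).astype(np.float32)
        self.lr = lr  # assumed update rate for matched codes

    def quantize(self, feats):
        # feats: (N, dim) batch of per-Gaussian semantic features.
        # Nearest-neighbour assignment by squared Euclidean distance.
        d = ((feats[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)
        # Online update: pull each matched code toward the mean of
        # the features assigned to it in this batch.
        for k in np.unique(idx):
            mean_feat = feats[idx == k].mean(axis=0)
            self.codebook[k] += self.lr * (mean_feat - self.codebook[k])
        return idx  # compact per-Gaussian semantic codes

# Usage: compress 128 features of dimension 64 into 16 codes.
vq = OnlineVQ(dim=64, num_codes=16)
feats = np.random.default_rng(1).normal(size=(128, 64)).astype(np.float32)
codes = vq.quantize(feats)
```

Storing an index per Gaussian rather than a 64-dimensional (or larger) feature vector is what makes real-time semantic enrichment plausible at scale; the paper's actual module likely differs in its update schedule and GPU implementation.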

Original Abstract

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods are isolated, focusing on specific domains such as online SLAM, semantic enrichment, or 3DGS for unposed images. In this paper, we introduce X-GS, an extensible open framework that unifies a broad range of techniques to enable real-time 3DGS-based online SLAM enriched with semantics, bridging the gap to downstream multimodal models. At the core of X-GS is a highly efficient pipeline called X-GS-Perceiver, capable of taking unposed RGB (or optionally RGB-D) video streams as input to co-optimize geometry and poses, and distill high-dimensional semantic features from vision foundation models into the 3D Gaussians. We achieve real-time performance through a novel online Vector Quantization (VQ) module, a GPU-accelerated grid-sampling scheme, and a highly parallelized pipeline design. The semantic 3D Gaussians can then be utilized by vision-language models within the X-GS-Thinker component, enabling downstream tasks such as object detection, zero-shot caption generation, and potentially embodied tasks. Experimental results on real-world datasets showcase the efficacy, efficiency, and newly unlocked multimodal capabilities of the X-GS framework.

Tags

3D Gaussian Splatting SLAM Multimodal Learning Vision-Language Models

arXiv Categories

cs.CV cs.CL