Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
AI Summary
Leverages the implicit 3D priors in video generation models to improve MLLMs' spatial understanding.
Main Contributions
- Proposes the VEGA-3D framework, which repurposes a pre-trained video diffusion model as a latent world simulator.
- Fuses spatiotemporal features with semantic representations via a token-level adaptive gated fusion mechanism.
- Surpasses existing methods on multiple 3D scene-understanding tasks without requiring explicit 3D supervision.
Methodology
Spatiotemporal features are extracted from intermediate noise levels of the video diffusion model and fused with the MLLM's semantic representations, strengthening the model's geometric understanding.
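The token-level adaptive gated fusion described above can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: the shapes, the sigmoid gate parameterization, and the residual-style injection are all assumptions chosen to show the general idea of a per-token scalar gate deciding how much geometric signal to mix into each semantic token.

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_fusion(semantic, spatial, W_g, b_g):
    """Hypothetical token-level adaptive gated fusion sketch.

    semantic: (T, d) MLLM token embeddings.
    spatial:  (T, d) spatiotemporal features from the video diffusion
              model, assumed already projected to the MLLM dimension d.
    A sigmoid gate computed per token controls how much of the
    geometric (spatial) signal is injected into each semantic token.
    """
    # Gate input: per-token concatenation of both feature streams, (T, 2d)
    gate_in = np.concatenate([semantic, spatial], axis=-1)
    # Per-token scalar gate in (0, 1), shape (T, 1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ W_g + b_g)))
    # Residual-style fusion: semantic tokens enriched by gated spatial cues
    return semantic + gate * spatial

# Toy dimensions: 4 tokens, embedding size 8 (illustrative only)
T, d = 4, 8
semantic = rng.standard_normal((T, d))
spatial = rng.standard_normal((T, d))
W_g = rng.standard_normal((2 * d, 1)) * 0.1  # learned in practice
b_g = np.zeros(1)

fused = gated_fusion(semantic, spatial, W_g, b_g)
print(fused.shape)  # (4, 8)
```

Because the gate is bounded in (0, 1), each fused token stays between the original semantic token and the fully spatially-augmented one, which lets the model attenuate geometric cues on tokens where they are unhelpful.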
Original Abstract
While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.