Lifting Unlabeled Internet-level Data for 3D Scene Understanding
AI Summary
Web videos are leveraged to automatically generate training data for 3D scene understanding, improving model performance.
Key Contributions
- Proposes a method for automatically generating 3D scene training data from unlabeled web videos
- Analyzes the bottlenecks in automated data generation and reveals the critical factors behind them
- Validates the effectiveness of the approach on perception tasks of different granularities
Methodology
A data engine is designed to automatically generate training data from web videos, and the approach is validated on multiple 3D scene understanding tasks.
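To make the pipeline concrete, below is a minimal, hypothetical sketch of what such a data engine might look like in Python. It is not the paper's implementation: every function (`sample_frames`, `estimate_geometry`, `detect_2d`, `lift_to_3d`) is a stub standing in for whatever frame sampler, reconstruction model, and 2D perception model one plugs in.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    """A single RGB frame sampled from a web video."""
    image_path: str

@dataclass
class PseudoLabel:
    """An automatically generated 3D annotation: category plus a 3D box (center/size in meters)."""
    category: str
    center: tuple
    size: tuple

def sample_frames(video_path: str, fps: float = 1.0) -> List[Frame]:
    # Hypothetical stub: in practice, decode the video (e.g., via ffmpeg/OpenCV)
    # and keep frames at a fixed rate to bound redundancy.
    return [Frame(image_path=f"{video_path}:frame{i}") for i in range(3)]

def estimate_geometry(frames: List[Frame]) -> dict:
    # Hypothetical stub for camera-pose and depth / point-cloud estimation,
    # e.g., an off-the-shelf SfM pipeline or a monocular depth model.
    return {"poses": [None] * len(frames), "points": []}

def detect_2d(frame: Frame) -> List[str]:
    # Hypothetical stub for a 2D detector/segmenter producing per-frame predictions.
    return ["chair", "table"]

def lift_to_3d(categories: List[str], geometry: dict) -> List[PseudoLabel]:
    # Hypothetical stub: back-project 2D predictions into the reconstructed
    # scene and merge them across frames into consistent 3D instances.
    return [PseudoLabel(c, center=(0.0, 0.0, 0.0), size=(1.0, 1.0, 1.0)) for c in categories]

def run_engine(video_path: str) -> List[PseudoLabel]:
    """Chain the stages: sample frames, recover geometry, lift 2D predictions to 3D."""
    frames = sample_frames(video_path)
    geometry = estimate_geometry(frames)
    labels: List[PseudoLabel] = []
    for frame in frames:
        labels.extend(lift_to_3d(detect_2d(frame), geometry))
    return labels

if __name__ == "__main__":
    print(run_engine("https://example.com/room_tour.mp4"))
```

Each stage of such a pipeline is a potential bottleneck of the kind the paper's analysis targets, though the abstract does not specify which stages it examines.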
Original Abstract
Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, facilitating end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.