Multimodal Learning (Relevance: 9/10)

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Chongjie Ye, Cheng Cao, Chuanyu Pan, Yiming Hao, Yihao Zhi, Yuanming Hu, Xiaoguang Han
arXiv: 2604.02289v1 Published: 2026-04-02 Updated: 2026-04-02

AI Summary

Omni123 unifies text-to-2D and text-to-3D generation, leveraging abundant 2D data to improve 3D modeling.

Key Contributions

  • Proposes Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation
  • Introduces cross-modal consistency between images and 3D as an implicit structural constraint
  • Proposes an interleaved X-to-X training paradigm that exploits data without fully aligned text-image-3D triplets

Methodology

A single autoregressive framework represents text, images, and 3D as discrete tokens, uses abundant 2D data as a geometric prior for 3D representations, and trains jointly across modalities.
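
The token-level unification can be pictured with a minimal sketch (not the authors' code): each modality is tokenized separately, the token ids are offset into disjoint ranges of one shared vocabulary, and a single decoder-only transformer is trained with the usual next-token objective over the concatenated sequence. Vocabulary sizes, special tokens, and model dimensions below are illustrative assumptions.

```python
# Minimal sketch of a shared discrete-token space (illustrative assumptions only):
# text, image, and 3D tokens from separate tokenizers are offset into disjoint
# ranges of one vocabulary so a single causal transformer can model them jointly.
import torch
import torch.nn as nn

TEXT_VOCAB, IMG_VOCAB, SHAPE_VOCAB = 32000, 8192, 8192          # assumed sizes
SPECIALS = ["<boi>", "<eoi>", "<bos3d>", "<eos3d>"]             # assumed markers
VOCAB = TEXT_VOCAB + IMG_VOCAB + SHAPE_VOCAB + len(SPECIALS)

def special(tok):
    """Id of a special marker, placed after all modality ranges."""
    return TEXT_VOCAB + IMG_VOCAB + SHAPE_VOCAB + SPECIALS.index(tok)

def to_shared_ids(text_ids, image_ids, shape_ids):
    """Concatenate one text -> image -> 3D example into a single token sequence."""
    seq = (
        list(text_ids)
        + [special("<boi>")] + [i + TEXT_VOCAB for i in image_ids] + [special("<eoi>")]
        + [special("<bos3d>")] + [i + TEXT_VOCAB + IMG_VOCAB for i in shape_ids]
        + [special("<eos3d>")]
    )
    return torch.tensor(seq)

class UnifiedAR(nn.Module):
    """One causal transformer over the shared text/image/3D token space."""
    def __init__(self, dim=64, layers=2, heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, ids):                                   # ids: (B, T)
        T = ids.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(ids), mask=causal)
        return self.head(h)                                   # next-token logits

# Toy usage: a text prompt, then image tokens, then 3D tokens, in one sequence.
seq = to_shared_ids([5, 17, 42], [3, 9], [7, 1]).unsqueeze(0)
print(UnifiedAR()(seq).shape)   # torch.Size([1, 11, 48388])
```

Because every modality lives in the same sequence space, 2D-heavy training signals and 3D signals are absorbed by the same weights, which is how the abundant 2D data can act as a geometric prior for the scarcer 3D tokens.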

Original Abstract

Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.
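
As a rough illustration of the interleaved X-to-X idea (a sketch under assumptions, not the paper's data pipeline), each training record can be tagged with whichever modalities it actually carries, and a cross-modal chain such as the text-to-image-to-3D-to-image cycle is used only when every modality in that chain is available, so fully aligned text-image-3D triplets are never required. The record layout and chain set below are hypothetical.

```python
# Hedged sketch of interleaved X-to-X task selection over heterogeneous pairs.
def supported_chains(record):
    """Return the cross-modal chains this record can supervise."""
    chains = {
        ("text", "image", "shape", "image"): {"text", "image", "shape"},  # full cycle
        ("text", "image"): {"text", "image"},
        ("text", "shape"): {"text", "shape"},
        ("image", "shape"): {"image", "shape"},
        ("shape", "image"): {"image", "shape"},
    }
    have = set(record)
    return [chain for chain, needed in chains.items() if needed <= have]

# Heterogeneous data: 2D-only pairs, text-3D pairs, image-3D pairs, rare triplets.
records = [
    {"text": "a wooden chair", "image": "img_001"},
    {"text": "a ceramic mug", "shape": "mesh_007"},
    {"image": "img_042", "shape": "mesh_042"},
    {"text": "a toy car", "image": "img_099", "shape": "mesh_099"},
]
for r in records:
    print(sorted(r), "->", supported_chains(r))
```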

Tags

3D Generation · Multimodal Learning · Text-to-3D · Autoregressive Models

arXiv Categories

cs.CV cs.AI