Multimodal Learning · Relevance: 9/10

Beyond Language Modeling: An Exploration of Multimodal Pretraining

Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, Naila Murray, Marjan Ghazvininejad, Mike Lewis, Nicolas Ballas, Amir Bar, Michael Rabbat, Jakob Verbeek, Luke Zettlemoyer, Koustuv Sinha, Yann LeCun, Saining Xie
arXiv: 2603.03276v1 · Published: 2026-03-03 · Updated: 2026-03-03

AI Summary

Studies native multimodal models trained from scratch, showing that visual and language data are complementary and that vision is significantly more data-hungry than language.

Key Contributions

  • Proposes the Representation Autoencoder (RAE) as a unified visual representation
  • Shows that visual and language data are complementary and yield synergy for downstream capabilities
  • Finds that a Mixture-of-Experts (MoE) architecture scales multimodal models efficiently and accommodates the asymmetry between visual and language data (see the sketch after this list)
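
The abstract credits MoE with naturally inducing modality specialization: when text and vision tokens flow through the same layer, the router can learn to send each modality to different experts. As a rough illustration only (the paper's actual architecture is not reproduced here), a minimal top-1 token-choice MoE layer might look like the following; all class and parameter names are hypothetical:

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    """Hypothetical minimal top-1 MoE layer, for illustration only."""

    def __init__(self, dim: int, num_experts: int = 8, expansion: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(dim, expansion * dim),
                nn.GELU(),
                nn.Linear(expansion * dim, dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim), with text and vision tokens mixed in one batch.
        gate = self.router(x).softmax(dim=-1)   # (tokens, num_experts)
        weight, idx = gate.max(dim=-1)          # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                # Scale each expert's output by its gate weight.
                out[mask] = weight[mask, None] * expert(x[mask])
        return out
```

Under such a scheme, modality specialization would show up as text tokens and vision tokens consistently routing to disjoint expert subsets, though the paper's exact routing and capacity settings may differ.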

Methodology

Adopts the Transfusion framework, handling language with next-token prediction and vision with diffusion, in controlled from-scratch pretraining experiments.
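
To make the two objectives concrete, here is a minimal sketch of a Transfusion-style combined loss. It assumes a hypothetical `model` exposing `language_logits` and `denoise` methods; the noising schedule and loss weighting are simplified placeholders, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def transfusion_loss(model, text_tokens, vision_latents, lambda_vis=1.0):
    """Combined objective sketch: NTP on text, diffusion on visual latents."""
    # --- Language branch: next-token prediction ---
    # text_tokens: (B, T) integer ids; predict token t+1 from tokens <= t.
    logits = model.language_logits(text_tokens[:, :-1])  # (B, T-1, vocab)
    lm_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        text_tokens[:, 1:].reshape(-1),
    )

    # --- Vision branch: diffusion (noise-prediction) loss on latents ---
    # vision_latents: (B, N, D) latent tokens; simple linear noising
    # schedule used purely for illustration.
    noise = torch.randn_like(vision_latents)
    t = torch.rand(vision_latents.size(0), device=vision_latents.device)
    noisy = (1 - t).view(-1, 1, 1) * vision_latents + t.view(-1, 1, 1) * noise
    pred_noise = model.denoise(noisy, t)
    diff_loss = F.mse_loss(pred_noise, noise)

    return lm_loss + lambda_vis * diff_loss
```

The key design point this sketch captures is that a single model is optimized against both losses in one training run, so no separate language-only pretraining stage interferes with the multimodal analysis.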

Original Abstract

The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
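
The abstract reports per-modality scaling laws derived from IsoFLOP analysis but does not state their functional form here. A standard Chinchilla-style parameterization, given purely as an assumption, would fit the loss of each modality $m$ as

$$
\mathcal{L}_m(N, D) = E_m + \frac{A_m}{N^{\alpha_m}} + \frac{B_m}{D^{\beta_m}}, \qquad C \approx 6ND,
$$

where $N$ is the parameter count, $D$ the number of training tokens, and $C$ the training compute budget. Under this assumed form, the reported asymmetry ("vision is significantly more data-hungry than language") would correspond to the compute-optimal tokens-per-parameter ratio $D^*/N^*$ being larger for the vision loss than for the language loss.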

Tags

Multimodal Learning · Pretraining · Vision-Language Models · Scaling Laws · MoE

arXiv Categories

cs.CV