Multimodal Learning Relevance: 9/10

LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

Jiachun Jin, Zetong Zhou, Xiao Yang, Hao Zhang, Pengfei Liu, Jun Zhu, Zhijie Deng
arXiv: 2604.02097v1 Published: 2026-04-02 Updated: 2026-04-02

AI Summary

LatentUM unifies multimodal representations in a shared latent space, enabling efficient cross-modal reasoning and generation while alleviating codec bias.

Main Contributions

  • Proposes LatentUM, a novel unified model.
  • Eliminates the dependence on pixel space as a bridge between visual understanding and generation.
  • Achieves state-of-the-art performance on tasks such as the Visual Spatial Planning benchmark.

Methodology

Builds a unified model around a shared semantic latent space: information from every modality is mapped into the same space, so pixel space is no longer needed as a bridge between understanding and generation.
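A minimal PyTorch sketch of this design (hypothetical module names and sizes, not the authors' released code): modality-specific encoders project text tokens and image features into one latent space, a single backbone attends over the interleaved sequence, and the generation head predicts the next latent rather than pixels, so its output can be fed straight back in for further reasoning.

```python
import torch
import torch.nn as nn

class SharedLatentUM(nn.Module):
    """Toy unified model: all modalities live in one semantic latent space."""
    def __init__(self, vocab_size=32000, patch_dim=768, d_latent=1024,
                 n_layers=4, n_heads=8):
        super().__init__()
        # Modality-specific encoders map into the SAME d_latent space.
        self.text_embed = nn.Embedding(vocab_size, d_latent)
        self.image_proj = nn.Linear(patch_dim, d_latent)
        layer = nn.TransformerEncoderLayer(d_model=d_latent, nhead=n_heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # The head emits the next latent, not pixels, so generated "visual
        # thoughts" stay in the semantic space and need no pixel decode.
        self.latent_head = nn.Linear(d_latent, d_latent)

    def forward(self, text_ids, image_patches):
        t = self.text_embed(text_ids)        # (B, T_text, d_latent)
        v = self.image_proj(image_patches)   # (B, T_img,  d_latent)
        x = torch.cat([t, v], dim=1)         # one interleaved sequence
        h = self.backbone(x)
        return self.latent_head(h[:, -1])    # next latent visual token

model = SharedLatentUM()
next_latent = model(torch.randint(0, 32000, (2, 16)),  # text token ids
                    torch.randn(2, 64, 768))           # image patch features
print(next_latent.shape)  # torch.Size([2, 1024])
```

The key point is the return value: because the output is itself a point in the shared latent space, it can be appended to the input sequence and reasoned over again, with a pixel decoder invoked only when a human-viewable image is actually needed.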

Original Abstract

Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.
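The world-modeling use case described in the abstract can be illustrated with a standalone rollout sketch (all components here are hypothetical stand-ins, not the paper's architecture): future visual states are predicted step by step inside the latent space under action inputs, and pixels are decoded only once at the end instead of after every step.

```python
import torch
import torch.nn as nn

d_latent = 1024
predictor = nn.GRUCell(d_latent, d_latent)        # stand-in next-state model
pixel_decoder = nn.Linear(d_latent, 3 * 64 * 64)  # used ONLY when displaying

state = torch.randn(1, d_latent)   # latent of the current visual state
hidden = torch.zeros(1, d_latent)
for step in range(5):              # stepwise action interventions
    action = torch.randn(1, d_latent)          # latent-encoded action
    hidden = predictor(state + action, hidden)
    state = hidden                 # stay in latent space between steps
frame = pixel_decoder(state).view(1, 3, 64, 64)  # single decode at the end
print(frame.shape)  # torch.Size([1, 3, 64, 64])
```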

Tags

Multimodal Learning, Unified Model, Cross-Modal Reasoning, Visual Generation

arXiv Categories

cs.CV cs.LG