Multimodal Learning · Relevance: 10/10

UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen, Yijia Fan, Qi Zhang, Yujiang Wang, Lili Qiu, Bo Li, Ziwei Liu, Caihua Shan, Yifan Yang, Yifei Shen
arXiv: 2603.03241v1 · Published: 2026-03-03 · Updated: 2026-03-03

AI Summary

This paper introduces UniG2U-Bench, a benchmark for assessing whether the generative capabilities of unified models actually improve multimodal understanding.

Main Contributions

  • Introduces the UniG2U-Bench benchmark, comprising 7 regime categories and 30 subtasks
  • Evaluates more than 30 models, revealing where unified models are weak and where they are strong
  • Analyzes how task structure, model architecture, and pretraining data affect generation-understanding coupling

Methodology

A benchmark covering multiple types of visual transformations is constructed, and a broad set of models is systematically evaluated and analyzed on it.
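As a rough illustration of how a benchmark organized into regimes and subtasks might be scored per model, the sketch below shows a plausible evaluation loop. All names here (Regime, Subtask, evaluate_model) are hypothetical and do not reflect the actual UniG2U-Bench code or data format.

```python
# Hypothetical sketch of a regime/subtask benchmark layout and its scoring loop.
# Illustrative only; not the UniG2U-Bench implementation.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Subtask:
    name: str
    examples: list                      # (image, question, answer) triples
    explicit_transform: bool            # explicit vs. implicit visual transformation

@dataclass
class Regime:
    name: str                           # e.g. "spatial intelligence", "visual illusions"
    subtasks: list[Subtask] = field(default_factory=list)

def evaluate_model(model: Callable, regimes: list[Regime]) -> dict[str, float]:
    """Return per-regime accuracy for one model under direct inference."""
    scores = {}
    for regime in regimes:
        correct = total = 0
        for subtask in regime.subtasks:
            for image, question, answer in subtask.examples:
                prediction = model(image, question)
                correct += int(prediction == answer)
                total += 1
        scores[regime.name] = correct / max(total, 1)
    return scores
```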

Original Abstract

Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.
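The abstract contrasts direct inference with Generate-then-Answer (GtA) inference. The following minimal sketch illustrates that distinction under an assumed unified-model interface with hypothetical generate_image and answer methods; it is not the paper's actual API or protocol.

```python
# Minimal sketch of the two inference modes contrasted in the abstract.
# model.generate_image and model.answer are assumed, illustrative methods.

def direct_inference(model, image, question):
    # Answer directly from the input image, with no generated intermediate.
    return model.answer(images=[image], question=question)

def generate_then_answer(model, image, question):
    # Generate-then-Answer (GtA): first produce an intermediate image that
    # realizes the visual transformation implied by the question, then answer
    # conditioned on both the original and the generated image.
    intermediate = model.generate_image(image=image, instruction=question)
    return model.answer(images=[image, intermediate], question=question)
```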

Tags

Multimodal Learning · Vision-Language Models · Benchmarking · Generation-to-Understanding

arXiv Categories

cs.CV cs.AI