Multimodal Learning Relevance: 9/10

UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models

Cheng Yang, Chufan Shi, Bo Shui, Yaokang Wu, Muzi Tao, Huijuan Wang, Ivan Yee Lee, Yong Liu, Xuezhe Ma, Taylor Berg-Kirkpatrick
arXiv: 2602.08336v1 发布: 2026-02-09 更新: 2026-02-09

AI Summary

The paper introduces the UReason benchmark, which reveals a Reasoning Paradox in how reasoning affects visual synthesis in unified multimodal models.

Main Contributions

  • Introduces the UReason benchmark, comprising 2,000 instances across five reasoning task families.
  • Designs an evaluation framework comparing direct generation, reasoning-guided generation, and de-contextualized generation.
  • Reveals a Reasoning Paradox: reasoning traces improve performance overall, yet retaining intermediate thoughts in the conditioning context hinders visual synthesis.

Methodology

Using the UReason benchmark, the authors compare the three generation modes across unified multimodal models and analyze the role that reasoning traces play.
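The three generation modes differ only in what conditioning context the image generator receives. A minimal sketch of that distinction is below; the `ReasoningTrace` structure, the delimiter format, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ReasoningTrace:
    """Hypothetical container for a model's chain-of-thought output:
    the intermediate thoughts plus the final refined prompt."""
    thoughts: str
    refined_prompt: str


def direct_condition(prompt: str) -> str:
    """Direct generation: condition only on the original instruction."""
    return prompt


def reasoning_guided_condition(prompt: str, trace: ReasoningTrace) -> str:
    """Reasoning-guided generation: intermediate thoughts remain in context
    (assumed delimiter format for illustration)."""
    return f"{prompt}\n<think>{trace.thoughts}</think>\n{trace.refined_prompt}"


def decontextualized_condition(trace: ReasoningTrace) -> str:
    """De-contextualized generation: condition only on the refined prompt,
    discarding the intermediate thoughts."""
    return trace.refined_prompt
```

Under the paper's Reasoning Paradox, the de-contextualized condition (refined prompt only) yields substantial gains over keeping the full trace in context, suggesting the intermediate thoughts act as interference rather than useful signal.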

Original Abstract

To elicit capabilities for addressing complex and implicit visual requirements, recent unified multimodal models increasingly adopt chain-of-thought reasoning to guide image generation. However, the actual effect of reasoning on visual synthesis remains unclear. We present UReason, a diagnostic benchmark for reasoning-driven image generation that evaluates whether reasoning can be faithfully executed in pixels. UReason contains 2,000 instances across five task families: Code, Arithmetic, Spatial, Attribute, and Text reasoning. To isolate the role of reasoning traces, we introduce an evaluation framework comparing direct generation, reasoning-guided generation, and de-contextualized generation which conditions only on the refined prompt. Across eight open-source unified models, we observe a consistent Reasoning Paradox: Reasoning traces generally improve performance over direct generation, yet retaining intermediate thoughts as conditioning context often hinders visual synthesis, and conditioning only on the refined prompt yields substantial gains. Our analysis suggests that the bottleneck lies in contextual interference rather than insufficient reasoning capacity. UReason provides a principled testbed for studying reasoning in unified models and motivates future methods that effectively integrate reasoning for visual generation while mitigating interference.

Tags

multimodal reasoning, image generation, benchmark

arXiv Categories

cs.CL cs.CV