Multimodal Learning Relevance: 9/10

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang
arXiv: 2603.12252v1 Published: 2026-03-12 Updated: 2026-03-12

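AI Summary

EndoCoT strengthens the reasoning ability of MLLMs by iteratively refining latent thought states and bridging them to the diffusion model's denoising process.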

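Key Contributions

  • Proposes the EndoCoT framework, which strengthens MLLM reasoning within diffusion models
  • Introduces an iterative thought guidance module that activates the MLLM's reasoning potential
  • Adopts a terminal thought grounding module that keeps the reasoning trajectory aligned with textual supervision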

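Methodology

By iteratively refining latent thought states and grounding them, the method strengthens the MLLM's ability to provide reasoning guidance to the DiT, so that complex tasks are solved step by step.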
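
The idea of refreshing the guidance while denoising proceeds can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the authors' implementation: the summary does not give the update rule, the refinement schedule, or how states are injected into the DiT. The residual tanh update, the linear pull toward the guidance vector, and the names refine_thought, denoise_step, and run_guided_denoising are all hypothetical.

```python
import math
import random

def refine_thought(state, weights):
    """One hypothetical refinement step: a residual tanh update of the latent state."""
    dim = len(state)
    return [
        state[i] + math.tanh(sum(weights[i][j] * state[j] for j in range(dim)))
        for i in range(dim)
    ]

def denoise_step(x, guidance, step_size=0.1):
    """Toy stand-in for one DiT denoising step: move x a fraction of the way toward the guidance."""
    return [xi + step_size * (gi - xi) for xi, gi in zip(x, guidance)]

def run_guided_denoising(initial_state, weights, noisy_x, num_steps):
    """Refine the thought state before every denoising step, so the guidance is not invariant."""
    state, x = list(initial_state), list(noisy_x)
    trajectory = []
    for _ in range(num_steps):
        state = refine_thought(state, weights)  # assumed form of iterative thought guidance
        x = denoise_step(x, state)              # this step is conditioned on the refreshed state
        trajectory.append(list(state))
    return x, trajectory

# Small random example; all sizes and values are arbitrary.
rng = random.Random(0)
dim = 4
weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
initial_state = [rng.uniform(-1, 1) for _ in range(dim)]
noisy_x = [rng.gauss(0, 1) for _ in range(dim)]
final_x, trajectory = run_guided_denoising(initial_state, weights, noisy_x, num_steps=5)
```

In this sketch the contrast with a conventional text-encoder setup is a single line: a fixed-guidance baseline would compute the state once and reuse it in every call to denoise_step.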

Original Abstract

Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) MLLMs text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.
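
One way to picture "aligning the final state with ground-truth answers" is a loss computed only on the last thought state. The sketch below is an assumption, not the paper's stated objective: the abstract does not say what form the alignment takes. Here it is taken to be a cross-entropy between a linear projection of the final state and a ground-truth answer token id. The function grounding_loss and the projection matrix are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grounding_loss(final_state, projection, answer_token):
    """Cross-entropy between the projected final thought state and one answer token id.

    projection has one row per vocabulary entry (assumed linear readout).
    """
    logits = [sum(w * s for w, s in zip(row, final_state)) for row in projection]
    probs = softmax(logits)
    return -math.log(probs[answer_token])

# Two-token toy vocabulary: the final state favours token 0.
final_state = [1.0, 0.0]
projection = [[2.0, 0.0], [0.0, 0.0]]
loss_correct = grounding_loss(final_state, projection, answer_token=0)
loss_wrong = grounding_loss(final_state, projection, answer_token=1)
```

The loss is lower when the ground-truth token is the one the final state already favours, which is the signal that would pull the end of the reasoning trajectory toward the textual answer.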

Tags

Multimodal Learning Diffusion Models Chain-of-Thought Reasoning

arXiv Categories

cs.CV cs.CL