LLM Reasoning relevance: 8/10

Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu
arXiv: 2602.23225v1 Published: 2026-02-26 Updated: 2026-02-26

AI Summary

The paper analyzes why parallel decoding in diffusion language models (DLMs) degenerates into autoregressive decoding, and proposes NAP, a data-driven method that improves parallel-decoding performance.

Key Contributions

  • Identifies training data as one cause of DLM parallel decoding degenerating into autoregressive behavior
  • Proposes NAP, which improves parallel decoding through data curation and a parallel-forced decoding strategy
  • Shows experimentally that NAP improves parallel-decoding performance on math reasoning benchmarks

Methodology

Starting from a diagnosis that DLM training objectives are mismatched with the highly sequential structure of standard training data, the paper proposes the data-centric NAP method, which combines data curation (multiple independent reasoning trajectories per example) with a parallel-forced decoding strategy.
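The "parallel-forced" idea can be illustrated with a minimal sketch. The function below is a hypothetical simplification, not the paper's exact algorithm: at each step it commits every masked position whose confidence clears a threshold, but forces at least `k` positions to be unmasked per step so decoding cannot collapse into one-token-at-a-time, AR-like behavior. For simplicity, per-position confidences are fixed; a real DLM would re-score the remaining masked positions after every step.

```python
import numpy as np

def parallel_forced_decode(confidences, k=4, tau=0.9):
    """Hypothetical parallel-forced decoding schedule (illustrative only).

    confidences: (seq_len,) array of per-position model confidences.
    k:           minimum number of positions to unmask per step.
    tau:         confidence threshold for committing a position.

    Returns the number of decoding steps and the positions revealed
    at each step.
    """
    seq_len = len(confidences)
    masked = np.ones(seq_len, dtype=bool)
    steps, order = 0, []
    while masked.any():
        idx = np.flatnonzero(masked)
        conf = confidences[idx]
        # Commit every masked position confident enough to decode in parallel.
        commit = idx[conf >= tau]
        if len(commit) < k:
            # Force parallelism: take the k most confident masked positions,
            # even if they fall below the threshold.
            take = min(k, len(idx))
            commit = idx[np.argsort(conf)[::-1][:take]]
        masked[commit] = False
        order.append(sorted(commit.tolist()))
        steps += 1
    return steps, order

# With 16 positions and k=4, decoding finishes in at most 16/4 = 4 steps,
# whereas a fully AR-like schedule would need 16.
steps, order = parallel_forced_decode(np.linspace(0.1, 0.95, 16), k=4, tau=0.9)
print(steps)
```

The forced minimum `k` is what distinguishes this from ordinary confidence-based unmasking, where a conservative threshold lets the schedule quietly revert to sequential decoding.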

Original Abstract

Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.

Tags

Diffusion Language Models, Non-Autoregressive Decoding, Parallel Decoding, Chain-of-Thought

arXiv Categories

cs.CL cs.AI