LLM Reasoning relevance: 8/10

Mixture-of-Depths Attention

Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang
arXiv: 2603.15619v1 Published: 2026-03-16 Updated: 2026-03-16

AI Summary

MoDA addresses the signal degradation that arises when scaling LLM depth by introducing a mixture-of-depths attention mechanism, improving model performance.

Key Contributions

  • Proposes mixture-of-depths attention (MoDA), which allows attention heads to attend to KV pairs from both the current layer and preceding layers
  • Designs a hardware-efficient MoDA algorithm that optimizes non-contiguous memory access
  • Demonstrates experimentally that MoDA significantly improves LLM performance across multiple tasks

Methodology

Introduces MoDA, which lets attention heads access KV pairs at different depths, with an efficient algorithm that optimizes memory access and reduces computational overhead.

Original Abstract

Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA .
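The core idea in the abstract, letting each head attend to the current layer's sequence KV pairs plus depth KV pairs from preceding layers, can be illustrated with a minimal sketch. This is not the paper's implementation (which is hardware-optimized and handles non-contiguous memory); the function name, shapes, and the choice to simply concatenate depth KVs onto the sequence KVs are illustrative assumptions, and causal masking and multi-head details are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moda_attention(q, k_seq, v_seq, k_depth, v_depth):
    """Toy mixture-of-depths attention (hypothetical sketch).

    q, k_seq, v_seq: [T, d] queries and current-layer sequence KV pairs.
    k_depth, v_depth: [L_prev, d] depth KV pairs gathered from preceding layers.
    Each query attends jointly over sequence and depth KV pairs.
    """
    k = np.concatenate([k_seq, k_depth], axis=0)   # [T + L_prev, d]
    v = np.concatenate([v_seq, v_depth], axis=0)   # [T + L_prev, d]
    scores = q @ k.T / np.sqrt(q.shape[-1])        # [T, T + L_prev]
    return softmax(scores, axis=-1) @ v            # [T, d]

# toy usage
rng = np.random.default_rng(0)
T, d, L_prev = 4, 8, 2
out = moda_attention(
    rng.standard_normal((T, d)),
    rng.standard_normal((T, d)), rng.standard_normal((T, d)),
    rng.standard_normal((L_prev, d)), rng.standard_normal((L_prev, d)),
)
print(out.shape)  # (4, 8)
```

The output keeps the usual attention shape; the only change from standard attention is the extra `L_prev` depth entries each query can attend to, which is how shallow-layer features stay directly reachable from deep layers.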

Tags

Attention Mechanisms  Deep Learning  Language Models  Model Optimization

arXiv Categories

cs.CL cs.AI