LLM Memory & RAG relevance: 7/10

Muon in Associative Memory Learning: Training Dynamics and Scaling Laws

Binghui Li, Kaifei Wang, Han Zhong, Pinyan Lu, Liwei Wang
arXiv: 2602.05725v1 Published: 2026-02-05 Updated: 2026-02-05

AI Summary

This paper studies the training dynamics and scaling laws of the Muon optimizer in associative memory learning, explaining why it outperforms gradient descent.

Key Contributions

  • Proves that Muon achieves an exponential speedup over gradient descent in the noiseless case
  • Derives Muon's optimization scaling law in the noisy case and shows it scales better than gradient descent
  • Shows that Muon can be interpreted as an implicit matrix preconditioner

Methodology

Using a linear associative memory model with softmax retrieval, the paper analyzes the learning rates and scaling behavior of Muon and gradient descent across frequency components of the query-answer distribution.
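The setup above can be sketched in a few lines. Everything here is illustrative, not taken from the paper: the dimensions, the unit-scale random embeddings, the learning rate, and the identity pairing of queries to answers are assumptions, and the paper's hierarchical frequency spectrum over pairs is omitted. The block shows one plain gradient-descent step (the baseline Muon is compared against) on the cross-entropy of softmax retrieval scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): d-dim embeddings, N query-answer pairs.
d, N = 32, 8
E_q = rng.standard_normal((N, d)) / np.sqrt(d)  # query embeddings
E_a = rng.standard_normal((N, d)) / np.sqrt(d)  # answer embeddings
W = np.zeros((d, d))                            # linear associative memory matrix

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def retrieve(W):
    # Score for (query i, answer j) is e_q_i^T W e_a_j; softmax over answers.
    return softmax(E_q @ W @ E_a.T)

def loss(W):
    # Cross-entropy where query i's correct answer is assumed to be answer i.
    p = retrieve(W)
    return -np.mean(np.log(p[np.arange(N), np.arange(N)]))

# One plain gradient-descent step on W.
probs = retrieve(W)
grad = E_q.T @ (probs - np.eye(N)) @ E_a / N  # dL/dW for the cross-entropy above
loss_before = loss(W)
W = W - 1.0 * grad
loss_after = loss(W)
```

The paper's analysis concerns how fast steps like this learn different frequency components; in this toy version all pairs are equally frequent, so the imbalance GD suffers from does not appear.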

Original Abstract

Muon updates matrix parameters via the matrix sign of the gradient and has shown strong empirical gains, yet its dynamics and scaling behavior remain unclear in theory. We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query-answer pairs, with and without label noise. In this setting, we show that Gradient Descent (GD) learns frequency components at highly imbalanced rates, leading to slow convergence bottlenecked by low-frequency components. In contrast, the Muon optimizer mitigates this imbalance, leading to faster and more uniform progress. Specifically, in the noiseless case, Muon achieves an exponential speedup over GD; in the noisy case with a power-decay frequency spectrum, we derive Muon's optimization scaling law and demonstrate its superior scaling efficiency over GD. Furthermore, we show that Muon can be interpreted as an implicit matrix preconditioner arising from adaptive task alignment and block-symmetric gradient structure. In contrast, the preconditioner with coordinate-wise sign operator could match Muon under oracle access to unknown task representations, which is infeasible for SignGD in practice. Experiments on synthetic long-tail classification and LLaMA-style pre-training corroborate the theory.
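The update the abstract opens with — replacing the raw gradient with its matrix sign — can be sketched as follows. The function names are hypothetical, momentum is omitted, and the cubic Newton-Schulz iteration used here is one standard way to approximate the matrix sign (the orthogonal polar factor U V^T of the gradient's SVD); it is a simplification, not the exact iteration any particular Muon implementation uses.

```python
import numpy as np

def matrix_sign_ns(G, steps=40):
    """Approximate the matrix sign of G, i.e. the orthogonal factor U V^T of
    the SVD G = U S V^T, via the cubic Newton-Schulz iteration."""
    X = G / (np.linalg.norm(G) + 1e-12)  # Frobenius norm keeps singular values in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # pushes every singular value toward 1
    return X

def muon_step(W, G, lr=0.1):
    # Muon updates with the matrix sign of the gradient, so all singular
    # directions move at the same rate -- the implicit matrix preconditioning
    # the paper analyzes (momentum omitted for brevity).
    return W - lr * matrix_sign_ns(G)

# Sanity check: the matrix sign of a random matrix is close to orthogonal.
rng = np.random.default_rng(0)
G = rng.standard_normal((16, 16))
sv = np.linalg.svd(matrix_sign_ns(G), compute_uv=False)
```

Because every singular value of the update is pushed to 1, low-frequency components (which produce small-singular-value gradient directions under GD) receive the same step size as high-frequency ones, which is the intuition behind the speedup claimed in the abstract.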

Tags

Optimizer Muon Associative Memory Scaling Laws

arXiv Categories

cs.LG math.OC stat.ML