LLM Memory & RAG relevance: 8/10

Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE

Mohammad Aflah Khan, Krishna P. Gummadi, Manish Gupta, Abhilasha Ravichander
arXiv: 2603.11611v1 Published: 2026-03-12 Updated: 2026-03-12

AI Summary

Studies the effect of partial RoPE on Transformer performance, finding that applying RoPE to only a small fraction of dimensions matches the performance of full RoPE while yielding significant memory savings.

Main Contributions

  • Studies the impact of partial RoPE on model performance and convergence
  • Finds that applying RoPE to only a small fraction of dimensions achieves performance comparable to full RoPE
  • Reveals training instability in NoPE models and shows that minimal RoPE application mitigates it

Methodology

Systematic experiments across architectures and datasets analyze the effect of partial RoPE on training dynamics and convergence, comparing it against full RoPE and NoPE.
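To make the setup concrete, here is a minimal sketch of partial RoPE in NumPy: the rotary transform is applied only to the first rot_dim dimensions of each attention head, and the remaining dimensions pass through unrotated. The function names, the interleaved pairing of dimensions, and the 10% rotary fraction are illustrative assumptions; this summary does not specify the paper's exact implementation.

```python
import numpy as np

def rope_cache(seq_len, rot_dim, base=10000.0):
    """Precompute cos/sin tables for the rotated sub-dimensions only.

    rot_dim is the number of head dimensions that receive the rotary
    transform; with partial RoPE this is a small fraction of head_dim,
    so the cache shrinks proportionally.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, rot_dim, 2) / rot_dim))
    angles = np.outer(np.arange(seq_len), inv_freq)          # (seq_len, rot_dim/2)
    return np.cos(angles), np.sin(angles)

def apply_partial_rope(x, cos, sin, rot_dim):
    """Rotate only the first rot_dim dimensions; pass the rest through.

    x: (seq_len, head_dim) query or key slice for a single head.
    """
    x_rot, x_pass = x[:, :rot_dim], x[:, rot_dim:]
    x1, x2 = x_rot[:, 0::2], x_rot[:, 1::2]                  # interleaved pairs
    rotated = np.empty_like(x_rot)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rotated, x_pass], axis=-1)

# Example: rotate only ~10% of a 64-dim head (an assumed configuration).
seq_len, head_dim, rotary_frac = 128, 64, 0.1
rot_dim = max(2, int(head_dim * rotary_frac) // 2 * 2)       # keep it even
cos, sin = rope_cache(seq_len, rot_dim)
q = np.random.randn(seq_len, head_dim)
q_rope = apply_partial_rope(q, cos, sin, rot_dim)
print(q_rope.shape)  # (128, 64)
```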

Original Abstract

Rotary Positional Embedding (RoPE) is a common choice in transformer architectures for encoding relative positional information. Although earlier work has examined omitting RoPE in specific layers, the effect of varying the fraction of hidden dimensions that receive rotary transformations remains largely unexplored. This design choice can yield substantial memory savings, which becomes especially significant at long context lengths. We find up to 10x memory savings over the standard RoPE cache, while achieving comparable final loss. In this work, we present a systematic study examining the impact of partial RoPE on training dynamics and convergence across architectures and datasets. Our findings uncover several notable patterns: (1) applying RoPE to only a small fraction of dimensions (around 10%) achieves convergence comparable to using full RoPE; (2) these trends hold consistently across model size, sequence lengths and datasets of varying quality and architectures, with higher-quality data resulting in lower overall loss and similar benchmark performance; and (3) some models trained with NoPE (No Positional Encoding) showcase unstable learning trajectories, which can be alleviated through minimal RoPE application or QK-Norm which converges to a higher loss. Together, these results offer practical guidance for model designers aiming to balance efficiency and training stability, while emphasizing the previously overlooked importance of partial RoPE.
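The roughly 10x saving over the standard RoPE cache follows from the cos/sin tables scaling linearly with the number of rotated dimensions. A back-of-envelope sketch, assuming an fp32 table of shape (seq_len, rot_dim) for each of cos and sin (the actual cache layout may differ from this assumption):

```python
def rope_cache_bytes(seq_len, head_dim, rotary_frac, bytes_per_elem=4):
    # Size of a hypothetical cos table plus sin table, each (seq_len, rot_dim).
    rot_dim = int(head_dim * rotary_frac)
    return 2 * seq_len * rot_dim * bytes_per_elem

seq_len, head_dim = 131_072, 128      # long-context example
full = rope_cache_bytes(seq_len, head_dim, 1.0)
partial = rope_cache_bytes(seq_len, head_dim, 0.1)
print(full / 2**20, "MiB vs", partial / 2**20, "MiB")  # ~128 MiB vs ~12 MiB, ~10x smaller
```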

Tags

RoPE Positional Encoding Transformer Memory Efficiency

arXiv Categories

cs.LG cs.CL