Relevance to AI Agents: 7/10

Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism

Chenwei Cui, Rockwell Jackson, Benjamin Joseph Herrera, Ana María Tárano, Hannah Kerner
arXiv: 2602.04870v1 | Published: 2026-02-04 | Updated: 2026-02-04

AI Summary

Proposes Multi-Head LatentMoE and Head Parallel, achieving communication-efficient and deterministic parallel MoE training.

Key Contributions

  • Proposes the Multi-Head LatentMoE architecture
  • Proposes the Head Parallel (HP) parallelism method
  • Optimizes the IO and expert computation of Multi-Head LatentMoE

Methodology

Designs a new MoE architecture and parallelism strategy that optimizes communication cost, load balancing, and compute efficiency, and validates them experimentally.

Original Abstract

Large language models have transformed many applications but remain expensive to train. Sparse Mixture of Experts (MoE) addresses this through conditional computation, with Expert Parallel (EP) as the standard distributed training method. However, EP has three limitations: communication cost grows linearly with the number of activated experts $k$, load imbalance affects latency and memory usage, and data-dependent communication requires metadata exchange. We propose Multi-Head LatentMoE and Head Parallel (HP), a new architecture and parallelism achieving $O(1)$ communication cost regardless of $k$, completely balanced traffic, and deterministic communication, all while remaining compatible with EP. To accelerate Multi-Head LatentMoE, we propose IO-aware routing and expert computation. Compared to MoE with EP, Multi-Head LatentMoE with HP trains up to $1.61\times$ faster while having identical performance. With doubled granularity, it achieves higher overall performance while still being $1.11\times$ faster. Our method makes multi-billion-parameter foundation model research more accessible.
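The abstract's central contrast is that EP's communication cost grows linearly with the number of activated experts $k$, while HP is claimed to be $O(1)$ in $k$. A minimal sketch of that scaling difference, assuming a simple per-token dispatch model (the functions `ep_comm_volume` and `hp_comm_volume` are hypothetical illustrations based only on the abstract, not the paper's actual implementation):

```python
# Toy cost model for MoE dispatch traffic, based only on the abstract's
# complexity claims. Assumption: EP sends each token's hidden activation
# to each of its k activated experts (all-to-all), so volume scales with k;
# HP (as claimed) exchanges a fixed amount per token, independent of k.

def ep_comm_volume(tokens: int, hidden: int, k: int) -> int:
    """Expert Parallel: dispatch volume grows linearly with k."""
    return tokens * hidden * k

def hp_comm_volume(tokens: int, hidden: int, k: int) -> int:
    """Head Parallel (claimed): constant in k, balanced and deterministic."""
    return tokens * hidden

for k in (1, 2, 4, 8):
    ep = ep_comm_volume(1024, 4096, k)
    hp = hp_comm_volume(1024, 4096, k)
    print(f"k={k}: EP={ep:,} elements, HP={hp:,} elements, ratio={ep // hp}x")
```

Under this model, doubling $k$ doubles EP traffic while leaving HP traffic unchanged, which is consistent with the abstract's claim that finer granularity (larger $k$) stays fast under HP.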

Tags

MoE · Parallel Computing · Distributed Training · Large Language Models

arXiv Category

cs.LG