LLM Reasoning relevance: 8/10

ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns

Ziyu Zhao, Tong Zhu, Zhi Zhang, Tiantian Fan, Jinluan Yang, Kun Kuang, Zhongyu Wei, Fei Wu, Yu Cheng
arXiv: 2602.15521v1 Published: 2026-02-17 Updated: 2026-02-17

AI Summary

ExpertWeaver leverages GLU activation patterns to convert dense LLMs into efficient MoEs, requiring no training and outperforming existing methods.

Key Contributions

  • Proposes the ExpertWeaver framework, a training-free method for converting dense models into MoEs
  • Shows that GLU activation patterns reveal an inherent MoE structure within LLMs
  • Outperforms existing methods in both dynamic structural pruning and MoE initialization

Methodology

ExpertWeaver partitions neurons according to their activation patterns, constructing shared experts and specialized routed experts to realize an MoE structure with layer-adaptive configurations.
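The partitioning idea can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's implementation: I assume activation statistics are summarized as per-neuron firing frequencies over some calibration data, that a simple frequency threshold separates consistently activated (shared) neurons from dynamically activated (specialized) ones, and that specialized neurons are grouped into routed experts by a naive random split (the paper's grouping is pattern-based).

```python
import numpy as np

def partition_neurons(act_freq, shared_threshold=0.9):
    """Split neuron indices into a shared group (consistently active)
    and a specialized group (dynamically active).

    `shared_threshold` is an illustrative assumption, not a value
    from the paper."""
    act_freq = np.asarray(act_freq)
    shared = np.where(act_freq >= shared_threshold)[0]
    specialized = np.where(act_freq < shared_threshold)[0]
    return shared, specialized

def build_experts(specialized, n_experts=4, seed=0):
    """Group specialized neurons into routed experts.

    A naive equal-size random split stands in for the paper's
    activation-pattern-based grouping."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(specialized)
    return np.array_split(perm, n_experts)

# Toy example: 16 neurons with synthetic activation frequencies.
freqs = np.array([0.95, 0.2, 0.97, 0.5, 0.1, 0.92, 0.3, 0.6,
                  0.99, 0.4, 0.15, 0.91, 0.7, 0.05, 0.93, 0.8])
shared, specialized = partition_neurons(freqs)
experts = build_experts(specialized, n_experts=2)
print(len(shared), len(specialized))  # 6 shared, 10 specialized neurons
```

Shared neurons would be woven into a shared expert that every token passes through, while each routed expert owns one group of specialized neurons; a per-layer threshold would give the layer-adaptive configurations the summary mentions.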

Original Abstract

Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. However, training high-quality MoEs from scratch is prohibitively expensive. A promising alternative is to convert pretrained dense models into sparse MoEs. Existing dense-to-MoE methods fall into two categories: dynamic structural pruning, which converts dense models into MoE architectures with moderate sparsity to balance performance and inference efficiency, and downcycling approaches, which use pretrained dense models to initialize highly sparse MoE architectures. However, existing methods break the intrinsic activation patterns within dense models, leading to suboptimal expert construction. In this work, we argue that the Gated Linear Unit (GLU) mechanism provides a natural blueprint for dense-to-MoE conversion. We show that the fine-grained neural-wise activation patterns of GLU reveal a coarse-grained structure, uncovering an inherent MoE architecture composed of consistently activated universal neurons and dynamically activated specialized neurons. Leveraging this discovery, we introduce ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations. Our experiments demonstrate that ExpertWeaver significantly outperforms existing methods, both as a training-free dynamic structural pruning technique and as a downcycling strategy for superior MoE initialization.
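For readers unfamiliar with the GLU mechanism the abstract builds on, here is a minimal SwiGLU-style feed-forward sketch that exposes the per-neuron gate values whose sparsity patterns the paper exploits. The matrix names, the SiLU gate, and the binarization threshold are illustrative assumptions, not the paper's code.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def glu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward: y = W_down @ (silu(W_gate @ x) * (W_up @ x)).

    Returns the layer output plus the intermediate gate activations,
    one per hidden neuron."""
    gate = silu(W_gate @ x)      # per-neuron gating signal
    hidden = gate * (W_up @ x)   # element-wise modulation of the up-projection
    return W_down @ hidden, gate

rng = np.random.default_rng(0)
d, h = 8, 32                     # model dim, hidden (intermediate) dim
W_gate = rng.normal(size=(h, d))
W_up = rng.normal(size=(h, d))
W_down = rng.normal(size=(d, h))
x = rng.normal(size=d)

y, gate = glu_ffn(x, W_gate, W_up, W_down)
# A crude "is this neuron active for this token?" mask; collecting such
# masks over many tokens yields the activation statistics used to tell
# universal neurons (almost always on) from specialized ones.
active = np.abs(gate) > 0.1
print(y.shape, int(active.sum()))
```

Neurons whose gate fires for nearly every input are candidates for the shared expert; neurons that fire only on specific inputs are candidates for routed experts.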

Tags

Mixture-of-Experts MoE Gated Linear Unit GLU Dense-to-Sparse

arXiv Categories

cs.CL cs.LG