ERNIE 5.0 Technical Report
AI Summary
ERNIE 5.0 is a natively autoregressive foundation model that unifies multimodal understanding and generation, featuring elastic training and an MoE architecture.
Key Contributions
- Proposes ERNIE 5.0, a natively autoregressive foundation model that unifies multimodal understanding and generation
- Adopts an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing
- Introduces an elastic training paradigm that supports flexible trade-offs among performance, model size, and inference latency
Methodology
The model uses a natively autoregressive architecture with a unified next-group-of-tokens prediction objective, an ultra-sparse MoE architecture, modality-agnostic expert routing, and elastic training.
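To make "modality-agnostic expert routing" concrete, the following is a minimal sketch of ultra-sparse top-k MoE routing: a single shared gate scores all experts for every token, regardless of whether the token came from text, image, video, or audio. The function name, shapes, and top-k mechanism are illustrative assumptions, not the report's actual implementation.

```python
import numpy as np

def modality_agnostic_topk_routing(x, gate_w, top_k=2):
    """Hypothetical sketch of ultra-sparse MoE routing.

    One shared gate scores every expert for every token, with no
    per-modality branching; only the top_k highest-scoring experts
    are activated per token (the "ultra-sparse" part).

    x:      (tokens, d_model) hidden states from any modality
    gate_w: (d_model, num_experts) shared routing weights
    """
    logits = x @ gate_w                                   # (tokens, num_experts)
    # Indices of the top_k experts per token (order within the k is irrelevant)
    idx = np.argpartition(-logits, top_k - 1, axis=-1)[:, :top_k]
    chosen = np.take_along_axis(logits, idx, axis=-1)     # (tokens, top_k)
    # Softmax over only the selected experts -> combination weights
    e = np.exp(chosen - chosen.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights, idx
```

Because the gate is shared across modalities, any modality specialization that emerges (as visualized in the report's routing analysis) comes from learned expert preferences rather than hard-coded routing rules.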
Original Abstract
In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.
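The elastic training paradigm described above yields a family of sub-models varying in depth, expert capacity, and routing sparsity. The sketch below illustrates how a deployer might describe that family and select a member under a compute budget; the config fields, the per-token cost proxy, and the selection rule are all illustrative assumptions, not details from the report.

```python
from dataclasses import dataclass

@dataclass
class SubModelConfig:
    """Hypothetical descriptor for one member of the elastic family.

    The abstract says sub-models vary in depth, expert capacity, and
    routing sparsity; the names and cost model here are illustrative.
    """
    depth: int              # transformer layers retained
    experts_per_layer: int  # total experts available per MoE layer
    top_k: int              # experts activated per token (routing sparsity)

    def per_token_cost(self, d_model=4096, d_expert=1024):
        # Rough proxy for per-token compute: layers x activated expert FLOPs.
        return self.depth * self.top_k * d_model * d_expert

def pick_submodel(candidates, budget):
    """Return the highest-cost sub-model whose proxy cost fits the budget,
    or None if no candidate fits (e.g. on very constrained hardware)."""
    feasible = [c for c in candidates if c.per_token_cost() <= budget]
    return max(feasible, key=lambda c: c.per_token_cost(), default=None)
```

This captures the trade-off the abstract highlights: all members come from a single pre-training run, so switching between them is a deployment-time choice rather than a retraining effort.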