Deep Kernel Fusion for Transformers
arXiv: 2602.11808v1
Published: 2026-02-12
Updated: 2026-02-12
AI Summary
Proposes DeepFusionKernel, a deeply fused kernel that targets the memory-bandwidth bottleneck of SwiGLU MLP blocks in Transformers, improving inference speed.
Key Contributions
- Proposes DeepFusionKernel to optimize SwiGLU MLP blocks
- Reduces HBM traffic and improves cache reuse
- Integrates with SGLang and provides a kernel scheduler, improving inference speed
Methodology
By deeply fusing kernels, the method reduces HBM traffic and improves cache utilization, thereby accelerating Transformer inference.
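For context, the SwiGLU MLP block that the kernel targets can be sketched as follows. This is a minimal NumPy reference of the standard SwiGLU formulation (not the paper's fused kernel); the function and variable names are illustrative, and the toy dimensions are assumptions.

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # Standard SwiGLU MLP: silu(x @ W_gate) gates x @ W_up elementwise,
    # then the result is projected back to d_model. The three large
    # weight matrices are why this block is memory-bandwidth bound.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # toy sizes; real models use e.g. 4096 / 14336
x = rng.standard_normal((2, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
y = swiglu_mlp(x, w_gate, w_up, w_down)
print(y.shape)  # (2, 8)
```

An unfused implementation launches separate kernels for the gate, up, and down projections, materializing the (tokens × d_ff) intermediates in HBM between them; deep fusion keeps those intermediates on-chip.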
Original Abstract
Agentic LLM inference with long contexts is increasingly limited by memory bandwidth rather than compute. In this setting, SwiGLU MLP blocks, whose large weights exceed cache capacity, become a major yet under-optimized bottleneck. We propose DeepFusionKernel, a deeply fused kernel that cuts HBM traffic and boosts cache reuse, delivering up to 13.2% speedup on H100 and 9.7% on A100 over SGLang. Integrated with SGLang and paired with a kernel scheduler, DeepFusionKernel ensures consistent accelerations over generation lengths, while remaining adaptable to diverse models, inference configurations, and hardware platforms.
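Why fusion cuts HBM traffic can be illustrated with a rough byte-counting model. This is a simplified roofline-style sketch, not the paper's analysis: it assumes a decode-time regime where weight reads dominate, and that an unfused pipeline writes and re-reads the two (tokens × d_ff) intermediates while a fused kernel keeps them on-chip.

```python
def mlp_hbm_bytes(d_model, d_ff, n_tokens, dtype_bytes=2, fused=False):
    # Weights are read once per forward pass either way:
    # W_gate and W_up are (d_model, d_ff), W_down is (d_ff, d_model).
    weight_bytes = 3 * d_model * d_ff * dtype_bytes
    # Unfused: the gate and up intermediates are each written to HBM
    # and read back by the next kernel (hence the factor of 4).
    # Fused: they stay in registers / shared memory (assumption).
    act_bytes = 0 if fused else 4 * n_tokens * d_ff * dtype_bytes
    return weight_bytes + act_bytes

# Illustrative sizes (roughly Llama-3-8B-scale MLP), fp16, one decode token:
unfused = mlp_hbm_bytes(4096, 14336, n_tokens=1)
fused = mlp_hbm_bytes(4096, 14336, n_tokens=1, fused=True)
print(f"unfused: {unfused/1e6:.1f} MB, fused: {fused/1e6:.1f} MB")
```

At batch 1 the savings are modest because weights dominate; as the number of in-flight tokens grows (long-context agentic serving with many concurrent requests), the eliminated intermediate traffic grows linearly, which is consistent with the abstract's claim that gains hold across generation lengths.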