Deep Kernel Fusion for Transformers
arXiv: 2602.11808v1
Published: 2026-02-12
Updated: 2026-02-12
AI Summary
Proposes DeepFusionKernel, a deeply fused kernel that targets the memory-bandwidth bottleneck of SwiGLU MLP blocks in Transformers, improving inference speed.
Key Contributions
- Proposes DeepFusionKernel to optimize SwiGLU MLP blocks
- Reduces HBM traffic and improves cache reuse
- Integrates with SGLang and provides a kernel scheduler, improving inference speed
Methodology
By deeply fusing kernels, the method reduces HBM traffic and improves cache utilization, thereby accelerating Transformer inference.
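For context, the SwiGLU MLP block that the kernel targets can be sketched as follows. This is a minimal NumPy reference of the standard SwiGLU formulation (not the paper's fused kernel); the function and variable names are illustrative, and the toy dimensions are assumptions.

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    # Standard SwiGLU MLP: silu(x @ W_gate) gates x @ W_up elementwise,
    # then the result is projected back to d_model. The three large
    # weight matrices are why this block is memory-bandwidth bound.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # toy sizes; real models use e.g. 4096 / 14336
x = rng.standard_normal((2, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
y = swiglu_mlp(x, w_gate, w_up, w_down)
print(y.shape)  # (2, 8)
```

An unfused implementation launches separate kernels for the gate, up, and down projections, materializing the (tokens × d_ff) intermediates in HBM between them; deep fusion keeps those intermediates on-chip.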
Original Abstract
Agentic LLM inference with long contexts is increasingly limited by memory bandwidth rather than compute. In this setting, SwiGLU MLP blocks, whose large weights exceed cache capacity, become a major yet under-optimized bottleneck. We propose DeepFusionKernel, a deeply fused kernel that cuts HBM traffic and boosts cache reuse, delivering up to 13.2% speedup on H100 and 9.7% on A100 over SGLang. Integrated with SGLang and paired with a kernel scheduler, DeepFusionKernel ensures consistent accelerations over generation lengths, while remaining adaptable to diverse models, inference configurations, and hardware platforms.
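Why fusion cuts HBM traffic can be illustrated with a rough byte-counting model. This is a simplified roofline-style sketch, not the paper's analysis: it assumes a decode-time regime where weight reads dominate, and that an unfused pipeline writes and re-reads the two (tokens × d_ff) intermediates while a fused kernel keeps them on-chip.

```python
def mlp_hbm_bytes(d_model, d_ff, n_tokens, dtype_bytes=2, fused=False):
    # Weights are read once per forward pass either way:
    # W_gate and W_up are (d_model, d_ff), W_down is (d_ff, d_model).
    weight_bytes = 3 * d_model * d_ff * dtype_bytes
    # Unfused: the gate and up intermediates are each written to HBM
    # and read back by the next kernel (hence the factor of 4).
    # Fused: they stay in registers / shared memory (assumption).
    act_bytes = 0 if fused else 4 * n_tokens * d_ff * dtype_bytes
    return weight_bytes + act_bytes

# Illustrative sizes (roughly Llama-3-8B-scale MLP), fp16, one decode token:
unfused = mlp_hbm_bytes(4096, 14336, n_tokens=1)
fused = mlp_hbm_bytes(4096, 14336, n_tokens=1, fused=True)
print(f"unfused: {unfused/1e6:.1f} MB, fused: {fused/1e6:.1f} MB")
```

At batch 1 the savings are modest because weights dominate; as the number of in-flight tokens grows (long-context agentic serving with many concurrent requests), the eliminated intermediate traffic grows linearly, which is consistent with the abstract's claim that gains hold across generation lengths.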