LLM Memory & RAG relevance: 7/10

FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, Tri Dao
arXiv: 2603.05451v1 Published: 2026-03-05 Updated: 2026-03-05

AI Summary

FlashAttention-4 targets the Blackwell GPU architecture, optimizing the attention mechanism to improve compute efficiency.

Key Contributions

  • An algorithm co-designed for the asymmetric hardware scaling of Blackwell GPUs
  • Software-emulated exponentials and conditional softmax rescaling to reduce non-matmul operations
  • Use of tensor memory and the 2-CTA MMA mode to reduce shared-memory traffic and atomic adds
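The software-emulated exponential in the second bullet replaces the GPU's slow special-function units with a handful of fast multiply-adds. A minimal NumPy sketch of the standard technique: split x into integer and fractional parts, approximate 2^f with a short polynomial, and apply the integer part through exponent manipulation. The polynomial coefficients here are illustrative stand-ins, not the kernel's actual values.

```python
import numpy as np

def exp2_poly(x):
    # 2^x = 2^n * 2^f, with n integer and f in [0, 1)
    x = np.asarray(x, dtype=np.float64)
    n = np.floor(x)
    f = x - n
    # Cubic polynomial approximating 2^f on [0, 1)
    # (illustrative coefficients, not from the FlashAttention-4 kernel)
    p = 1.0 + f * (0.6958 + f * (0.2252 + f * 0.0790))
    # Multiply by 2^n exactly via exponent-bit manipulation
    return np.ldexp(p, n.astype(np.int64))
```

On hardware, the polynomial becomes a few FMA instructions on the general-purpose units, trading a little accuracy for throughput that keeps pace with the doubled tensor-core rate.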

Methodology

Attention computation on Blackwell GPUs is optimized through redesigned pipelines, software-emulated exponential functions, and the use of tensor memory, among other techniques.
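The conditional softmax rescaling mentioned above builds on the online-softmax recurrence that all FlashAttention variants use: process keys/values tile by tile, track a running row max, and rescale the accumulator only when that max actually changes. A minimal single-query NumPy sketch of the idea (the `block` size and variable names are illustrative, not the kernel's):

```python
import numpy as np

def attention_online(q, k, v, block=2):
    # q: (d,), k/v: (n, d). Running max m, denominator l, output acc.
    d = q.shape[-1]
    m = -np.inf
    l = 0.0
    acc = np.zeros(d)
    for start in range(0, k.shape[0], block):
        s = q @ k[start:start + block].T / np.sqrt(d)  # scores for this tile
        m_new = max(m, s.max())
        # Conditional rescaling: skip the correction when the max is unchanged
        if m_new != m:
            scale = np.exp(m - m_new)  # exp(-inf) = 0 on the first tile
            l *= scale
            acc *= scale
            m = m_new
        p = np.exp(s - m)
        l += p.sum()
        acc += p @ v[start:start + block]
    return acc / l
```

Since the running max stabilizes quickly in practice, most tiles skip the rescale entirely, cutting non-matmul work on the units that did not scale up with the tensor cores.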

Original Abstract

Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. While FlashAttention-3 optimized attention for Hopper GPUs through asynchronous execution and warp specialization, it primarily targets the H100 architecture. The AI industry has rapidly transitioned to deploying Blackwell-based systems such as the B200 and GB200, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling: tensor core throughput doubles while other functional units (shared memory bandwidth, exponential units) scale more slowly or remain unchanged. We develop several techniques to address these shifting bottlenecks on Blackwell GPUs: (1) redesigned pipelines that exploit fully asynchronous MMA operations and larger tile sizes, (2) software-emulated exponential and conditional softmax rescaling that reduces non-matmul operations, and (3) leveraging tensor memory and the 2-CTA MMA mode to reduce shared memory traffic and atomic adds in the backward pass. We demonstrate that our method, FlashAttention-4, achieves up to 1.3$\times$ speedup over cuDNN 9.13 and 2.7$\times$ over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s (71% utilization). Beyond algorithmic innovations, we implement FlashAttention-4 entirely in CuTe-DSL embedded in Python, achieving 20-30$\times$ faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity.

Tags

Attention Transformer GPU Blackwell FlashAttention

arXiv Categories

cs.CL