LLM Memory & RAG Relevance: 7/10

From Buffers to Registers: Unlocking Fine-Grained FlashAttention with Hybrid-Bonded 3D NPU Co-Design

Jinxin Yu, Yudong Pan, Mengdi Wang, Huawei Li, Yinhe Han, Xiaowei Li, Ying Wang

arXiv: 2602.11016v1 Published: 2026-02-11 Updated: 2026-02-11

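AI Summary

Proposes the 3D-Flow architecture and the 3D-FlashAttention method to accelerate Transformer models, reducing energy consumption and improving speed.

Key Contributions

Proposes 3D-Flow, a hybrid-bonded, 3D-stacked spatial accelerator
Designs 3D-FlashAttention, a fine-grained scheduling method
Experimental results show that the 3D accelerator cuts energy consumption by 46-93% and achieves 1.4x-7.6x speedups over state-of-the-art 2D and 3D designs

Methodology

Designs a 3D spatial accelerator architecture and a dataflow scheduling method, then evaluates performance by running Transformer models on the proposed architecture.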

Original Abstract

Transformer-based models dominate modern AI workloads but exacerbate memory bottlenecks due to their quadratic attention complexity and ever-growing model sizes. Existing accelerators, such as Groq and Cerebras, mitigate off-chip traffic with large on-chip caches, while algorithmic innovations such as FlashAttention fuse operators to avoid materializing large attention matrices. However, as off-chip traffic decreases, our measurements show that on-chip SRAM accesses account for over 60% of energy in long-sequence workloads, making cache access the new bottleneck. We propose 3D-Flow, a hybrid-bonded, 3D-stacked spatial accelerator that enables register-to-register communication across vertically partitioned PE tiers. Unlike 2D multi-array architectures limited by NoC-based router-to-router transfers, 3D-Flow leverages sub-10 µm vertical TSVs to sustain cycle-level operator pipelining with minimal overhead. On top of this architecture, we design 3D-FlashAttention, a fine-grained scheduling method that balances latency across tiers, forming a bubble-free vertical dataflow without on-chip SRAM roundtrips. Evaluations on Transformer workloads (OPT and QWEN models) show that our 3D spatial accelerator reduces 46-93% energy consumption and achieves 1.4x-7.6x speedups compared to state-of-the-art 2D and 3D designs.
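
The abstract notes that FlashAttention fuses operators so the full attention matrix is never materialized. As a rough illustration of that general idea only, the sketch below computes attention for a single query by streaming over key/value blocks with an online softmax. It is not the paper's 3D-FlashAttention schedule and models no hardware. The function names `naive_attention` and `tiled_attention` and the `block` parameter are hypothetical names chosen for this example.

```python
import math


def naive_attention(q, K, V):
    """Reference version: builds the full score row before the softmax."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]


def tiled_attention(q, K, V, block=2):
    """Online-softmax attention for one query, visiting K/V block by block.

    Only one block of scores exists at a time; earlier partial results are
    rescaled whenever a larger score is seen.
    """
    d = len(q)
    m = float("-inf")        # running maximum of all scores seen so far
    l = 0.0                  # running softmax denominator
    acc = [0.0] * len(V[0])  # running unnormalized output vector
    for start in range(0, len(K), block):
        Kb = K[start:start + block]
        Vb = V[start:start + block]
        s = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in Kb]
        m_new = max(m, max(s))
        scale = math.exp(m - m_new)  # 0.0 on the first block, since m is -inf
        p = [math.exp(x - m_new) for x in s]
        l = l * scale + sum(p)
        acc = [a * scale + sum(pi * v[j] for pi, v in zip(p, Vb))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]
```

Both functions should agree up to floating-point rounding for any block size. The tiled version keeps only a running maximum, a running denominator, and one output accumulator, which is the property that lets fused attention kernels avoid storing the full score matrix.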

arXiv Categories

cs.AR cs.AI
