Agent Tuning & Optimization (relevance: 6/10)

Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference

Dimitrios Danopoulos, Enrico Lupi, Michael Kagan, Maurizio Pierini
arXiv: 2604.02292v1 Published: 2026-04-02 Updated: 2026-04-02

AI Summary

Proposes Head-Calibrated Clipped-Linear Softmax (HCCS), a fast approximation targeting the softmax computation bottleneck in Transformer models, which speeds up int8 inference while maintaining accuracy.

Key Contributions

  • Proposes Head-Calibrated Clipped-Linear Softmax (HCCS), a fast surrogate for the softmax function.
  • Optimizes HCCS for the int8 multiply-accumulate (MAC) units of the AMD Versal AI Engines.
  • Validates the speed and accuracy advantages of HCCS on small-model and heavily quantized MHA workloads.

Methodology

Calibration parameters are optimized offline per attention head; the softmax function is then approximated by applying a clipped linear mapping to the max-centered attention logits.
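The mapping above can be sketched as follows. This is a minimal NumPy illustration of the clipped-linear idea, not the paper's implementation: the per-head parameters `alpha` and `beta` are hypothetical stand-ins for the offline-calibrated parameters, and the exact parameterization of the clip in the paper may differ.

```python
import numpy as np

def hccs_softmax(logits, alpha=0.25, beta=1.0, axis=-1):
    """Sketch of a clipped-linear softmax surrogate (HCCS-style).

    alpha, beta are hypothetical per-head calibration parameters;
    the paper optimizes them offline on a representative dataset.
    """
    # Center logits on the row maximum, as in the standard softmax
    centered = logits - np.max(logits, axis=axis, keepdims=True)
    # Clipped linear surrogate for exp(): bounded, monotone, non-negative
    scores = np.clip(alpha * centered + beta, 0.0, beta)
    # Normalize so each row forms a valid probability distribution
    return scores / np.sum(scores, axis=axis, keepdims=True)
```

Because `centered` is always non-positive, the maximum logit maps to the upper bound `beta`, so the normalizer is never zero and the ordering of the original logits is preserved, matching the properties claimed in the abstract.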

Original Abstract

Softmax can become a computational bottleneck in the Transformer model's Multi-Head Attention (MHA) block, particularly in small models under low-precision inference, where exponentiation and normalization incur significant overhead. As such, we suggest using Head-Calibrated Clipped-Linear Softmax (HCCS), a bounded, monotone surrogate to the exponential softmax function, which uses a clipped linear mapping of the max centered attention logits. This approximation produces a stable probability distribution, maintains the ordering of the original logits and has non-negative values. HCCS differs from previous softmax surrogates as it includes a set of lightweight calibration parameters that are optimized offline based on a representative dataset and calibrated for each individual attention head to preserve the statistical properties of the individual heads. We describe a hardware-motivated implementation of HCCS for high-throughput scenarios targeting the AMD Versal AI Engines. The current reference implementations from AMD for this platform rely upon either bfloat16 arithmetic or LUTs to perform the exponential operation, which might limit the throughput of the platform and fail to utilize the high-throughput integer vector processing units of the AI Engine. In contrast, HCCS provides a natural mapping to the AI Engines' int8 multiply accumulate (MAC) units. To the best of our knowledge, this is the first int8 optimized softmax surrogate for AMD AI engines that significantly exceeds the speed performance of other reference implementations while maintaining competitive task accuracy on small or heavily quantized MHA workloads after quantization-aware retraining.

Tags

Softmax approximation · Quantization · Edge inference · Transformer · AMD AI Engines

arXiv Categories

cs.LG cs.AR