Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers
AI Summary
The paper proposes replacing the dense output projection in multi-head attention with a structured Hadamard transform to improve efficiency.
Key Contributions
- Reduces model parameter count
- Improves inference speed and memory efficiency
- Maintains or slightly improves downstream task performance
Methodology
The dense output projection is replaced with a fixed, parameter-free Hadamard transform followed by a lightweight learnable affine rescaling.
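The summary does not include reference code, so the following is a minimal PyTorch sketch of the idea under stated assumptions: the model dimension is a power of two (required by Sylvester's construction of the Hadamard matrix), and the class and parameter names (`HadamardOutputProjection`, `scale`, `shift`) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn


def hadamard_matrix(d: int) -> torch.Tensor:
    """Build a d x d Walsh-Hadamard matrix via Sylvester's construction.

    Assumes d is a power of two; scaled by 1/sqrt(d) so the transform is
    orthonormal (norm-preserving), matching the paper's description.
    """
    assert d > 0 and d & (d - 1) == 0, "d must be a power of two"
    H = torch.ones(1, 1)
    while H.shape[0] < d:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / d ** 0.5


class HadamardOutputProjection(nn.Module):
    """Hypothetical drop-in replacement for the dense output projection W_O:
    a fixed (non-learnable) Hadamard transform followed by a learnable
    per-channel affine rescaling."""

    def __init__(self, d_model: int):
        super().__init__()
        # Fixed transform: stored as a buffer, contributes no parameters.
        self.register_buffer("H", hadamard_matrix(d_model))
        # Lightweight learnable affine rescaling: 2 * d_model parameters.
        self.scale = nn.Parameter(torch.ones(d_model))
        self.shift = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -- the concatenated head outputs.
        # The dense Hadamard matmul mixes all channels, so information
        # still flows across heads despite removing W_O.
        return (x @ self.H) * self.scale + self.shift


proj = HadamardOutputProjection(512)
y = proj(torch.randn(2, 16, 512))  # same shape in and out: (2, 16, 512)
```

In a full attention block this module would simply stand in wherever a dense `nn.Linear(d_model, d_model)` is applied to the concatenated head outputs; a production implementation would likely use a fast Walsh-Hadamard transform (O(d log d)) rather than an explicit matmul.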
Original Abstract
The dense output projection in multi-head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacing this projection with a fixed, parameter-free Walsh-Hadamard Transform followed by a lightweight learnable affine rescaling, eliminating approximately 25 percent of attention parameters per block while preserving global cross-head interaction through an orthogonal, norm-preserving transformation. Across different model sizes, we demonstrate that this structured substitution maintains comparable or slightly superior downstream task performance on standard benchmarks, while achieving up to 7 percent aggregate parameter reduction, 8.9 percent peak memory savings, and 6.6 percent throughput improvement at scale, with efficiency gains growing monotonically with model size, batch size, and sequence length. Interestingly, we observe that structured Hadamard-based models exhibit a steeper validation loss curve relative to training FLOPs compared to their dense counterparts, suggesting more favorable compute utilization during training.
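As a sanity check on the "approximately 25 percent" figure: standard multi-head attention carries four dense d x d projections (W_Q, W_K, W_V, W_O), so dropping W_O and replacing it with a 2d-parameter affine rescaling removes just under a quarter of the attention parameters. A back-of-the-envelope sketch, assuming no projection biases; d_model = 1024 is an arbitrary illustrative choice:

```python
# Rough check of the "~25 percent of attention parameters" claim, assuming
# four dense d x d projections (W_Q, W_K, W_V, W_O) and no biases.
d_model = 1024
dense_params = 4 * d_model ** 2                   # W_Q, W_K, W_V, W_O
hadamard_params = 3 * d_model ** 2 + 2 * d_model  # W_O -> learnable scale + shift
print(f"removed: {1 - hadamard_params / dense_params:.1%}")  # -> 25.0%
```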