Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability
AI Summary
SFI accelerates decoding in long-text generation by decoupling fast and slow inference steps, with no additional training required.
Key Contributions
- Proposes Slow-Fast Inference (SFI), a training-free framework for accelerating inference
- Observes that the dominant attention support within a sentence remains largely stable
- Experimentally validates SFI's speedups in long-context and long-CoT settings
Methodology
SFI splits decoding into frequent low-cost fast steps and occasional dense-attention slow steps: fast steps reuse a compact sparse memory for efficient decoding, while slow steps, triggered near semantic boundaries, use a Selector to refresh that memory.
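The slow-fast loop can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: the boundary test, the `refresh_memory` stand-in for the Selector (here just recency-based truncation), and the model interface are all hypothetical.

```python
def is_semantic_boundary(token: str) -> bool:
    """Assumption: trigger a slow step near sentence-ending punctuation."""
    return token.endswith((".", "!", "?"))

def refresh_memory(history: list[str], budget: int) -> list[str]:
    """Stand-in for the Selector: keep a compact subset of the history.
    Here we naively keep the most recent `budget` tokens; the real
    Selector would pick tokens by attention support."""
    return history[-budget:]

def sfi_decode(model, prompt: list[str], max_new: int, budget: int = 8):
    """Decode `max_new` tokens, alternating fast and slow steps."""
    history = list(prompt)
    sparse_memory = refresh_memory(history, budget)  # initial selection
    slow_steps = fast_steps = 0
    for _ in range(max_new):
        if history and is_semantic_boundary(history[-1]):
            # Slow step: dense attention over the full history, then
            # refresh the selected memory for subsequent fast steps.
            token = model(history)
            sparse_memory = refresh_memory(history, budget)
            slow_steps += 1
        else:
            # Fast step: reuse the compact sparse memory only.
            token = model(sparse_memory)
            fast_steps += 1
        history.append(token)
        sparse_memory.append(token)
    return history, slow_steps, fast_steps
```

Because fast steps attend over only the small sparse memory, per-step cost stops growing with the full context length; the occasional slow steps bound the drift between the sparse memory and the true attention support.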
Original Abstract
Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately $1.6\times$--$14.4\times$ higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.