Screening Is Enough
AI Summary
The paper proposes the Multiscreen architecture, which uses a screening mechanism to achieve absolute query–key relevance, reducing parameter count and inference latency.
Key Contributions
- Proposes the Multiscreen architecture and its screening mechanism
- Reduces parameter count and inference latency
- Improves long-context processing capability
Methodology
Builds the Multiscreen model, which replaces standard softmax attention with a screening mechanism: each key is evaluated against a set threshold, irrelevant keys are filtered out, and the remaining keys are aggregated.
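As a rough illustration of the idea (not the paper's exact formulation), a minimal sketch of threshold-based screening might look like the following. The scaled dot-product score, the sigmoid weighting of surviving keys, and the averaging step are all assumptions for illustration; the paper's actual scoring and aggregation may differ.

```python
import numpy as np

def screening_attention(q, K, V, tau=0.0):
    """Hypothetical sketch: judge each key against an absolute
    threshold instead of letting keys compete in a softmax."""
    # Score each key independently (scaled dot product).
    scores = K @ q / np.sqrt(K.shape[1])
    # Screening: keep only keys whose score clears the threshold.
    # There is no normalization across keys, so keys do not
    # compete for a shared unit of attention mass.
    keep = scores > tau
    if not keep.any():
        # Every key was rejected; return a zero context vector.
        return np.zeros(V.shape[1])
    # Aggregate surviving keys with independent per-key weights
    # (one plausible choice), then average.
    w = 1.0 / (1.0 + np.exp(-(scores[keep] - tau)))
    return (w[:, None] * V[keep]).sum(axis=0) / keep.sum()

q = np.array([1.0, 0.0])
K = np.array([[5.0, 0.0], [-5.0, 0.0]])   # key 1 is irrelevant to q
V = np.array([[1.0, 2.0], [100.0, 200.0]])
out = screening_attention(q, K, V)        # key 1 is discarded outright
```

Under softmax attention the irrelevant key would still receive a small but nonzero share of the unit attention mass; here it is rejected explicitly, which is what makes the notion of relevance absolute.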
Original Abstract
A core limitation of standard softmax attention is that it does not define a notion of absolute query–key relevance: attention weights are obtained by redistributing a fixed unit mass across all keys according to their relative scores. As a result, relevance is defined only relative to competing keys, and irrelevant keys cannot be explicitly rejected. We introduce Multiscreen, a language-model architecture built around a mechanism we call screening, which enables absolute query–key relevance. Instead of redistributing attention across all keys, screening evaluates each key against an explicit threshold, discarding irrelevant keys and aggregating the remaining keys, thereby removing global competition among keys. Across experiments, Multiscreen achieves comparable validation loss with approximately 40% fewer parameters than a Transformer baseline, enables stable optimization at substantially larger learning rates, maintains strong performance in long-context perplexity, shows little to no degradation in retrieval performance even far beyond the training context length, and reduces inference latency by up to 3.2× at 100K context length.