Effective Distillation to Hybrid xLSTM Architectures
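AI Summary
This paper proposes an effective knowledge distillation method for distilling Transformer-based LLMs into xLSTM architectures. In many settings, the resulting students recover most of the teacher's performance.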
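Main Contributions
- Proposes a distillation pipeline tailored to xLSTM-based students
- Introduces a merging stage that combines individually linearized expert models into a single model
- Validates the pipeline on base and instruction-tuned models from the Llama, Qwen, and Olmo families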
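Methodology
The paper proposes a distillation pipeline that includes a merging stage, measures distillation quality with tolerance-corrected Win-and-Tie rates between student and teacher over sets of tasks, and validates the approach experimentally on several model families.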
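To make the evaluation metric concrete, here is a minimal sketch of one plausible reading of a tolerance-corrected Win-and-Tie rate. It assumes that a task counts as a win or tie when the student's score is no more than `tolerance` below the teacher's score. The function name `win_and_tie_rate`, the additive tolerance rule, and the example scores are illustrative assumptions, not the paper's exact definition or results.

```python
from typing import Dict


def win_and_tie_rate(
    student: Dict[str, float],
    teacher: Dict[str, float],
    tolerance: float = 0.0,
) -> float:
    """Fraction of shared tasks where the student wins or ties with the teacher.

    Assumption: a task is a win or tie if the student's score is at least
    the teacher's score minus `tolerance` (higher scores are better).
    """
    tasks = sorted(set(student) & set(teacher))
    if not tasks:
        raise ValueError("student and teacher share no tasks")
    wins_or_ties = sum(
        1 for task in tasks if student[task] >= teacher[task] - tolerance
    )
    return wins_or_ties / len(tasks)


# Made-up scores for three hypothetical tasks (not results from the paper).
teacher_scores = {"task_a": 0.70, "task_b": 0.55, "task_c": 0.80}
student_scores = {"task_a": 0.72, "task_b": 0.54, "task_c": 0.70}

strict_rate = win_and_tie_rate(student_scores, teacher_scores)
tolerant_rate = win_and_tie_rate(student_scores, teacher_scores, tolerance=0.02)
```

With these made-up numbers, the student wins or ties on one of three tasks under a strict comparison and on two of three once a tolerance of 0.02 absorbs the small gap on `task_b`. Under this reading, "lossless" distillation would correspond to a rate close to 1.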
Original Abstract
There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.
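The abstract does not say how the individually linearized experts are combined. As a rough illustration only, the sketch below uses plain weighted parameter averaging, a common model-merging technique that may differ from what the paper does. The parameter names, the `merge_experts` helper, and the toy values are all hypothetical, and each model is represented as a dictionary of flat float lists rather than real tensors.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical stand-in for a model's parameters: name -> flat list of floats.
StateDict = Dict[str, List[float]]


def merge_experts(
    experts: Sequence[StateDict],
    weights: Optional[Sequence[float]] = None,
) -> StateDict:
    """Combine experts into one model by weighted parameter averaging.

    Assumption: all experts share the same parameter names and shapes.
    If `weights` is omitted, every expert contributes equally.
    """
    if not experts:
        raise ValueError("need at least one expert")
    if weights is None:
        weights = [1.0 / len(experts)] * len(experts)
    if len(weights) != len(experts):
        raise ValueError("need exactly one weight per expert")

    reference = experts[0]
    for expert in experts:
        if set(expert) != set(reference):
            raise ValueError("experts have different parameter names")

    merged: StateDict = {}
    for name, values in reference.items():
        merged[name] = [
            sum(w * expert[name][i] for w, expert in zip(weights, experts))
            for i in range(len(values))
        ]
    return merged


# Toy parameters for two hypothetical experts.
expert_a = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}
expert_b = {"layer0.weight": [3.0, 4.0], "layer0.bias": [2.0]}

merged = merge_experts([expert_a, expert_b])
```

Here `merged` holds the element-wise mean of the two experts' parameters. Non-uniform `weights` would let one expert dominate, which is one simple knob for trading off the experts' strengths.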