LLM Reasoning relevance: 7/10

Effective Distillation to Hybrid xLSTM Architectures

Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied, Anamaria-Roberta Hartl, David Stap, Pieter-Jan Hoedt, Maximilian Beck, Sebastian Böck, Günter Klambauer, Sepp Hochreiter
arXiv: 2603.15590v1 Published: 2026-03-16 Updated: 2026-03-16

AI Summary

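The paper proposes a knowledge distillation pipeline that distills Transformer LLMs into xLSTM-based students; in many settings the students recover most of the teacher's performance.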

Main Contributions

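  • Proposes a distillation pipeline tailored to xLSTM-based students
  • Introduces a merging stage that combines individually linearized expert models into a single model
  • Validates the pipeline on models from the Llama, Qwen, and Olmo families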
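The merging stage named above combines several individually linearized experts into one model. This summary does not say how the merge is performed, so the sketch below swaps in plain uniform parameter averaging as a stand-in; it is not the paper's procedure. The function name `merge_experts` and the dict-of-lists parameter layout are hypothetical and chosen only to keep the example dependency-free.

```python
from typing import Dict, List, Sequence

# Hypothetical layout: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_experts(experts: Sequence[Params]) -> Params:
    """Merge experts by uniform parameter averaging.

    ASSUMPTION: averaging is an illustrative stand-in; the source does
    not specify the merge rule used in the paper.
    """
    if not experts:
        raise ValueError("need at least one expert")

    names = set(experts[0])
    for expert in experts[1:]:
        if set(expert) != names:
            raise ValueError("experts must share the same parameter names")

    merged: Params = {}
    for name in sorted(names):
        tensors = [expert[name] for expert in experts]
        if any(len(t) != len(tensors[0]) for t in tensors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Average each weight position across all experts.
        merged[name] = [sum(vals) / len(experts) for vals in zip(*tensors)]
    return merged


if __name__ == "__main__":
    expert_a = {"w": [1.0, 2.0]}
    expert_b = {"w": [3.0, 4.0]}
    print(merge_experts([expert_a, expert_b]))
```

Averaging only makes sense when all experts share one architecture and parameter naming, which is why the sketch rejects mismatched names or lengths instead of guessing.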

Methodology

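The paper proposes a distillation pipeline with an added merging stage, measures distillation quality with tolerance-corrected Win-and-Tie rates between student and teacher, and validates the pipeline experimentally on several model families.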
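One plausible reading of a tolerance-corrected Win-and-Tie rate is the fraction of tasks on which the student scores at least the teacher's score minus a small tolerance. The sketch below implements that reading only; the exact definition, the tolerance value, and the name `win_and_tie_rate` are assumptions, and every score is assumed to be higher-is-better.

```python
from typing import Mapping


def win_and_tie_rate(
    student: Mapping[str, float],
    teacher: Mapping[str, float],
    tolerance: float = 0.01,  # ASSUMPTION: illustrative value, not from the paper
) -> float:
    """Fraction of shared tasks where the student wins or ties.

    A task counts as a win or tie when
    student_score >= teacher_score - tolerance.
    ASSUMPTION: all scores are higher-is-better.
    """
    tasks = sorted(set(student) & set(teacher))
    if not tasks:
        raise ValueError("student and teacher share no tasks")

    wins_or_ties = sum(
        1 for task in tasks if student[task] >= teacher[task] - tolerance
    )
    return wins_or_ties / len(tasks)


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    teacher_scores = {"task_a": 0.70, "task_b": 0.55, "task_c": 0.80}
    student_scores = {"task_a": 0.72, "task_b": 0.545, "task_c": 0.75}
    print(win_and_tie_rate(student_scores, teacher_scores))
```

Under this reading, a rate of 1.0 would mean the student is within tolerance of the teacher on every task, which corresponds to the lossless goal described in the abstract.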

Original Abstract

There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.

Tags

knowledge distillation, xLSTM, large language models, model compression

arXiv Categories

cs.LG