WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation
AI Summary
WavSLM enables single-stream speech language modeling by distilling WavLM representations, without requiring text supervision.
Main Contributions
- Proposes WavSLM, a single-stream speech language model
- Learns speech representations via WavLM distillation
- Requires no text supervision or text pretraining
Methodology
WavSLM quantizes WavLM representations into a single codebook and optimizes an autoregressive next-chunk prediction objective, jointly modeling acoustic and semantic information.
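The pipeline above can be sketched in miniature. This is an illustrative example, not the authors' implementation: the codebook size, chunk length, and random features are placeholders standing in for WavLM frame representations, and the quantizer is a plain nearest-neighbor lookup rather than the distillation-trained one described in the paper.

```python
# Hypothetical sketch: single-codebook quantization of continuous speech
# features into one token stream, plus next-chunk (input, target) pairs
# for an autoregressive prediction objective.
import numpy as np

def quantize(features, codebook):
    """Map each frame to the index of its nearest codebook vector (L2).

    features: (T, D) frame features; codebook: (K, D) -> tokens: (T,)
    """
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def next_chunk_pairs(tokens, chunk=4):
    """Split the token stream into fixed-size chunks; each chunk's
    training target is the chunk that follows it."""
    n = len(tokens) // chunk
    chunks = tokens[: n * chunk].reshape(n, chunk)
    return list(zip(chunks[:-1], chunks[1:]))

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))   # stand-in for WavLM frame features
book = rng.normal(size=(32, 8))    # a single codebook with 32 entries
toks = quantize(feats, book)       # one discrete token stream, shape (16,)
pairs = next_chunk_pairs(toks)     # 3 (input chunk, target chunk) pairs
```

In the actual model the codebook is learned by distilling WavLM, and the next-chunk targets train an autoregressive transformer; this sketch only shows the data flow from continuous features to a single token stream.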
Original Abstract
Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single-stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self-supervised WavLM representations into a single codebook and optimizing an autoregressive next-chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters, less training data, and supporting streaming inference. Demo samples are available at https://lucadellalib.github.io/wavslm-web/.