DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining
AI Summary
Through generative pretraining on dual-channel speech, DualTurn improves agents' turn-taking in voice interaction and reduces interruptions.
Main Contributions
- Proposes DualTurn, a model that learns turn-taking through dual-channel generative speech pretraining.
- DualTurn predicts turn-taking signals that translate directly into agent actions.
- DualTurn outperforms existing models on turn-taking prediction.
Methodology
The model first learns the dynamics of dual-channel conversational speech through generative pretraining, then is fine-tuned to predict turn-taking signals.
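To make the two-stage recipe concrete, here is a minimal PyTorch sketch, not the paper's implementation: the class name, hyperparameters, and tokenization scheme are assumptions, and the paper's actual backbone and loss details are not specified here. Stage 1 is unlabeled next-token prediction over interleaved dual-channel audio tokens; stage 2 adds a small head that scores five per-frame agent actions.

```python
import torch
import torch.nn as nn

class DualChannelLM(nn.Module):
    """Sketch: autoregressive model over interleaved dual-channel audio tokens.

    Assumes both speakers' codec tokens are interleaved into one sequence;
    the real tokenizer and interleaving scheme are modeling assumptions here.
    """
    def __init__(self, vocab_size=1024, d_model=512, n_layers=8, n_heads=8, max_len=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)  # stage 1: next-token prediction
        self.turn_head = nn.Linear(d_model, 5)         # stage 2: five agent actions

    def forward(self, tokens):
        T = tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T)  # causal mask
        pos = torch.arange(T, device=tokens.device)
        h = self.backbone(self.embed(tokens) + self.pos(pos), mask=mask)
        return self.lm_head(h), self.turn_head(h)

def pretrain_loss(model, tokens):
    # Stage 1: generate both speakers' future audio tokens; no labels needed.
    logits, _ = model(tokens[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))

def finetune_loss(model, tokens, action_labels):
    # Stage 2: supervised per-frame prediction of turn-taking signals.
    _, action_logits = model(tokens)
    return nn.functional.cross_entropy(
        action_logits.reshape(-1, 5), action_labels.reshape(-1))
```

The key design point the abstract emphasizes is that stage 1 requires no turn-taking labels: conversational dynamics are learned implicitly from the generative objective, and the fine-tuning stage only has to map the learned representation to interpretable signals.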
Original Abstract
Speech-to-speech models handle turn-taking naturally but offer limited support for tool-calling or complex reasoning, while production ASR-LLM-TTS voice pipelines offer these capabilities but rely on silence timeouts, which lead to unnatural turn-taking. We present DualTurn, which narrows this gap through generative pretraining on dual-channel conversational audio. The model generates both speakers' future audio autoregressively, implicitly learning conversational dynamics without any labels, and is then fine-tuned to predict interpretable turn-taking signals that map directly to agent actions. DualTurn monitors both channels continuously, anticipating turn boundaries and producing five agent actions. On standard benchmarks, DualTurn (0.5B) outperforms both VAP on agent action prediction (wF1 0.633 vs. 0.389) and a 3.1B audio-text model on word-level turn prediction (AUC 0.930 vs. 0.880), while anticipating turn boundaries earlier with fewer interruptions.
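For illustration, a hedged sketch of how per-frame action scores could drive a streaming agent, assuming the `DualChannelLM` interface from the sketch above. The five action labels are hypothetical placeholders; the abstract states there are five agent actions but does not name them here.

```python
import torch

# Hypothetical labels for the five agent actions (placeholders, not the paper's).
ACTIONS = ["wait", "backchannel", "take_turn", "yield_turn", "stop_speaking"]

@torch.no_grad()
def run_agent(model, token_stream, dispatch, max_context=4096):
    """Continuously monitor both channels and emit one action per frame.

    `model(tokens)` is assumed to return (lm_logits, action_logits) with
    action_logits shaped (batch, time, 5), matching the training sketch.
    """
    context = []
    for user_tok, agent_tok in token_stream:  # one codec token per channel per frame
        context += [user_tok, agent_tok]      # interleave the two channels
        context = context[-max_context:]      # bounded streaming window
        tokens = torch.tensor([context])
        _, action_logits = model(tokens)
        action = ACTIONS[int(action_logits[0, -1].argmax())]
        dispatch(action)                      # e.g. start TTS, stop playback, keep listening
```

Because the model scores an action at every frame rather than waiting for a silence timeout, it can anticipate turn boundaries early, which is what the reported reduction in interruptions reflects.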