AI Agents Relevance: 9/10

DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining

Shangeth Rajaa
arXiv: 2603.08216v1 Published: 2026-03-09 Updated: 2026-03-09

AI Summary

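DualTurn uses generative pretraining on dual-channel conversational speech to improve a voice agent's turn-taking, anticipating turn boundaries earlier and interrupting less often.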

Key Contributions

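  • Proposes DualTurn, a model that learns turn-taking through generative pretraining on dual-channel conversational audio.
  • Predicts interpretable turn-taking signals that map directly to agent actions.
  • Outperforms VAP on agent action prediction (wF1 0.633 vs. 0.389) and a 3.1B audio-text model on word-level turn prediction (AUC 0.930 vs. 0.880).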
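
The abstract reports the agent action comparison as wF1. Assuming that abbreviation denotes support-weighted F1 (the abstract does not spell it out), the metric averages per-class F1 scores weighted by how often each class occurs in the reference labels. A minimal sketch follows; the label names `hold` and `shift` in the usage note are hypothetical placeholders, not the paper's action set.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)  # how often each class appears in the references
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its share of the data
    return score
```

For example, `weighted_f1(["hold", "hold", "shift", "shift"], ["hold", "shift", "shift", "shift"])` gives per-class F1 of 2/3 and 4/5, each weighted by 0.5.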

Methodology

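The model first learns the dynamics of dual-channel conversational speech through generative pretraining, then is fine-tuned to predict turn-taking signals.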

Original Abstract

Speech-to-speech models handle turn-taking naturally but offer limited support for tool-calling or complex reasoning, while production ASR-LLM-TTS voice pipelines offer these capabilities but rely on silence timeouts, which lead to unnatural turn-taking. We present DualTurn, which narrows this gap through generative pretraining on dual-channel conversational audio. The model generates both speakers' future audio autoregressively, implicitly learning conversational dynamics without any labels, and is then fine-tuned to predict interpretable turn-taking signals that map directly to agent actions. DualTurn monitors both channels continuously, anticipating turn boundaries and producing five agent actions. On standard benchmarks, DualTurn (0.5B) outperforms both VAP on agent action prediction (wF1 0.633 vs. 0.389) and a 3.1B audio-text model on word-level turn prediction (AUC 0.930 vs. 0.880), while anticipating turn boundaries earlier with fewer interruptions.
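
To make the contrast in the abstract concrete, here is a rough sketch of the silence-timeout endpointing that ASR-LLM-TTS pipelines are said to rely on, next to a generic predictive trigger. Everything in it is an illustrative assumption: the function names, the 20 ms frame size, the 700 ms timeout, and the thresholded end-of-turn probability are hypothetical and are not DualTurn's actual interface or settings.

```python
def silence_timeout_endpoint(vad_frames, frame_ms=20, timeout_ms=700):
    """Index of the frame where a silence-timeout policy lets the agent speak.

    vad_frames: iterable of booleans, True when the user is speaking.
    Returns None if the timeout is never reached.
    """
    needed = -(-timeout_ms // frame_ms)  # ceil: consecutive silent frames required
    silent_run = 0
    heard_speech = False
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech resets the timer
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


def predictive_endpoint(end_of_turn_probs, threshold=0.5):
    """Index of the first frame whose predicted end-of-turn probability
    reaches the threshold (hypothetical stand-in for a learned predictor)."""
    for i, p in enumerate(end_of_turn_probs):
        if p >= threshold:
            return i
    return None
```

With the timeout policy the agent always waits the full timeout after the last speech frame, and a timeout short enough to feel responsive can fire during a mid-utterance pause. A predictor that watches the audio continuously can, in principle, act before the silence has run its course, which is the behavior the abstract describes as anticipating turn boundaries.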

Tags

Speech Recognition, Speech Synthesis, Turn-taking, Dialogue Systems, Pretraining

arXiv Categories

eess.AS cs.CL cs.SD