AI Agents relevance: 8/10

Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

Siyuan Liu, Jiahui Xu, Feng Jiang, Kuang Wang, Zefeng Zhao, Chu-Ren Huang, Jinghang Gu, Changqing Yin, Haizhou Li
arXiv: 2602.23266v1 Published: 2026-02-26 Updated: 2026-02-26

AI Summary

The DDTSR framework significantly reduces the response latency of spoken dialogue systems through parallel processing and dynamic collaboration, while preserving dialogue quality.

Main Contributions

  • Proposes DDTSR, a low-latency framework for spoken dialogue systems
  • Introduces a connective-guided small-large model synergy mechanism
  • Designs a streaming-based cross-modal collaboration method

Methodology

A small model generates discourse connectives while a large model performs knowledge-intensive reasoning in parallel; ASR, LLM inference, and TTS run concurrently, and curriculum learning is used to strengthen discourse coherence.
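The dual-track idea can be sketched as two parallel tracks feeding a TTS queue: a fast track that emits a minimally committal connective at once, and a slow track that produces the reasoned answer. This is a minimal illustration, not the paper's implementation; the stub functions and timings are hypothetical stand-ins for the small and large models.

```python
import concurrent.futures
import time

def small_model_connective(user_utterance: str) -> str:
    """Fast track: emit a minimally committal discourse connective.

    Stand-in for the paper's auxiliary small model (hypothetical)."""
    return "Well,"  # near-instant; keeps the conversational floor

def large_model_reasoning(user_utterance: str) -> str:
    """Slow track: knowledge-intensive reasoning.

    Stand-in for the large LLM; the sleep simulates inference latency."""
    time.sleep(0.2)
    return "the capital of France is Paris."

def dual_track_respond(user_utterance: str) -> list:
    """Return speakable segments in the order TTS would receive them."""
    segments = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        # Launch the slow track first so it runs in the background.
        slow = pool.submit(large_model_reasoning, user_utterance)
        # The connective is speakable immediately, before reasoning ends.
        segments.append(small_model_connective(user_utterance))
        # Join the slow track and append the full reasoned response.
        segments.append(slow.result())
    return segments

print(dual_track_respond("What is the capital of France?"))
```

In a real system each segment would be streamed to TTS as it becomes available, so audio for the connective starts while the large model is still decoding.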

Original Abstract

Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems. Conventional ASR-LLM-TTS pipelines follow a strictly sequential paradigm, requiring complete transcription and full reasoning before speech synthesis can begin, which results in high response latency. We propose the Discourse-Aware Dual-Track Streaming Response (DDTSR) framework, a low-latency architecture that enables listen-while-thinking and speak-while-thinking. DDTSR is built upon three key mechanisms: (1) connective-guided small-large model synergy, where an auxiliary small model generates minimal-committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel; (2) streaming-based cross-modal collaboration, which dynamically overlaps ASR, LLM inference, and TTS to advance the earliest speakable moment; and (3) curriculum-learning-based discourse continuity enhancement, which maintains coherence and logical consistency between early responses and subsequent reasoning outputs. Experiments on two spoken dialogue benchmarks demonstrate that DDTSR reduces response latency by 19%-51% while preserving discourse quality. Further analysis shows that DDTSR functions as a plug-and-play module compatible with diverse LLM backbones, and remains robust across varying utterance lengths, indicating strong practicality and scalability for real-time spoken interaction.
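The latency benefit of overlapping the stages can be illustrated with a back-of-the-envelope model of time-to-first-audio. All per-stage timings below are hypothetical, not measurements from the paper, and the model ignores costs such as partial-ASR endpointing; the paper's reported 19%-51% figures concern end-to-end response latency on its benchmarks.

```python
# Hypothetical per-stage latencies in seconds (illustrative only).
t_asr_full = 0.6     # finish transcribing the whole utterance
t_llm_first = 0.8    # large-model time to first speakable token
t_tts_first = 0.2    # TTS time to first audio frame
t_connective = 0.05  # small model emits a discourse connective

# Sequential pipeline: each stage waits for the previous to finish,
# so first audio arrives only after ASR + LLM + TTS in series.
sequential = t_asr_full + t_llm_first + t_tts_first

# Dual-track streaming: the connective can be synthesized as soon as
# the small model fires, fully overlapping large-model reasoning.
dual_track = t_connective + t_tts_first

reduction = 1 - dual_track / sequential
print(f"sequential={sequential:.2f}s dual_track={dual_track:.2f}s "
      f"reduction={reduction:.0%}")
```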

Tags

Spoken dialogue systems  Low latency  Streaming  Multimodal  Knowledge reasoning

arXiv Categories

cs.CL