Multimodal Learning — Relevance: 8/10

Simultaneous Speech-to-Speech Translation Without Aligned Data

Tom Labiausse, Romain Fabre, Yannick Estève, Alexandre Défossez, Neil Zeghidour
arXiv: 2602.11072v1 · Published: 2026-02-11 · Updated: 2026-02-11

AI Summary

Hibiki-Zero performs simultaneous speech translation without word-level aligned data, and uses reinforcement learning to optimize latency.

Key Contributions

  • Proposes a speech translation method that requires no word-level aligned data
  • Uses GRPO to optimize latency while preserving translation quality
  • Achieves state-of-the-art performance on multiple X-to-English tasks

Methodology

First, a high-latency speech translation model is trained on sentence-level aligned data; then a GRPO-based reinforcement learning strategy optimizes latency while preserving translation quality.
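The GRPO stage can be illustrated with a minimal sketch of group-relative advantage computation under a latency-penalized reward. The reward shape (`quality - lam * latency`), the weight `lam`, and all function names below are assumptions for illustration, not details taken from the paper; GRPO's core idea of normalizing each sampled completion's reward against its group is standard.

```python
# Hypothetical GRPO-style advantage computation for latency optimization.
# The reward trades translation quality against output latency; the exact
# reward design used by Hibiki-Zero is not specified here.

from statistics import mean, pstdev

def latency_reward(quality: float, latency_s: float, lam: float = 0.1) -> float:
    """Assumed reward: translation quality penalized by latency in seconds."""
    return quality - lam * latency_s

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO normalizes each sampled completion's reward within its group:
    advantage = (r - group_mean) / (group_std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled translations of the same source segment,
# each with a (quality, latency) pair. Lower latency lifts the reward.
group = [latency_reward(q, t) for q, t in [(0.9, 2.0), (0.8, 1.0), (0.85, 1.5), (0.7, 0.5)]]
advantages = grpo_advantages(group)
```

Because advantages are centered within the group, faster samples of comparable quality get positive advantage and are reinforced, which is how latency can be reduced without a word-aligned supervision signal.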

Original Abstract

Simultaneous speech translation requires translating source speech into a target language in real-time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale and thus depends on synthetic alignments using language-specific heuristics that are suboptimal. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be adapted to support a new input language with less than 1000h of speech. We provide examples, model weights, inference code and we release a benchmark containing 45h of multilingual data for speech translation evaluation.

Tags

Speech Translation · Simultaneous Translation · Reinforcement Learning · Unsupervised Learning · Low Latency

arXiv Categories

cs.CL cs.SD eess.AS