Multimodal Learning Relevance: 9/10

Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data

Pol Buitrago, Pol Gàlvez, Oriol Pareras, Javier Hernando
arXiv: 2603.08249v1 Published: 2026-03-09 Updated: 2026-03-09

AI Summary

Proposes a framework for audiovisual speech recognition in zero-AV-resource settings that relies on synthetic visual data.

Key Contributions

  • Proposes an audiovisual speech recognition framework for zero-AV-resource settings
  • Generates synthetic visual streams by lip-syncing static facial images with real audio
  • Validates the method's effectiveness on Spanish and Catalan

Methodology

Synthetic videos are generated from static facial images and real audio, and a pre-trained AV-HuBERT model is fine-tuned on the resulting audiovisual data.
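To make the data-generation step concrete, here is a minimal sketch (not the authors' code) of the zero-AV-resource pipeline: each real audio clip from an audio-only corpus is paired with a static face image and lip-synced into a synthetic talking-head video, yielding audiovisual training triples for fine-tuning. The `lip_sync` function below is a hypothetical stand-in for an actual lip-syncing model, and the 25 fps frame rate is an assumption matching AV-HuBERT's standard visual-stream preprocessing.

```python
# Illustrative sketch of turning an audio-only corpus into synthetic
# audiovisual training data. `lip_sync` is a hypothetical placeholder
# for a real talking-head generation model.

from dataclasses import dataclass

FPS = 25  # assumed visual frame rate (AV-HuBERT convention)

@dataclass
class AVSample:
    video_frames: int     # length of the synthetic lip-synced video
    audio_seconds: float  # duration of the real audio clip
    transcript: str       # existing ASR-style label

def lip_sync(face_image: str, audio_seconds: float) -> int:
    """Hypothetical lip-syncing step: stands in for generating a
    talking-head video; returns only the resulting frame count."""
    return round(audio_seconds * FPS)

def build_synthetic_corpus(audio_corpus, face_images):
    """Pair each (duration, transcript) clip with a face identity and
    synthesize its visual stream, producing AVSR training triples."""
    samples = []
    for i, (audio_seconds, transcript) in enumerate(audio_corpus):
        face = face_images[i % len(face_images)]  # cycle over identities
        frames = lip_sync(face, audio_seconds)
        samples.append(AVSample(frames, audio_seconds, transcript))
    return samples

corpus = [(3.2, "hola món"), (1.0, "bon dia")]
faces = ["face_a.png", "face_b.png"]
data = build_synthetic_corpus(corpus, faces)
print([s.video_frames for s in data])  # → [80, 25]
```

The resulting triples play the role that real labeled video corpora play in conventional AVSR training; at scale (700+ hours in the paper) they substitute for recordings that do not exist for the target language.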

Original Abstract

Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with much fewer parameters and training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable synthetic video thus offers a viable substitute for real recordings in zero-AV-resource AVSR.

Tags

Audiovisual Speech Recognition AVSR Synthetic Data Lip Syncing

arXiv Categories

eess.AS cs.CL eess.IV