Multimodal Learning (relevance: 9/10)

When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse

Yihuan Huang, Jun Xue, Liu Jiajun, Daixian Li, Tong Zhang, Zhuolin Yi, Yanzhen Ren, Kai Li
arXiv: 2603.22915v1 · Published: 2026-03-24 · Updated: 2026-03-24

AI Summary

Addresses AVSR performance degradation in video conferencing by constructing the MLD-VC dataset, analyzing the causes of the degradation, and proposing a fine-tuning-based remedy.

Key Contributions

  • Constructed MLD-VC, the first multimodal AVSR dataset tailored for video conferencing
  • Analyzed the causes of AVSR performance degradation in video conferencing, including transmission distortion and human hyper-expression
  • Found that speech enhancement algorithms are the primary source of distribution shift and proposed a fine-tuning approach based on MLD-VC (see the formant sketch below)
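Since the reported distribution shift shows up in the first and second formants (F1/F2), a minimal way to inspect it is to compare formant statistics of the same utterance before and after an enhancement pipeline. The sketch below uses praat-parselmouth; the file names and analysis settings are illustrative assumptions, as the paper does not specify its formant-analysis tooling.

```python
# Hedged sketch: compare mean F1/F2 of an original recording vs. the same
# utterance after a speech-enhancement pipeline, using praat-parselmouth.
# File names and analysis settings are illustrative assumptions.
import numpy as np
import parselmouth                      # pip install praat-parselmouth
from parselmouth.praat import call


def mean_formants(wav_path, time_step=0.01):
    """Return mean F1 and F2 (Hz) over the voiced frames of a wav file."""
    snd = parselmouth.Sound(wav_path)
    formant = snd.to_formant_burg(time_step=time_step)
    times = np.arange(snd.xmin, snd.xmax, time_step)
    f1 = [call(formant, "Get value at time", 1, t, "Hertz", "Linear") for t in times]
    f2 = [call(formant, "Get value at time", 2, t, "Hertz", "Linear") for t in times]
    # Praat returns NaN for frames where no formant is defined; ignore those.
    return np.nanmean(f1), np.nanmean(f2)


f1_a, f2_a = mean_formants("clean.wav")      # original recording (assumed path)
f1_b, f2_b = mean_formants("enhanced.wav")   # after speech enhancement (assumed path)
print(f"F1 shift: {f1_b - f1_a:+.1f} Hz, F2 shift: {f2_b - f2_a:+.1f} Hz")
```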

Methodology

Construct the dataset, evaluate model performance across video conferencing platforms, analyze the impact of speech enhancement algorithms, and improve performance by fine-tuning models on the dataset.
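A minimal evaluation step along these lines could load MLD-VC from the Hugging Face Hub and score a candidate model with character error rate (CER). The sketch below is an assumption about the workflow, not the authors' code: the split name ("test") and column names ("audio", "text") are guesses about the dataset schema, and `recognize` is a placeholder for whatever AVSR model is under test.

```python
# Hedged sketch of the evaluation step: load MLD-VC and compute CER.
# Split and column names are assumptions about the dataset schema;
# `recognize` is a placeholder for an actual AVSR model.
from datasets import load_dataset   # pip install datasets
from jiwer import cer               # pip install jiwer


def recognize(sample) -> str:
    """Placeholder: run an AVSR model on one audio-visual sample."""
    raise NotImplementedError("plug in the AVSR model under test here")


ds = load_dataset("nccm2p2/MLD-VC", split="test")

references, hypotheses = [], []
for sample in ds:
    references.append(sample["text"])
    hypotheses.append(recognize(sample))

print(f"CER: {cer(references, hypotheses):.2%}")
```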

Original Abstract

Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct MLD-VC, the first multimodal dataset tailored for VC, comprising 31 speakers, 22.79 hours of audio-visual data, and explicit use of the Lombard effect to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift, which alters the first and second formants of audio. Interestingly, we find that the distribution shift induced by the Lombard effect closely resembles that introduced by speech enhancement, which explains why models trained on Lombard data exhibit greater robustness in VC. Fine-tuning AVSR models on MLD-VC mitigates this issue, achieving an average 17.5% reduction in CER across several VC platforms. Our findings and dataset provide a foundation for developing more robust and generalizable AVSR systems in real-world video conferencing. MLD-VC is available at https://huggingface.co/datasets/nccm2p2/MLD-VC.
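If the abstract's 17.5% figure is read as a relative reduction averaged over platforms, the arithmetic works out as in the snippet below; the CER values used there are hypothetical and not results from the paper.

```python
# Illustrative arithmetic only: the CER values here are hypothetical,
# and the "relative reduction" reading is an interpretation of the abstract.
cer_before = 0.200                 # CER on some VC platform before fine-tuning
relative_reduction = 0.175         # average reduction reported in the abstract
cer_after = cer_before * (1 - relative_reduction)
print(f"{cer_before:.1%} -> {cer_after:.1%}")   # 20.0% -> 16.5%
```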

Tags

Audio-Visual Speech Recognition · Video Conferencing · Multimodal Learning

arXiv Category

cs.CV