3rd Place in the MeViS-Audio Track of the 5th PVUW Challenge: VIRST-Audio
AI Summary
The VIRST-Audio model performs audio-referred video object segmentation under text supervision: an ASR module transcribes the audio query into text, and an existence-aware gating mechanism improves robustness by suppressing hallucinated masks. The approach achieved 3rd place in the MeViS-Audio track of the PVUW Challenge.
Key Contributions
- Proposes the VIRST-Audio framework, which combines a pretrained RVOS model with a vision-language architecture.
- Uses an ASR module to convert audio into text, enabling text-supervised segmentation.
- Introduces an existence-aware gating mechanism that improves robustness and reduces hallucinated masks.
Methodology
A pretrained RVOS model is used as the backbone. Input audio is transcribed into text via ASR, the text query drives the supervised segmentation, and an existence-aware gate suppresses masks for referred objects that are not present in the video.
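The pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `asr_transcribe` and `rvos_segment` are hypothetical stand-ins for a real ASR module and a pretrained text-referring VOS model, and the existence score is assumed to be provided by a separate head.

```python
import numpy as np

def asr_transcribe(audio: np.ndarray) -> str:
    """Stand-in for an ASR module; a real system would run speech recognition."""
    return "the person in the red jacket walking left"  # dummy transcript

def rvos_segment(frames: np.ndarray, text_query: str) -> np.ndarray:
    """Stand-in for a pretrained text-referring VOS model.
    Returns per-frame binary masks of shape (T, H, W)."""
    t, h, w = frames.shape[:3]
    masks = np.zeros((t, h, w), dtype=np.uint8)
    masks[:, h // 4 : 3 * h // 4, w // 4 : 3 * w // 4] = 1  # dummy central region
    return masks

def existence_gate(masks: np.ndarray, score: float, threshold: float = 0.5) -> np.ndarray:
    """Suppress every mask when the referred target is judged absent."""
    return masks if score >= threshold else np.zeros_like(masks)

def segment_from_audio(frames, audio, existence_score):
    query = asr_transcribe(audio)            # audio -> text
    masks = rvos_segment(frames, query)      # text-supervised segmentation
    return existence_gate(masks, existence_score)  # absence suppression

frames = np.zeros((4, 8, 8, 3), dtype=np.uint8)   # T x H x W x C dummy video
audio = np.zeros(16000, dtype=np.float32)          # 1 s of dummy audio

present = segment_from_audio(frames, audio, existence_score=0.9)
absent = segment_from_audio(frames, audio, existence_score=0.1)
print(present.sum() > 0, absent.sum() == 0)  # True True
```

The key design point is that no audio-specific training is needed: once the query is text, any pretrained RVOS model can be reused unchanged.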
Original Abstract
Audio-based Referring Video Object Segmentation (ARVOS) requires grounding audio queries into pixel-level object masks over time, posing challenges in bridging acoustic signals with spatio-temporal visual representations. In this report, we present VIRST-Audio, a practical framework built upon a pretrained RVOS model integrated with a vision-language architecture. Instead of relying on audio-specific training, we convert input audio into text using an ASR module and perform segmentation using text-based supervision, enabling effective transfer from text-based reasoning to audio-driven scenarios. To improve robustness, we further incorporate an existence-aware gating mechanism that estimates whether the referred target object is present in the video and suppresses predictions when it is absent, reducing hallucinated masks and stabilizing segmentation behavior. We evaluate our approach on the MeViS-Audio track of the 5th PVUW Challenge, where VIRST-Audio achieves 3rd place, demonstrating strong generalization and reliable performance in audio-based referring video segmentation.
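The existence-aware gating described in the abstract can be sketched in a few lines. This is an assumed formulation, not the paper's exact head: per-frame existence logits (from a hypothetical classifier) are pooled into a video-level presence probability, and all masks are zeroed when it falls below a threshold.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def gate_masks(masks: np.ndarray, frame_logits: np.ndarray, threshold: float = 0.5):
    """Existence-aware gating sketch: pool per-frame existence logits into a
    video-level presence score; suppress all masks when the referred target
    is judged absent, reducing hallucinated predictions."""
    presence = float(sigmoid(frame_logits).mean())  # video-level presence score
    if presence < threshold:
        return np.zeros_like(masks), presence       # target absent -> empty masks
    return masks, presence                          # target present -> keep masks

masks = np.ones((3, 4, 4), dtype=np.uint8)          # 3 frames of dummy masks
kept, p_hi = gate_masks(masks, np.array([2.0, 1.5, 3.0]))    # confident presence
dropped, p_lo = gate_masks(masks, np.array([-2.0, -3.0, -1.5]))  # confident absence
print(kept.sum(), dropped.sum())  # 48 0
```

Hard thresholding is the simplest choice; a soft variant could instead scale mask confidences by the presence probability.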