Multimodal Learning Relevance: 9/10

LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang
arXiv: 2603.19217v1 Published: 2026-03-19 Updated: 2026-03-19

AI Summary

Proposes LVOmniBench, a benchmark for evaluating how well OmniLLMs understand long-form audio and video.

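Main Contributions

  • Introduces the LVOmniBench benchmark dataset, comprising 275 long videos and 1,014 QA pairs
  • Shows that existing OmniLLMs struggle when processing long audio-video inputs
  • Provides experimental results and analysis for long audio-video understanding evaluation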

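Methodology

Builds a dataset of long videos with associated QA pairs, then uses it to evaluate the performance of existing OmniLLMs and analyze their capabilities in areas such as long-term memory and temporal localization.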
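As a rough illustration of the evaluation step, the sketch below computes overall and per-category accuracy over a list of QA records. The record fields (`category`, `answer`, `prediction`), the toy data, and the case-insensitive exact-match scoring are all assumptions made for illustration; they are not the benchmark's actual schema, scoring rule, or results.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return (overall_accuracy, {category: accuracy}) using exact-match scoring.

    Each record is a dict with hypothetical keys: 'category', 'answer', 'prediction'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["category"]] += 1
        # Assumed scoring rule: case-insensitive exact match after trimming whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[rec["category"]] += 1
    per_category = {c: correct[c] / total[c] for c in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_category


# Hypothetical toy records, not taken from LVOmniBench.
records = [
    {"category": "long-term memory", "answer": "B", "prediction": "B"},
    {"category": "long-term memory", "answer": "A", "prediction": "C"},
    {"category": "temporal localization", "answer": "D", "prediction": "d "},
    {"category": "multimodal perception", "answer": "A", "prediction": "B"},
]

overall, per_category = accuracy_by_category(records)
print(f"overall: {overall:.2f}")
for name, acc in per_category.items():
    print(f"{name}: {acc:.2f}")
```

Reporting accuracy per category as well as overall makes it easier to see whether a model's errors concentrate in one capability, such as temporal localization, rather than being spread evenly.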

Original Abstract

Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas the Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.

Tags

long-audio-video OmniLLM multimodal-learning benchmark

arXiv Categories

cs.CV