Multimodal Learning · Relevance: 9/10

MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

Chih-Kai Yang, Yun-Shao Tsai, Yu-Kai Guo, Ping-Le Tsai, Yen-Ting Piao, Hung-Wei Chen, Ting-Lin Hsiao, Yun-Man Hsu, Ke-Han Lu, Hung-yi Lee
arXiv: 2603.09714v1 · Published: 2026-03-10 · Updated: 2026-03-10

AI Summary

The paper introduces MUGEN, a benchmark for evaluating the multi-audio understanding of large audio-language models (LALMs), and proposes strategies to improve it.

Key Contributions

  • Introduces the MUGEN benchmark
  • Reveals consistent weaknesses of LALMs in multi-audio understanding
  • Proposes Audio-Permutational Self-Consistency and Chain-of-Thought strategies to improve performance

Methodology

Constructs the MUGEN benchmark to evaluate how LALMs perform in multi-audio scenarios, then explores training-free strategies to improve their performance.
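The Audio-Permutational Self-Consistency strategy can be sketched as follows: query the model once per ordering of the audio candidates and majority-vote over the answers. This is a minimal illustration, not the paper's implementation; the `toy_model` callable and all function names are hypothetical stand-ins for a real LALM inference call.

```python
from collections import Counter
from itertools import permutations

def audio_permutational_self_consistency(model, audios, question, max_perms=6):
    """Aggregate answers over different orderings of the audio inputs.

    `model` is any callable taking (ordered_audios, question) and
    returning an answer string; the final prediction is the most
    frequent answer across up to `max_perms` permutations.
    """
    votes = Counter()
    for i, order in enumerate(permutations(audios)):
        if i >= max_perms:
            break
        votes[model(list(order), question)] += 1
    answer, _ = votes.most_common(1)[0]
    return answer

# Toy stand-in model with an order bias: it answers "A" whenever
# clip_A comes first, and "B" otherwise. Voting over permutations
# washes out the positional bias and recovers "B".
def toy_model(ordered_audios, question):
    return "A" if ordered_audios[0] == "clip_A" else "B"

audios = ["clip_A", "clip_B", "clip_C"]
print(audio_permutational_self_consistency(toy_model, audios, "Which clip has speech?"))  # → B
```

The toy example shows why permuting helps: an order-sensitive model gives the biased answer on only the orderings where the biased position is filled, so the aggregated vote is more robust than any single ordering.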

Original Abstract

While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains. Combining this permutation strategy with Chain-of-Thought further improves performance to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.

Tags

LALM · Multi-audio understanding · Benchmark · Audio-Permutational Self-Consistency · Chain-of-Thought

arXiv Categories

cs.SD cs.AI cs.CL eess.AS