MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models
AI Summary
The paper introduces MUGEN, a benchmark for evaluating the multi-audio understanding of large audio-language models (LALMs), and proposes strategies to improve it.
Main Contributions
- Proposes the MUGEN benchmark
- Reveals weaknesses of LALMs in multi-audio understanding
- Proposes Audio-Permutational Self-Consistency and Chain-of-Thought strategies to improve performance
Methodology
The authors construct the MUGEN benchmark to evaluate LALMs in multi-audio scenarios, and explore training-free strategies to improve their performance.
Original Abstract
While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains. Combining this permutation strategy with Chain-of-Thought further improves performance to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
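The core idea of the permutation strategy, as described in the abstract, is to query the model under several orderings of the audio inputs and aggregate the answers into one prediction. A minimal sketch of that aggregation loop, assuming a hypothetical `model_answer(audios, question)` callable that wraps a LALM inference call (not an API from the paper):

```python
import itertools
from collections import Counter

def permutational_self_consistency(model_answer, audios, question, max_perms=6):
    """Sketch of Audio-Permutational Self-Consistency: ask the model the
    same question under several orderings of the audio inputs, then
    majority-vote over the answers.

    model_answer : hypothetical callable (audio_list, question) -> str
    audios       : list of audio inputs (paths, arrays, etc.)
    max_perms    : cap on how many orderings to try, since the number of
                   permutations grows factorially with the input count
    """
    votes = Counter()
    for perm in itertools.islice(itertools.permutations(audios), max_perms):
        votes[model_answer(list(perm), question)] += 1
    # most_common breaks ties by insertion order; a real system might
    # instead average answer log-probabilities across permutations
    return votes.most_common(1)[0][0]
```

Chain-of-Thought, which the abstract reports as complementary, would plug in at the prompt level inside `model_answer` (e.g. asking the model to reason step by step before answering) and needs no change to this voting loop.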