MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models
AI Summary
The paper introduces MUGEN, a benchmark for evaluating the multi-audio understanding of large audio-language models (LALMs), and proposes strategies to improve it.
Main Contributions
- Proposes the MUGEN benchmark
- Reveals weaknesses of LALMs in multi-audio understanding
- Proposes Audio-Permutational Self-Consistency and Chain-of-Thought strategies to improve performance
Methodology
The authors construct the MUGEN benchmark to evaluate LALMs in multi-audio scenarios, and explore training-free strategies to improve their performance.
Original Abstract
While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains. Combining this permutation strategy with Chain-of-Thought further improves performance to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
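The core idea of the permutation strategy, as described in the abstract, is to query the model under several orderings of the audio inputs and aggregate the answers into one prediction. A minimal sketch of that aggregation loop, assuming a hypothetical `model_answer(audios, question)` callable that wraps a LALM inference call (not an API from the paper):

```python
import itertools
from collections import Counter

def permutational_self_consistency(model_answer, audios, question, max_perms=6):
    """Sketch of Audio-Permutational Self-Consistency: ask the model the
    same question under several orderings of the audio inputs, then
    majority-vote over the answers.

    model_answer : hypothetical callable (audio_list, question) -> str
    audios       : list of audio inputs (paths, arrays, etc.)
    max_perms    : cap on how many orderings to try, since the number of
                   permutations grows factorially with the input count
    """
    votes = Counter()
    for perm in itertools.islice(itertools.permutations(audios), max_perms):
        votes[model_answer(list(perm), question)] += 1
    # most_common breaks ties by insertion order; a real system might
    # instead average answer log-probabilities across permutations
    return votes.most_common(1)[0][0]
```

Chain-of-Thought, which the abstract reports as complementary, would plug in at the prompt level inside `model_answer` (e.g. asking the model to reason step by step before answering) and needs no change to this voting loop.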