Multimodal Learning Relevance: 9/10

How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang, Zhehuai Chen, Sung-Feng Huang, Chih-Kai Yang, Yi-Cheng Lin, Chi-Yuan Hsiao, Wenze Ren, En-Pei Hu, Yu-Han Huang, An-Yu Cheng, Cheng-Han Chiang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee
arXiv: 2603.19195v1 Published: 2026-03-19 Updated: 2026-03-19

AI Summary

This paper studies how the auditory knowledge encoded in LLMs affects LALM performance, and presents a holistic evaluation.

Main Contributions

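  • Evaluates how much auditory knowledge different LLMs hold
  • Reveals a correlation between auditory knowledge acquired in text-only pre-training and LALM performance
  • Provides empirical grounding for the use of LLMs in audio research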

Methodology

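Compares LLMs through an auditory knowledge benchmark (AKB-2000), cascade evaluation, and audio-grounded evaluation, analyzing each model's auditory knowledge and its effect on LALM performance.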
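
The relationship between text-only and audio-grounded results can be illustrated with a rank correlation across models. The sketch below is only an illustration under stated assumptions: the summary does not say which correlation measure the authors use, so Spearman's rho is an assumed choice, and the model names and scores are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(rank(x), rank(y))


# Hypothetical placeholder scores (NOT from the paper), keyed by a made-up
# model name: accuracy in a text-only setting vs. the audio-grounded setting.
text_only = {"llm_a": 0.52, "llm_b": 0.61, "llm_c": 0.70, "llm_d": 0.66}
audio_grounded = {"llm_a": 0.40, "llm_b": 0.47, "llm_c": 0.58, "llm_d": 0.50}

models = sorted(text_only)
rho = spearman(
    [text_only[m] for m in models],
    [audio_grounded[m] for m in models],
)
print(f"Spearman rho across {len(models)} models: {rho:.2f}")
```

A rank-based measure is used here because it depends only on the ordering of the models, so it is unaffected by differences in score scale between the two settings.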

Original Abstract

Large language models (LLMs) have been widely used as knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training and how this affects downstream performance remains unclear. We study this gap by comparing different LLMs under two text-only and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions from an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into a Large Audio Language Model (LALM) with an audio encoder. Our findings reveal that auditory knowledge varies substantially across families, and text-only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.

Tags

LLM LALM Auditory Knowledge Multimodal Learning Audio Processing

arXiv Categories

eess.AS cs.CL cs.SD