Multimodal Learning Relevance: 8/10

Spatial Audio Question Answering and Reasoning on Dynamic Source Movements

Arvind Krishna Sridhar, Yinyi Guo, Erik Visser
arXiv: 2602.16334v1 Published: 2026-02-18 Updated: 2026-02-18

AI Summary

This paper studies spatial audio question answering in scenes with dynamically moving sound sources and proposes corresponding solutions.

Key Contributions

  • Introduces a movement-centric spatial audio augmentation framework
  • Designs an end-to-end multimodal finetuning approach with a thinking mode
  • Investigates the impact of query-conditioned source separation as a preprocessing stage

Methodology

Through synthetic data augmentation, multimodal finetuning, and source separation, the work improves a model's ability to understand and reason about spatial audio from dynamically moving sources.
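The augmentation step can be pictured as rendering a moving source from a static mono event. As a minimal sketch only: the paper's summary does not specify its synthesis method, so the function below assumes a simple constant-power (sine/cosine) panning law with a linear azimuth sweep; the function name and parameters are illustrative, not from the paper.

```python
import numpy as np

def pan_moving_source(mono, start_angle=-45.0, end_angle=45.0):
    """Render a mono event as stereo whose apparent azimuth sweeps
    linearly from start_angle to end_angle (degrees, -90 = hard left).
    Constant-power panning: an assumed stand-in for the paper's
    movement-centric augmentation, which is not detailed here."""
    n = len(mono)
    angles = np.linspace(start_angle, end_angle, n)
    pan = (angles + 90.0) / 180.0          # map [-90, 90] deg -> [0, 1]
    theta = pan * np.pi / 2.0
    left = mono * np.cos(theta)            # constant-power channel gains
    right = mono * np.sin(theta)
    return np.stack([left, right], axis=-1)  # shape (n, 2)

# Example: a 1 s, 440 Hz tone sweeping left -> right at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
stereo = pan_moving_source(tone)
```

Varying the sweep endpoints, speed, and event onset over a bank of isolated mono clips gives the kind of controlled, scalable motion-pattern generation the contribution list describes.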

Original Abstract

Spatial audio understanding aims to enable machines to interpret complex auditory scenes, particularly when sound sources move over time. In this work, we study Spatial Audio Question Answering (Spatial AQA) with a focus on movement reasoning, where a model must infer object motion, position, and directional changes directly from stereo audio. First, we introduce a movement-centric spatial audio augmentation framework that synthesizes diverse motion patterns from isolated mono audio events, enabling controlled and scalable training data generation. Second, we propose an end-to-end multimodal finetuning approach with a thinking mode, which allows audio-language models to produce explicit intermediate reasoning steps before predicting an answer. Third, we investigate the impact of query-conditioned source separation as a preprocessing stage and compare three inference regimes: no masking, an audio grounding model (AGM), and ground-truth masks. Our results show that reasoning amplifies the benefits of source separation, with thinking mode showing significant improvement of +5.1% when a single event is present in the question. These findings highlight the interplay between movement modeling, reasoning, and separation quality, offering new insights for advancing spatial audio understanding.

Tags

Spatial Audio  Question Answering  Movement Reasoning  Multimodal Learning  Source Separation

arXiv Categories

cs.SD cs.AI