LLM Memory & RAG relevance: 9/10

LongAudio-RAG: Event-Grounded Question Answering over Multi-Hour Long Audio

Naveen Vakada, Kartik Hegde, Arvind Krishna Sridhar, Yinyi Guo, Erik Visser
arXiv: 2602.14612v1 Published: 2026-02-16 Updated: 2026-02-16

AI Summary

Proposes LongAudio-RAG, a framework that performs RAG over acoustic event-detection results rather than raw audio, improving question-answering performance on long recordings.

Key Contributions

  • Proposed the LongAudio-RAG framework
  • Constructed a synthetic long-audio question-answering dataset
  • Deployed the framework in an edge-cloud environment and validated its practicality

Methodology

Build a database of timestamped event records, resolve natural-language time references, classify query intent, retrieve only the relevant events, and generate answers with an LLM grounded in that evidence.
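The pipeline above can be sketched as a small Python program. This is a minimal illustration under assumptions not stated in the paper: a simple SQLite schema for event records, and keyword-based stand-ins for the time resolver and intent classifier (the paper's actual audio grounding model and LLM prompting are not reproduced here).

```python
import sqlite3

def build_event_db(events):
    """Store timestamped acoustic event detections in an in-memory SQLite DB.

    events: iterable of (label, start_s, end_s, score) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (label TEXT, start_s REAL, end_s REAL, score REAL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", events)
    return conn

def resolve_time(query, total_s):
    """Toy time-reference resolver: map a phrase to a (start, end) window."""
    if "first hour" in query.lower():
        return 0.0, 3600.0
    if "last hour" in query.lower():
        return max(0.0, total_s - 3600.0), total_s
    return 0.0, total_s  # default: the whole recording

def classify_intent(query):
    """Toy intent classifier over the paper's three task types."""
    q = query.lower()
    if "how many" in q:
        return "counting"
    if "summar" in q:
        return "summarization"
    return "detection"

def retrieve(conn, label, window):
    """Fetch only the events matching the label inside the time window."""
    start, end = window
    return conn.execute(
        "SELECT label, start_s, end_s FROM events "
        "WHERE label = ? AND start_s >= ? AND end_s <= ? ORDER BY start_s",
        (label, start, end),
    ).fetchall()

def answer(query, conn, total_s, label):
    """Resolve time, classify intent, retrieve events, then answer."""
    window = resolve_time(query, total_s)
    intent = classify_intent(query)
    events = retrieve(conn, label, window)
    if intent == "counting":
        return f"{len(events)} '{label}' events in the selected window."
    # For detection/summarization, the retrieved rows would be formatted
    # into the LLM prompt as constrained evidence.
    return events
```

The key design point this illustrates is that the LLM never sees raw audio: retrieval narrows the evidence to a handful of structured rows, which bounds context length and limits hallucination.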

Original Abstract

Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due to context-length limits. We introduce LongAudio-RAG (LA-RAG), a hybrid framework that grounds Large Language Model (LLM) outputs in retrieved, timestamped acoustic event detections rather than raw audio. Multi-hour streams are converted into structured event records stored in an SQL database, and at inference time the system resolves natural-language time references, classifies intent, retrieves only the relevant events, and generates answers using this constrained evidence. To evaluate performance, we construct a synthetic long-audio benchmark by concatenating recordings with preserved timestamps and generating template-based question-answer pairs for detection, counting, and summarization tasks. Finally, we demonstrate the practicality of our approach by deploying it in a hybrid edge-cloud environment, where the audio grounding model runs on-device on IoT-class hardware while the LLM is hosted on a GPU-backed server. This architecture enables low-latency event extraction at the edge and high-quality language reasoning in the cloud. Experiments show that structured, event-level retrieval significantly improves accuracy compared to vanilla Retrieval-Augmented Generation (RAG) or text-to-SQL approaches.
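The abstract's benchmark construction (concatenating recordings with preserved timestamps, then generating template-based QA pairs) can be sketched as follows. The clip format and the question template here are illustrative assumptions, not the paper's actual templates.

```python
def concatenate(clips):
    """Concatenate clips while carrying timestamps onto a global timeline.

    clips: list of (duration_s, [(label, offset_s), ...]) pairs.
    Returns (timeline of (label, absolute_time_s), total duration).
    """
    timeline, cursor = [], 0.0
    for duration, events in clips:
        timeline.extend((label, cursor + off) for label, off in events)
        cursor += duration
    return timeline, cursor

def counting_qa(timeline, label, start_s, end_s):
    """Emit one template-based question-answer pair for the counting task."""
    n = sum(1 for lab, t in timeline if lab == label and start_s <= t < end_s)
    question = (
        f"How many '{label}' events occurred "
        f"between {start_s:.0f}s and {end_s:.0f}s?"
    )
    return question, str(n)
```

Because the answers are computed directly from the preserved timestamps, the resulting QA pairs come with exact ground truth for detection, counting, and time-window questions.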

Tags

long audio, question answering, RAG, event detection

arXiv Categories

eess.AS cs.AI cs.LG