🤖 AI Summary
This work addresses the challenges of multi-hour audio question answering, including limited context length, imprecise temporal localization, and hallucination. To overcome these issues, the authors propose LongAudio-RAG, a framework that first parses long-form audio into timestamped, structured acoustic event records. Relevant events are then retrieved through intent recognition and natural-language time parsing, and a large language model generates answers grounded in the retrieved event evidence. Rather than retrieving over raw audio or issuing text-to-SQL queries, the approach performs event-level structured retrieval, which enables edge-cloud collaborative deployment with low latency and improved accuracy. Experiments on a synthesized multi-hour audio benchmark show that LongAudio-RAG significantly outperforms conventional RAG and text-to-SQL methods across detection, counting, and summarization tasks, supporting the effectiveness of event-level retrieval.
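As a rough illustration of what timestamped event records and event-level retrieval might look like, the sketch below uses SQLite with assumed table and column names (`events`, `label`, `start_ts`, `end_ts`, `confidence`); the paper's actual schema and retrieval code are not given here, so treat this only as a plausible shape of the event store.

```python
import sqlite3

# Minimal sketch of a timestamped acoustic-event store and event-level
# retrieval. Table and column names are illustrative assumptions,
# not the schema used in LongAudio-RAG.
conn = sqlite3.connect("acoustic_events.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS events (
        id         INTEGER PRIMARY KEY,
        label      TEXT NOT NULL,   -- e.g. "dog_bark", "glass_break"
        start_ts   REAL NOT NULL,   -- seconds from the start of the recording
        end_ts     REAL NOT NULL,
        confidence REAL             -- detector score, if available
    )
    """
)

def insert_event(label, start_ts, end_ts, confidence=None):
    """Store one structured event record produced by the audio grounding model."""
    conn.execute(
        "INSERT INTO events (label, start_ts, end_ts, confidence) VALUES (?, ?, ?, ?)",
        (label, start_ts, end_ts, confidence),
    )
    conn.commit()

def retrieve_events(t0, t1, label=None):
    """Fetch only the events that fall inside a resolved time window."""
    query = "SELECT label, start_ts, end_ts FROM events WHERE start_ts >= ? AND end_ts <= ?"
    params = [t0, t1]
    if label is not None:
        query += " AND label = ?"
        params.append(label)
    return conn.execute(query, params).fetchall()
```

Keeping the store as a plain relational table means retrieval reduces to a bounded range query rather than a search over raw audio or a model-generated SQL statement.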
📝 Abstract
Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due to context-length limits. We introduce LongAudio-RAG (LA-RAG), a hybrid framework that grounds Large Language Model (LLM) outputs in retrieved, timestamped acoustic event detections rather than raw audio. Multi-hour streams are converted into structured event records stored in an SQL database, and at inference time the system resolves natural-language time references, classifies intent, retrieves only the relevant events, and generates answers using this constrained evidence. To evaluate performance, we construct a synthetic long-audio benchmark by concatenating recordings with preserved timestamps and generating template-based question-answer pairs for detection, counting, and summarization tasks. Finally, we demonstrate the practicality of our approach by deploying it in a hybrid edge-cloud environment, where the audio-grounding model runs on-device on IoT-class hardware while the LLM is hosted on a GPU-backed server. This architecture enables low-latency event extraction at the edge and high-quality language reasoning in the cloud. Experiments show that structured, event-level retrieval significantly improves accuracy compared with vanilla Retrieval-Augmented Generation (RAG) and text-to-SQL approaches.
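The inference-time steps named above (time resolution, intent classification, retrieval, constrained generation) could be wired together roughly as follows. All function names, intent labels, and the prompt format are illustrative assumptions rather than the paper's implementation, and the call to the cloud-hosted LLM is omitted.

```python
from dataclasses import dataclass

@dataclass
class Event:
    label: str
    start_ts: float  # seconds from the start of the recording
    end_ts: float

def resolve_time_reference(question: str, recording_length_s: float) -> tuple[float, float]:
    """Map a natural-language time phrase to an absolute window (deliberately simplistic)."""
    if "last hour" in question.lower():
        return max(0.0, recording_length_s - 3600.0), recording_length_s
    return 0.0, recording_length_s  # default: the whole recording

def classify_intent(question: str) -> str:
    """Toy intent classifier over the three benchmark task types."""
    q = question.lower()
    if "how many" in q or "count" in q:
        return "counting"
    if "summar" in q:
        return "summarization"
    return "detection"

def build_prompt(question: str, events: list[Event]) -> str:
    """Constrain the LLM to answer from the retrieved, timestamped evidence only."""
    evidence = "\n".join(f"- {e.label}: {e.start_ts:.1f}s to {e.end_ts:.1f}s" for e in events)
    return (
        "Answer the question using only the events listed below.\n"
        f"Events:\n{evidence}\n\nQuestion: {question}\nAnswer:"
    )

# Example flow: events come from the on-device grounding model; the prompt
# (plus the predicted intent) would then be sent to the GPU-backed LLM.
question = "How many dog barks were there in the last hour?"
events = [Event("dog_bark", 5400.0, 5402.5), Event("glass_break", 7150.0, 7151.2)]
t0, t1 = resolve_time_reference(question, recording_length_s=7200.0)
window = [e for e in events if t0 <= e.start_ts and e.end_ts <= t1]
print(classify_intent(question))
print(build_prompt(question, window))
```

In this split, the lightweight parts (event detection, time resolution, retrieval) stay near the device, and only the short evidence-constrained prompt crosses to the cloud, which is what keeps the prompt small enough to avoid the context-length limits the abstract describes.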