🤖 AI Summary
Existing large audio language models struggle to disentangle overlapping events in complex acoustic scenes, often producing temporally misaligned descriptions and hallucinations. This work proposes TAC, a novel timestamped audio captioning framework that leverages synthetically generated dynamic multi-source audio data to produce dense, temporally localized captions at multiple granularities. The approach is further extended to TAC-V, a joint audio-visual captioning model. By integrating temporally grounded caption generation, cross-modal semantic alignment between audio and video, and cascaded reasoning with large language models, the method significantly reduces hallucination rates and achieves precise temporal modeling. It attains state-of-the-art performance across multiple benchmarks for both audio and audio-visual understanding tasks.
📝 Abstract
Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions at varying degrees of detail and resolution. TAC is trained with a synthetic data pipeline that constructs challenging and dynamic mixtures from real-world audio sources, enabling robust learning under realistic polyphonic conditions. Across event detection and dense captioning, TAC outperforms all competing methods, with a low hallucination rate and accurate temporal grounding. We also introduce TAC-V, an audio-visual pipeline that generates semantically rich audio-visual descriptions. We then show that TAC and TAC-V serve as a "semantic bridge" for a text-only reasoner: simple TAC$\rightarrow$LLM and TAC-V$\rightarrow$LLM cascades achieve state-of-the-art scores on audio understanding and reasoning benchmarks (MMAU-Pro, MMSU, MMAR) and audio-visual benchmarks (DailyOmni, VideoHolmes), respectively.
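The "semantic bridge" idea in the abstract can be illustrated with a minimal sketch: timestamped captions from an audio model are serialized into a text prompt that any text-only LLM can reason over. The function names and the `(start, end, text)` caption format below are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of a TAC -> LLM cascade: dense, timestamped audio
# captions become the textual context for a downstream text-only reasoner.
# Names and formats here are assumptions for illustration only.

def format_timestamped_captions(captions):
    """Render (start_s, end_s, text) tuples as prompt lines like '[0.0-2.5] dog barking'."""
    return "\n".join(f"[{s:.1f}-{e:.1f}] {t}" for s, e, t in captions)

def build_reasoning_prompt(captions, question):
    """Bridge audio into text: timestamped events become LLM context."""
    return (
        "Timestamped audio events:\n"
        + format_timestamped_captions(captions)
        + f"\n\nQuestion: {question}\nAnswer:"
    )

captions = [
    (0.0, 2.5, "dog barking"),
    (1.0, 4.0, "car passing"),
    (3.5, 5.0, "man shouting"),
]
prompt = build_reasoning_prompt(captions, "Which events overlap in time?")
# `prompt` would then be sent to any text-only LLM for reasoning.
```

The design point is that temporal grounding survives the modality switch: overlap questions become answerable from the interval annotations alone, without the LLM ever touching audio.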