Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

📅 2026-04-15

🤖 AI Summary

Current large audio-language models perform poorly on fine-grained temporal tasks, such as inferring the precise start and end times of audio events, which limits their use in high-precision audio understanding. To address this, the work proposes TimePro-RL, a framework that pairs an audio-side time prompt with reinforcement learning. Timestamps are encoded as embeddings and interleaved with the audio feature sequence as temporal coordinates, and reinforcement learning is applied after supervised fine-tuning to directly optimize temporal alignment. Experiments show significant gains on audio grounding, sound event detection, and dense audio captioning.
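
To make "optimizing temporal alignment" concrete, here is a minimal sketch of one plausible reward signal: the temporal intersection-over-union (IoU) between a predicted and a reference event span. This summary does not state which reward the paper uses, so the IoU choice and the names `temporal_iou` and `alignment_reward` are assumptions for illustration only.

```python
def temporal_iou(pred, target):
    """IoU of two (start, end) spans given in seconds."""
    pred_start, pred_end = pred
    tgt_start, tgt_end = target
    # Overlap is zero when the spans do not intersect.
    intersection = max(0.0, min(pred_end, tgt_end) - max(pred_start, tgt_start))
    union = (pred_end - pred_start) + (tgt_end - tgt_start) - intersection
    return intersection / union if union > 0 else 0.0


def alignment_reward(pred, target):
    """Hypothetical RL reward: 0 for malformed spans, IoU otherwise."""
    if pred is None or pred[0] < 0 or pred[1] <= pred[0]:
        return 0.0
    return temporal_iou(pred, target)
```

A reward of this shape scores a prediction higher the more closely its onset and offset match the reference, which is the property a temporal alignment objective needs.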

📝 Abstract

Large Audio-Language Models (LALMs) enable general audio understanding and demonstrate remarkable performance across various audio tasks. However, these models still face challenges in temporal perception (e.g., inferring event onset and offset), leading to limited utility in fine-grained scenarios. To address this issue, we propose Audio-Side Time Prompt and leverage Reinforcement Learning (RL) to develop the TimePro-RL framework for fine-grained temporal perception. Specifically, we encode timestamps as embeddings and interleave them within the audio feature sequence as temporal coordinates to prompt the model. Furthermore, we introduce RL following Supervised Fine-Tuning (SFT) to directly optimize temporal alignment performance. Experiments demonstrate that TimePro-RL achieves significant performance gains across a range of audio temporal tasks, such as audio grounding, sound event detection, and dense audio captioning, validating its robust effectiveness.
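
The interleaving step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the sinusoidal encoding, the fixed prompt interval, and the names `timestamp_embedding` and `interleave_time_prompts` are all hypothetical, since the abstract only says that timestamps are encoded as embeddings and interleaved within the audio feature sequence.

```python
import math


def timestamp_embedding(t_seconds, dim):
    """Sinusoidal embedding of a timestamp (assumed encoding, not from the paper)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        emb.append(math.sin(t_seconds * freq))
        emb.append(math.cos(t_seconds * freq))
    return emb


def interleave_time_prompts(frames, frame_rate_hz, interval_s):
    """Insert one timestamp embedding ahead of each `interval_s` chunk of frames.

    `frames` is a list of equal-length feature vectors from an audio encoder.
    Returns the interleaved sequence and a parallel list of token kinds.
    """
    dim = len(frames[0])
    frames_per_chunk = max(1, round(frame_rate_hz * interval_s))
    sequence, kinds = [], []
    for start in range(0, len(frames), frames_per_chunk):
        t = start / frame_rate_hz  # time of the first frame in this chunk
        sequence.append(timestamp_embedding(t, dim))
        kinds.append(("time", t))
        for frame in frames[start:start + frames_per_chunk]:
            sequence.append(frame)
            kinds.append(("audio", None))
    return sequence, kinds


# Example: 2 seconds of 4-dimensional features at 5 frames per second,
# with a time prompt inserted every second.
frames = [[0.0] * 4 for _ in range(10)]
sequence, kinds = interleave_time_prompts(frames, frame_rate_hz=5, interval_s=1.0)
```

Because the time tokens share the dimensionality of the audio features, the interleaved sequence can be passed to the language model in place of the plain feature sequence, giving it explicit temporal coordinates on the audio side rather than in the text prompt.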

Problem

Research questions and friction points this paper is trying to address.

temporal perception

audio-language models

fine-grained audio understanding

event timing

time alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-Side Time Prompt

TimePro-RL

Temporal Perception