Hierarchical Activity Recognition and Captioning from Long-Form Audio

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches struggle to model long-duration, hierarchically structured audio activities in real-world scenarios, as most are limited to short clips and isolated events. To address this gap, this work proposes MultiAct—the first benchmark for multi-level activity understanding in long-form kitchen audio—featuring three-tier semantic annotations (activities, sub-activities, and events) along with fine-grained descriptions and high-level summaries. The authors further introduce a unified hierarchical multi-task model that jointly performs multi-level classification, temporal detection, sequence prediction, and multi-resolution text generation. Comprehensive experiments on MultiAct establish strong baselines and reveal key challenges in capturing the hierarchical and compositional structure inherent in long audio sequences, offering new directions for future research.

📝 Abstract
Complex activities in real-world audio unfold over extended durations and exhibit hierarchical structure, yet most prior work focuses on short clips and isolated events. To bridge this gap, we introduce MultiAct, a new dataset and benchmark for multi-level structured understanding of human activities from long-form audio. MultiAct comprises long-duration kitchen recordings annotated at three semantic levels (activities, sub-activities and events) and paired with fine-grained captions and high-level summaries. We further propose a unified hierarchical model that jointly performs classification, detection, sequence prediction and multi-resolution captioning. Experiments on MultiAct establish strong baselines and reveal key challenges in modelling the hierarchical and compositional structure of long-form audio. A promising direction for future work is the exploration of methods better suited to capturing complex, long-range relationships in long-form audio.
Problem

Research questions and friction points this paper is trying to address.

hierarchical activity recognition
long-form audio
audio captioning
complex activities
multi-level understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical activity recognition
long-form audio
multi-resolution captioning
structured understanding
unified hierarchical model
Peng Zhang
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, U.K.
Qingyu Luo
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, U.K.
Philip J. B. Jackson
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, U.K.
Wenwu Wang
Professor, University of Surrey, UK
signal processing, machine learning, machine listening, audio/speech/audio-visual, multimodal fusion