Hierarchical Activity Recognition and Captioning from Long-Form Audio

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches struggle to model long-duration, hierarchically structured audio activities in real-world scenarios, as most are limited to short clips and isolated events. To address this gap, this work proposes MultiAct—the first benchmark for multi-level activity understanding in long-form kitchen audio—featuring three-tier semantic annotations (activities, sub-activities, and events) along with fine-grained descriptions and high-level summaries. The authors further introduce a unified hierarchical multi-task model that jointly performs multi-level classification, temporal detection, sequence prediction, and multi-resolution text generation. Comprehensive experiments on MultiAct establish strong baselines and reveal key challenges in capturing the hierarchical and compositional structure inherent in long audio sequences, offering new directions for future research.

📝 Abstract
Complex activities in real-world audio unfold over extended durations and exhibit hierarchical structure, yet most prior work focuses on short clips and isolated events. To bridge this gap, we introduce MultiAct, a new dataset and benchmark for multi-level structured understanding of human activities from long-form audio. MultiAct comprises long-duration kitchen recordings annotated at three semantic levels (activities, sub-activities and events) and paired with fine-grained captions and high-level summaries. We further propose a unified hierarchical model that jointly performs classification, detection, sequence prediction and multi-resolution captioning. Experiments on MultiAct establish strong baselines and reveal key challenges in modelling the hierarchical and compositional structure of long-form audio. A promising direction for future work is the exploration of methods better suited to capturing complex, long-range relationships in long-form audio.
Problem

Research questions and friction points this paper is trying to address.

hierarchical activity recognition
long-form audio
audio captioning
complex activities
multi-level understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical activity recognition
long-form audio
multi-resolution captioning
structured understanding
unified hierarchical model
Peng Zhang
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, U.K.
Qingyu Luo
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, U.K.
Philip J. B. Jackson
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, U.K.
Wenwu Wang
Professor, University of Surrey, UK
signal processing, machine learning, machine listening, audio/speech/audio-visual, multimodal fusion