Dual-Stream Alignment for Action Segmentation

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Action segmentation aims to localize the start and end timestamps of atomic actions in continuous video streams; existing single-stream approaches, however, have limited modeling capacity. To address this, we propose a hybrid quantum-classical dual-stream framework, which we believe to be the first of its kind for this task: the first stream captures frame-level spatiotemporal features, while the second explicitly models action-level semantics and action-transition cues. The two streams communicate through a Temporal Context (TC) block that fuses their features using cross-attention and Quantum-based Action-Guided Modulation (Q-ActGM). A Dual-Stream Alignment Loss encourages the streams to learn a shared feature space through three components: relational consistency, cross-level contrastive, and cycle-consistency reconstruction losses. The method reports state-of-the-art performance on four benchmarks: GTEA, Breakfast, 50Salads, and EgoProceL. Ablation studies examine the contribution of each component.

📝 Abstract
Action segmentation is a challenging yet active research area that involves identifying when and where specific actions occur in continuous video streams. Most existing work has focused on single-stream approaches that model the spatio-temporal aspects of frame sequences. However, recent research has shifted toward two-stream methods that learn action-wise features to enhance action segmentation performance. In this work, we propose the Dual-Stream Alignment Network (DSA Net) and investigate the impact of incorporating a second stream of learned action features to guide segmentation by capturing both action and action-transition cues. Communication between the two streams is facilitated by a Temporal Context (TC) block, which fuses complementary information using cross-attention and Quantum-based Action-Guided Modulation (Q-ActGM), enhancing the expressive power of the fused features. To the best of our knowledge, this is the first study to introduce a hybrid quantum-classical machine learning framework for action segmentation. Our primary objective is for the two streams (frame-wise and action-wise) to learn a shared feature space through feature alignment. This is encouraged by the proposed Dual-Stream Alignment Loss, which comprises three components: relational consistency, cross-level contrastive, and cycle-consistency reconstruction losses. Following prior work, we evaluate DSA Net on several diverse benchmark datasets: GTEA, Breakfast, 50Salads, and EgoProceL. We further demonstrate the effectiveness of each component through extensive ablation studies. Notably, DSA Net achieves state-of-the-art performance, significantly outperforming existing
Problem

Research questions and friction points this paper is trying to address.

Enhancing action segmentation via dual-stream feature alignment
Integrating a hybrid quantum-classical framework into video action segmentation
Addressing action-transition modeling through cross-stream communication mechanisms
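
To make the cross-stream communication idea concrete, here is a minimal sketch of cross-attention in which each frame-wise feature attends over a small set of action-wise tokens. It is an illustration under stated assumptions, not the paper's TC block: the function names are hypothetical, there are no learned query/key/value projections, and the Q-ActGM modulation step is omitted because this page does not describe it.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(frames, actions):
    """Each frame feature (query) attends over action tokens (keys and values).

    frames:  T lists of length d (frame-wise stream)
    actions: K lists of length d (action-wise stream)
    Returns the T fused vectors and the T x K attention weights.
    Assumption: identity projections, i.e. no learned weights.
    """
    d = len(frames[0])
    fused, weights = [], []
    for q in frames:
        # Scaled dot-product score between this frame and every action token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in actions]
        w = softmax(scores)
        weights.append(w)
        # Attention-weighted sum of action tokens, added back to the frame
        # feature as a residual connection.
        ctx = [sum(w[j] * actions[j][i] for j in range(len(actions)))
               for i in range(d)]
        fused.append([qi + ci for qi, ci in zip(q, ctx)])
    return fused, weights
```

Each row of the returned weights sums to 1, so a frame's fused feature is its own feature plus a convex combination of the action tokens it attends to most strongly.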
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stream network with temporal context fusion
Quantum-classical hybrid framework for segmentation
Feature alignment via multi-component loss function
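
As an illustration of what one component of such a multi-component alignment loss might look like, the sketch below implements a generic InfoNCE-style contrastive loss that pulls each frame feature toward the action token of its ground-truth class and pushes it away from the others. This is an assumption standing in for the cross-level contrastive term: the page gives no formula, and the function names, the cosine similarity, and the temperature `tau` are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)


def cross_level_contrastive(frames, actions, labels, tau=0.1):
    """Mean InfoNCE loss between frame features and action tokens.

    frames:  T frame-level feature vectors
    actions: K action-level feature vectors, one per class
    labels:  T class indices; labels[t] is the positive token for frame t
    tau:     temperature (hypothetical default)
    """
    total = 0.0
    for f, y in zip(frames, labels):
        logits = [cosine(f, a) / tau for a in actions]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive action token.
        total += log_z - logits[y]
    return total / len(frames)
```

The loss is close to zero when every frame is most similar to its own action token and grows as frames drift toward the wrong tokens. In a full objective it would presumably be combined with the relational consistency and cycle-consistency reconstruction terms, whose weighting is not given here.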