SEAL: Semantic Attention Learning for Long Video Representation

📅 2024-12-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-form video understanding faces dual challenges of high computational complexity and severe temporal redundancy. To address these, we propose Semantic Attention Learning (SEAL), a unified representation framework that, for the first time, decomposes videos into three semantic entities—scenes, objects, and actions—and formulates representation learning as a subset selection optimization problem with explicit diversity constraints. This enables joint modeling of relevance and diversity. Departing from conventional frame- or pixel-level processing paradigms, SEAL integrates semantic decomposition, diversity-aware attention mechanisms, and multi-task joint learning to significantly improve both representational efficiency and discriminability. Extensive experiments demonstrate state-of-the-art performance across video question answering and temporal grounding tasks on major benchmarks, including LVBench, MovieChat-1K, and Ego4D.

📝 Abstract
Long video understanding presents challenges due to the inherent high computational complexity and redundant temporal information. An effective representation for long videos must efficiently process such redundancy while preserving essential contents for downstream tasks. This paper introduces SEmantic Attention Learning (SEAL), a novel unified representation for long videos. To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities: scenes, objects, and actions, allowing models to operate on a compact set of entities rather than a large number of frames or pixels. To further address redundancy, we propose an attention learning module that balances token relevance with diversity, formulated as a subset selection optimization problem. Our representation is versatile and applicable across various long video understanding tasks. Extensive experiments demonstrate that SEAL significantly outperforms state-of-the-art methods in video question answering and temporal grounding tasks across diverse benchmarks, including LVBench, MovieChat-1K, and Ego4D.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational complexity in long video processing
Addressing redundant temporal information in videos
Improving representation for diverse video understanding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes videos into three semantic entity types: scenes, objects, and actions
Uses an attention learning module to balance token relevance and diversity
Formulates redundancy reduction as a subset selection optimization problem
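The relevance–diversity trade-off described above can be illustrated with a greedy subset-selection heuristic (in the style of maximal marginal relevance). The paper's exact objective, entity features, and solver are not given here, so the function names, scores, and the `lam` trade-off parameter below are illustrative assumptions, not SEAL's actual formulation:

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors; 0.0 for zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_tokens(features, relevance, k, lam=0.5):
    """Greedily pick k entity tokens, trading off relevance to the task
    (higher is better) against redundancy with already-selected tokens
    (max cosine similarity; lower is better). Illustrative only."""
    selected = []
    candidates = list(range(len(features)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(features[i], features[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Two near-duplicate tokens (0 and 1) plus one distinct token (2):
# after picking token 0, the diverse token 2 beats the redundant token 1
# even though token 1 has higher raw relevance.
feats = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
rel = [1.0, 0.95, 0.6]
print(select_tokens(feats, rel, k=2))  # → [0, 2]
```

With `lam` closer to 1 the selection degenerates to top-k by relevance; smaller values enforce diversity more aggressively, which is the balance the attention learning module is described as optimizing.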
Authors
Lan Wang — Michigan State University, Google
Yujia Chen — Google
Du Tran — Google
V. Boddeti — Michigan State University
Wen-Sheng Chu — Research Scientist, Google

Computer Vision · Machine Learning