LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses the high computational cost of video recognition with Transformers, which stems from processing all spatiotemporal tokens despite significant redundancy in video data. The authors propose LookWhen, a selector-extractor framework that factorizes recognition into learning when, where, and what to compute: a shallow selector scores all tokens of a scaled-down video, and a deep extractor processes only the top-K selected tokens to approximate full-video representations. Key contributions include selector pre-training supervised by a uniqueness score based on nearest-neighbor distance, extractor pre-training that distills both a video teacher and an image teacher, and normalization of frame-wise teacher representations. Evaluated on six video benchmarks, LookWhen achieves a better accuracy-computation trade-off than efficient models and upgraded baselines of similar size: it Pareto-dominates in accuracy-FLOPs on 9 of 12 cases (6 tasks × 2 settings), roughly matches on the other 3, and runs 6.7× faster than InternVideo2-B at equal accuracy.
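The selection idea summarized above (rank tokens by how unique they are, then keep the top K) can be illustrated with a small sketch. This is not the authors' implementation: the use of Euclidean distance, the function names `nearest_neighbor_uniqueness` and `select_top_k`, and the toy 2-D "token representations" are all assumptions made for illustration. A token whose nearest neighbor is far away is treated as more unique, and redundant tokens that sit close to each other receive low scores.

```python
import math


def nearest_neighbor_uniqueness(tokens):
    """Score each token by the Euclidean distance to its nearest other token.

    Assumption: Euclidean distance on raw vectors; the paper may use a
    different metric or representation space. Requires at least two tokens.
    """
    scores = []
    for i, token in enumerate(tokens):
        nearest = min(
            math.dist(token, other)
            for j, other in enumerate(tokens)
            if j != i
        )
        scores.append(nearest)
    return scores


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data (hypothetical): three near-duplicate tokens and two distinct ones.
tokens = [
    [0.0, 0.0],
    [0.0, 0.1],
    [0.1, 0.0],
    [5.0, 5.0],
    [-4.0, 3.0],
]
scores = nearest_neighbor_uniqueness(tokens)
kept = select_top_k(scores, k=2)
print(kept)
```

In this toy case the three clustered tokens each have a neighbor 0.1 away, so the two isolated tokens are the ones kept. In a selector-extractor setup, only the kept tokens would be passed on to the deep extractor.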

📝 Abstract

Transformers dominate video recognition. They split videos into tokens, and processing those tokens has an expensive, superlinear computational cost. Yet videos are filled with redundancy, so we can question the need for this expense. We introduce LookWhen, a selector-extractor framework that factorizes video recognition into learning when, where, and what to compute. Our shallow selector gets a scaled-down video and quickly scores all tokens across space-time, while our deep extractor gets the top-K selected tokens to approximate full-video representations without actually processing all the tokens. A key challenge is defining effective supervision for selection and extraction. For selection pre-training, we introduce a score on representations that ranks tokens by uniqueness using a simple nearest-neighbor distance. For extraction pre-training, we distill both a video teacher and an image teacher, normalizing the image teacher's frame-wise representations to learn what changes within videos. Through these strategies, our selector-extractor learns general and efficient representations for feature extraction or fine-tuning to a task. Through experiments on Kinetics-400, SSv2, Epic-Kitchens, Diving48, Jester, and Charades, we show that LookWhen achieves a better accuracy-computation trade-off than efficient models and upgraded baselines of similar size. LookWhen Pareto-dominates in accuracy-FLOPs on 9 of 12 cases (6 tasks × 2 settings) and roughly matches on the other 3. In accuracy-throughput, which measures time in practice, LookWhen is more efficient still: it runs 6.7× faster than InternVideo2-B at equal accuracy.
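The abstract does not spell out how the frame-wise representations are normalized. One plausible reading, sketched below purely as an assumption, is that each feature dimension is standardized across the frames of a single video, so that content shared by all frames cancels out and the distillation target emphasizes what changes over time. The function name `normalize_across_frames`, the `eps` constant, and the toy features are hypothetical and are not taken from the paper.

```python
import statistics


def normalize_across_frames(frame_features, eps=1e-6):
    """Standardize each feature dimension over the frames of one video.

    Assumption: per-dimension mean/std normalization over time; the
    paper's actual normalization may differ.
    frame_features: list of per-frame feature vectors of equal length.
    """
    num_dims = len(frame_features[0])
    # Gather each feature dimension as a column over time.
    columns = [[frame[d] for frame in frame_features] for d in range(num_dims)]
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(frame[d] - means[d]) / (stds[d] + eps) for d in range(num_dims)]
        for frame in frame_features
    ]


# Toy data (hypothetical): dimension 0 is static, dimension 1 changes per frame.
features = [
    [10.0, 1.0],
    [10.0, 2.0],
    [10.0, 3.0],
]
normalized = normalize_across_frames(features)
for row in normalized:
    print(row)
```

Under this reading, the static dimension maps to zero in every frame, while the changing dimension is centered and rescaled, which matches the stated goal of learning what changes within videos.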

Problem

Research questions and friction points this paper is trying to address.

video recognition

computational efficiency

token redundancy

accuracy-computation trade-off

efficient inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

selector-extractor

token selection

efficient video recognition