Smooth regularization for efficient video recognition

📅 2025-11-25

🤖 AI Summary

Lightweight video recognition models have limited temporal modeling capacity and struggle to capture the natural coherence of frame sequences. To address this, we propose a smoothing regularization method based on a Gaussian Random Walk (GRW). It models the changes between intermediate-layer embeddings of consecutive frames as a GRW, which penalizes abrupt representational shifts, favors low-acceleration feature trajectories, and strengthens the temporal inductive bias. The method is plug-and-play and compatible with lightweight architectures such as MoViNets and MobileNetV3; it requires no architectural modifications and adds no inference overhead. On Kinetics-600, it improves the accuracy of lightweight models by 3.8% to 6.4%: the MoViNets family improves on the prior state of the art by 3.8% to 6.1% within its respective FLOP constraints, while MobileNetV3 and the MoViNets-Stream family gain 4.9% to 6.4% over prior state-of-the-art models with comparable memory footprints. Our core contribution is applying GRW modeling to video representation learning as a differentiable, computationally efficient way to strengthen temporal modeling in lightweight models.

📝 Abstract

We propose a smooth regularization technique that instills a strong temporal inductive bias in video recognition models, particularly benefiting lightweight architectures. Our method encourages smoothness in the intermediate-layer embeddings of consecutive frames by modeling their changes as a Gaussian Random Walk (GRW). This penalizes abrupt representational shifts, thereby promoting low-acceleration solutions that better align with the natural temporal coherence inherent in videos. By leveraging this enforced smoothness, lightweight models can more effectively capture complex temporal dynamics. Applied to such models, our technique yields a 3.8% to 6.4% accuracy improvement on Kinetics-600. Notably, the MoViNets model family trained with our smooth regularization improves the current state of the art by 3.8% to 6.1% within their respective FLOP constraints, while MobileNetV3 and the MoViNets-Stream family achieve gains of 4.9% to 6.4% over prior state-of-the-art models with comparable memory footprints. Our code and models are available at https://github.com/gilgoldm/grw-smoothing.
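The abstract's idea of a low-acceleration penalty can be illustrated with a minimal sketch. It is not taken from the authors' repository; it assumes that "modeling changes as a Gaussian Random Walk" means the frame-to-frame changes take independent Gaussian steps with standard deviation `sigma`, so the negative log-likelihood reduces (up to a constant) to a squared second-difference term. The function name `grw_smoothness_penalty`, the `(T, D)` feature layout, and the `sigma` parameter are all hypothetical.

```python
import numpy as np


def grw_smoothness_penalty(features, sigma=1.0):
    """Illustrative smoothness penalty (assumption, not the paper's exact loss).

    features: array of shape (T, D), one intermediate-layer embedding per frame.
    sigma: assumed standard deviation of the Gaussian random-walk step.
    """
    features = np.asarray(features, dtype=np.float64)
    if features.shape[0] < 3:
        # Fewer than three frames: no second difference can be formed.
        return 0.0
    velocity = np.diff(features, axis=0)      # frame-to-frame change, shape (T-1, D)
    acceleration = np.diff(velocity, axis=0)  # change of the change, shape (T-2, D)
    # Gaussian negative log-likelihood of the steps, up to an additive constant,
    # averaged over the T-2 time steps.
    return float(np.mean(np.sum(acceleration ** 2, axis=1)) / (2.0 * sigma ** 2))


# A feature trajectory that moves at constant velocity incurs no penalty ...
linear = np.outer(np.arange(8), np.ones(4))
# ... while an abrupt jump in one frame is penalized.
jumpy = linear.copy()
jumpy[4] += 5.0

smooth_cost = grw_smoothness_penalty(linear)
jumpy_cost = grw_smoothness_penalty(jumpy)
```

In training, a term like this would presumably be added to the classification loss with a weighting coefficient (a hypothetical hyperparameter here). Because it acts only on the training loss, it would leave the network and its inference cost unchanged.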

Problem

Research questions and friction points this paper is trying to address.

Improving video recognition accuracy for lightweight models

Enforcing temporal smoothness in video frame representations

Enhancing temporal coherence modeling in efficient architectures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Smooth regularization technique for video recognition models

Gaussian Random Walk modeling for temporal coherence

Improved accuracy for lightweight video architectures
