🤖 AI Summary
Existing video layer decomposition models suffer from poor generalization and slow convergence, as they require training implicit neural representations (INRs) independently for each video. To address this, we propose the first hypernetwork-based meta-learning framework for video decomposition. Our method employs a video encoder to extract global semantic embeddings, which condition a hypernetwork to generate compact, video-specific INR parameters—enabling cross-video knowledge transfer and mitigating overfitting to individual videos. Evaluated on multiple benchmarks, our approach accelerates training by 3–5× while maintaining or improving layer separation quality, and supports real-time, editing-grade inference. The core contribution is the first integration of hypernetworks into video layer decomposition, establishing an efficient and generalizable meta-learning paradigm for this task.
📝 Abstract
Decomposing a video into a layer-based representation is crucial for video editing in the creative industries, as it enables specific layers to be edited independently. Existing video-layer decomposition models rely on implicit neural representations (INRs) trained independently for each video, making the process time-consuming when applied to new videos. Noticing this limitation, we propose a meta-learning strategy that learns a generic video decomposition model to speed up training on new videos. Our model is based on a hypernetwork architecture which, given a video-encoder embedding, generates the parameters for a compact INR-based neural video decomposition model. Our strategy mitigates single-video overfitting and, importantly, shortens convergence time for video decomposition on new, unseen videos. Our code is available at: https://hypernvd.github.io/
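To make the core mechanism concrete, here is a minimal NumPy sketch of the hypernetwork idea: a video-level embedding is mapped to a flat parameter vector, which is then reshaped into the weights of a small coordinate-based INR. All sizes, names, and the single-linear-layer hypernetwork are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 8              # video-encoder embedding size (assumed, for illustration)
HIDDEN = 16              # hidden width of the compact INR (assumed)
IN_DIM, OUT_DIM = 3, 4   # INR maps (x, y, t) -> e.g. RGB + a layer alpha (assumed)

# Shapes of the INR parameters the hypernetwork must generate.
shapes = [(IN_DIM, HIDDEN), (HIDDEN,), (HIDDEN, OUT_DIM), (OUT_DIM,)]
n_params = sum(int(np.prod(s)) for s in shapes)

# Hypernetwork: here just one linear map from embedding to all INR parameters.
W_hyper = rng.normal(0.0, 0.1, (EMB_DIM, n_params))

def generate_inr_params(embedding):
    """Split the hypernetwork output into the INR's weight/bias tensors."""
    flat = embedding @ W_hyper
    params, i = [], 0
    for s in shapes:
        n = int(np.prod(s))
        params.append(flat[i:i + n].reshape(s))
        i += n
    return params

def inr_forward(params, coords):
    """Compact INR: coords of shape (N, 3) -> per-point outputs of shape (N, 4)."""
    W1, b1, W2, b2 = params
    h = np.sin(coords @ W1 + b1)   # sinusoidal activation, SIREN-style (assumed)
    return h @ W2 + b2

embedding = rng.normal(size=EMB_DIM)        # stand-in for a video embedding
params = generate_inr_params(embedding)
coords = rng.uniform(-1.0, 1.0, (5, IN_DIM))  # five (x, y, t) query points
out = inr_forward(params, coords)
print(out.shape)  # (5, 4)
```

The point of the design is that only `W_hyper` (and the video encoder) are shared and meta-learned across videos, while the per-video INR parameters are produced in a single forward pass, which is what allows fast adaptation to unseen videos.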