🤖 AI Summary
Video salient object detection (VSOD) relies heavily on authentic motion cues, yet suffers from a severe scarcity of annotated video data. Existing methods that synthesize pseudo-video sequences from static images fail to produce semantically coherent, temporally consistent optical flow, resulting in poor motion-guided detection performance. To address this, we propose TransFlow, the first framework to transfer semantic motion priors from a pre-trained video diffusion model to VSOD. Our method conditions optical flow generation on a single input image, explicitly decoupling content and motion representations to synthesize physically plausible, scene-aware flow fields and the corresponding training video sequences. Unlike conventional spatial-transformation-based approaches, our method overcomes their inherent limitations in motion realism. Extensive experiments show significant improvements across multiple VSOD benchmarks, validating both the effectiveness and the generalizability of motion knowledge transfer.
📝 Abstract
Video salient object detection (VSOD) relies on motion cues to distinguish salient objects from their backgrounds, but training such models is constrained by the scarcity of annotated video datasets relative to abundant image datasets. Existing approaches that apply spatial transformations to static images to create video sequences fail on motion-guided tasks, because these transformations produce unrealistic optical flows that lack any semantic understanding of motion. We present TransFlow, which transfers motion knowledge from pre-trained video diffusion models to generate realistic training data for VSOD. Video diffusion models have learned rich semantic motion priors from large-scale video data, capturing how different objects naturally move in real scenes. TransFlow leverages this knowledge to generate semantically aware optical flows from static images, in which objects exhibit natural motion patterns while preserving spatial boundaries and temporal coherence. Our method achieves improved performance across multiple benchmarks, demonstrating effective motion knowledge transfer.
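To make the data-synthesis idea concrete: once a flow field has been generated for a static image, a pseudo next frame can be produced by warping the image along that flow. The sketch below is a minimal, dependency-free illustration under stated assumptions, not TransFlow's actual pipeline: the uniform flow stands in for a diffusion-sampled flow field, and nearest-neighbour backward warping is chosen only for brevity.

```python
import numpy as np

def warp_with_flow(image, flow):
    """Backward-warp an image with a dense optical flow field.

    image: (H, W, C) array; flow: (H, W, 2) array of per-pixel
    (dx, dy) displacements, e.g. sampled from a flow-generating model.
    Nearest-neighbour sampling keeps this sketch dependency-free.
    """
    H, W = image.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    # Backward warping: each target pixel pulls from (x - dx, y - dy).
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, H - 1)
    return image[src_y, src_x]

# Stand-in for a diffusion-sampled flow: a uniform rightward drift.
H, W = 64, 64
image = np.random.rand(H, W, 3)
flow = np.zeros((H, W, 2))
flow[..., 0] = 3.0  # every pixel moves 3 pixels to the right

next_frame = warp_with_flow(image, flow)  # one synthesized frame
```

Repeating this step with a sequence of flows yields a pseudo-video clip; in practice a bilinear sampler (e.g. `torch.nn.functional.grid_sample`) would replace the nearest-neighbour lookup.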