Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis

πŸ“… 2025-05-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work investigates whether pre-trained high-dimensional video generation models can be effectively transferred to low-dimensional controllable image synthesis. To this end, we propose the Dimension-Reduction Attack (DRA-Ctrl), a paradigm that compresses and adapts knowledge from video models to image generation. Methodologically, we introduce the first cross-dimensional knowledge transfer framework from video to image generation; design a mixup-based inter-frame transition strategy to bridge the modality gap between temporal continuity (video) and spatial discreteness (image); and incorporate a learnable text-conditioned attention mask, together with spatiotemporal feature disentanglement and remapping, to enhance text–image alignment and control fidelity. Experiments demonstrate that DRA-Ctrl surpasses dedicated image generation models on subject-driven and spatially conditioned synthesis tasks. This is the first empirical validation that large-scale video foundation models possess strong cross-modal generative capacity and generalization ability beyond their native domain.
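To make the mixup-based inter-frame transition concrete, here is a minimal PyTorch sketch. It assumes the transition linearly mixes a condition latent into a target latent with Beta-sampled weights, as in standard mixup; the function name, tensor shapes, and sampling scheme are illustrative assumptions, not the paper's released code.

```python
import torch

def mixup_transition(cond_frame: torch.Tensor,
                     target_frame: torch.Tensor,
                     num_transition: int = 4,
                     alpha: float = 2.0) -> torch.Tensor:
    """Build a short pseudo-video by linearly mixing the condition and
    target latents, giving a video backbone the smooth temporal change
    it was trained on instead of an abrupt two-frame jump.

    cond_frame, target_frame: (C, H, W) latents; returns (T, C, H, W).
    """
    # Mixing weights drawn from a Beta distribution, as in mixup;
    # sorted so the sequence moves monotonically from condition to target.
    lam = torch.distributions.Beta(alpha, alpha).sample((num_transition,))
    lam, _ = torch.sort(lam)
    frames = [cond_frame]
    for l in lam:
        frames.append((1 - l) * cond_frame + l * target_frame)
    frames.append(target_frame)
    return torch.stack(frames, dim=0)
```

A repurposed video backbone would then process this short pseudo-clip, keeping only the final frame as the generated image.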

πŸ“ Abstract
Video generative models can be regarded as world simulators due to their ability to capture dynamic, continuous changes inherent in real-world environments. These models integrate high-dimensional information across visual, temporal, spatial, and causal dimensions, enabling predictions of subjects in various states. A natural and valuable research direction is to explore whether a fully trained video generative model in high-dimensional space can effectively support lower-dimensional tasks such as controllable image generation. In this work, we propose a paradigm for video-to-image knowledge compression and task adaptation, termed Dimension-Reduction Attack (DRA-Ctrl), which utilizes the strengths of video models, including long-range context modeling and flattened full attention, to perform various generation tasks. Specifically, to address the challenging gap between continuous video frames and discrete image generation, we introduce a mixup-based transition strategy that ensures smooth adaptation. Moreover, we redesign the attention structure with a tailored masking mechanism to better align text prompts with image-level control. Experiments across diverse image generation tasks, such as subject-driven and spatially conditioned generation, show that repurposed video models outperform those trained directly on images. These results highlight the untapped potential of large-scale video generators for broader visual applications. DRA-Ctrl provides new insights into reusing resource-intensive video models and lays a foundation for future unified generative models across visual modalities. The project page is https://dra-ctrl-2025.github.io/DRA-Ctrl/.
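The "flattened full attention" the abstract leans on simply means treating every spatiotemporal position of a video latent as one long token sequence, so a single attention pass mixes information across frames and spatial locations alike. A minimal sketch of that flattening, with hypothetical shapes:

```python
import torch

def flatten_spatiotemporal(latents: torch.Tensor) -> torch.Tensor:
    """Flatten a (B, T, C, H, W) video latent into a (B, T*H*W, C)
    token sequence for a single full-attention pass over all frames
    and spatial positions at once."""
    B, T, C, H, W = latents.shape
    return latents.permute(0, 1, 3, 4, 2).reshape(B, T * H * W, C)
```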
Problem

Research questions and friction points this paper is trying to address.

Explores video models for controllable image generation
Bridges the gap between continuous video frames and discrete image synthesis
Enhances image generation via video model adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-to-image knowledge compression paradigm
Mixup-based inter-frame transition strategy
Redesigned attention with tailored masking (sketched below)
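As referenced above, here is a minimal sketch of what a learnable text-conditioned attention mask could look like, assuming additive per-token mask logits applied before the softmax. The names, shapes, and additive-logit formulation are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def masked_text_image_attention(q_img: torch.Tensor,
                                k_txt: torch.Tensor,
                                v_txt: torch.Tensor,
                                mask_logits: torch.Tensor) -> torch.Tensor:
    """Cross-attention from image-patch queries to text-token keys,
    modulated by a learnable per-token mask.

    q_img: (B, N_img, D); k_txt, v_txt: (B, N_txt, D);
    mask_logits: (B, N_txt) learnable logits added before the softmax,
    so individual text tokens can be softly suppressed or emphasised.
    """
    scale = q_img.shape[-1] ** -0.5
    attn = torch.einsum('bnd,bmd->bnm', q_img, k_txt) * scale
    attn = attn + mask_logits.unsqueeze(1)  # broadcast over image patches
    attn = F.softmax(attn, dim=-1)
    return torch.einsum('bnm,bmd->bnd', attn, v_txt)
```

Making the mask a learned function of the text embedding (rather than a hard binary mask) lets the model keep gradients flowing while still steering which prompt tokens dominate image-level control.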
πŸ”Ž Similar Papers
No similar papers found.
Hengyuan Cao
Zhejiang University
Yutong Feng
Alibaba Tongyi Lab | Tsinghua University
Generative AI · Computer Vision
Biao Gong
Ant Group | Alibaba Group
Generative Model · Retrieval · 3D Vision
Yijing Tian
Hangzhou Normal University
Yunhong Lu
Zhejiang University
Chuang Liu
Hangzhou Normal University
Bin Wang
Kunbyte AI