AI Summary
This work investigates whether pre-trained high-dimensional video generation models can be effectively transferred to low-dimensional controllable image synthesis. To this end, we propose the Dimensionality Reduction Attack for Controllable Generation (DRA-Ctrl), a paradigm that compresses and adapts knowledge from video models to image generation. Methodologically, we introduce the first cross-dimensional knowledge transfer framework from video to image generation; design a mixup-based inter-frame transition strategy to bridge the modality gap between temporal continuity (video) and spatial discreteness (image); and incorporate a learnable text-conditioned attention mask alongside spatiotemporal feature disentanglement and remapping to enhance text-image alignment and control fidelity. Experiments demonstrate that DRA-Ctrl surpasses dedicated image generation models on subject-driven and spatially conditioned synthesis tasks. This is the first empirical validation that large-scale video foundation models possess strong cross-modal generative capacity and generalization ability beyond their native domain.
Abstract
Video generative models can be regarded as world simulators due to their ability to capture the dynamic, continuous changes inherent in real-world environments. These models integrate high-dimensional information across visual, temporal, spatial, and causal dimensions, enabling predictions of subjects in various states. A natural and valuable research direction is to explore whether a video generative model fully trained in high-dimensional space can effectively support lower-dimensional tasks such as controllable image generation. In this work, we propose a paradigm for video-to-image knowledge compression and task adaptation, termed *Dimension-Reduction Attack* (`DRA-Ctrl`), which leverages the strengths of video models, including long-range context modeling and flattened full attention, to perform a variety of generation tasks. Specifically, to bridge the challenging gap between continuous video frames and discrete image generation, we introduce a mixup-based transition strategy that ensures smooth adaptation. Moreover, we redesign the attention structure with a tailored masking mechanism to better align text prompts with image-level control. Experiments across diverse image generation tasks, such as subject-driven and spatially conditioned generation, show that repurposed video models outperform those trained directly on images. These results highlight the untapped potential of large-scale video generators for broader visual applications. `DRA-Ctrl` provides new insights into reusing resource-intensive video models and lays the foundation for future unified generative models across visual modalities. The project page is https://dra-ctrl-2025.github.io/DRA-Ctrl/.
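The abstract does not spell out the mixup-based transition strategy, but the underlying mixup operation is a standard linear blend of two samples. The sketch below illustrates that core idea on two frame-like latents; the function name, tensor shapes, and the Beta(0.4, 0.4) sampling of the blend coefficient are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def mixup_transition(frame_a, frame_b, lam):
    """Classic mixup: linearly blend two frames with coefficient lam in [0, 1],
    producing a smooth intermediate between them."""
    return lam * frame_a + (1.0 - lam) * frame_b

# Toy example: blend two 4x4 "frame latents" drawn at random.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4))
b = rng.standard_normal((4, 4))
lam = rng.beta(0.4, 0.4)  # mixup commonly samples lam from a Beta distribution
blended = mixup_transition(a, b, lam)
```

In a video-to-image adaptation setting, such a blend could soften the boundary between temporally adjacent frames and independently sampled target images, which is the modality gap the paper's strategy is designed to bridge.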