SSG-DiT: A Spatial Signal Guided Framework for Controllable Video Generation

📅 2025-08-23
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Controllable video generation suffers from weak semantic consistency and difficulty in responding precisely to fine-grained prompts. To address this, we propose SSG-DiT, a two-stage decoupled framework augmented with a lightweight SSG-Adapter. In the first stage, spatially aware features extracted from a pre-trained multimodal model form a visual prompt; in the second stage, this prompt is combined with the text condition and injected into a frozen video Diffusion Transformer backbone through the adapter's dual-branch attention mechanism. The joint spatial-textual conditioning is parameter-efficient and requires no fine-tuning of the backbone, yet significantly enhances the modeling of spatial relationships and complex semantic details. Extensive experiments demonstrate state-of-the-art performance on multiple VBench metrics, particularly improving spatial relation controllability and overall generation consistency.

๐Ÿ“ Abstract
Controllable video generation aims to synthesize video content that aligns precisely with user-provided conditions, such as text descriptions and initial images. However, a significant challenge persists in this domain: existing models often struggle to maintain strong semantic consistency, frequently generating videos that deviate from the nuanced details specified in the prompts. To address this issue, we propose SSG-DiT (Spatial Signal Guided Diffusion Transformer), a novel and efficient framework for high-fidelity controllable video generation. Our approach introduces a decoupled two-stage process. The first stage, Spatial Signal Prompting, generates a spatially aware visual prompt by leveraging the rich internal representations of a pre-trained multi-modal model. This prompt, combined with the original text, forms a joint condition that is then injected into a frozen video DiT backbone via our lightweight and parameter-efficient SSG-Adapter. This unique design, featuring a dual-branch attention mechanism, allows the model to simultaneously harness its powerful generative priors while being precisely steered by external spatial signals. Extensive experiments demonstrate that SSG-DiT achieves state-of-the-art performance, outperforming existing models on multiple key metrics in the VBench benchmark, particularly in spatial relationship control and overall consistency.
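The paper itself provides no code here; purely as an illustrative sketch, the dual-branch attention described in the abstract could be approximated as two parallel cross-attentions (one over text tokens, one over spatial-signal tokens) whose outputs are blended by a gate and added residually to the frozen backbone's hidden states. All names, shapes, and the scalar gate below are assumptions for illustration, not the paper's actual implementation, which presumably uses learned multi-head attention inside a DiT block.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv):
    # single-head scaled dot-product attention with tied keys/values, for brevity
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)
    return softmax(scores) @ kv

def ssg_adapter(hidden, text_tokens, spatial_tokens, gate=0.5):
    # Hypothetical dual-branch adapter: one branch attends to the text
    # condition, the other to the spatial signal prompt; a gate blends the
    # two, and a residual add preserves the frozen backbone's priors.
    text_branch = cross_attention(hidden, text_tokens)
    spatial_branch = cross_attention(hidden, spatial_tokens)
    fused = gate * spatial_branch + (1.0 - gate) * text_branch
    return hidden + fused

rng = np.random.default_rng(0)
h = rng.standard_normal((16, 64))    # 16 video latent tokens, dim 64
txt = rng.standard_normal((8, 64))   # 8 text-condition tokens
spa = rng.standard_normal((8, 64))   # 8 spatial-signal tokens
out = ssg_adapter(h, txt, spa)
print(out.shape)  # (16, 64)
```

Because the adapter only adds a gated residual on top of the backbone's hidden states, the frozen DiT can be steered by the spatial signal without touching its own weights, which is what makes the design parameter-efficient.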
Problem

Research questions and friction points this paper is trying to address.

Maintaining semantic consistency in controllable video generation
Addressing deviation from prompt details in video synthesis
Enhancing spatial relationship control and overall consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Signal Prompting for visual guidance
SSG-Adapter injects conditions into frozen backbone
Dual-branch attention mechanism balances generative priors with external spatial guidance
Peng Hu
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China

Yu Gu
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China

Liang Luo
University of Washington
Systems for Machine Learning · Computer Systems · Computer Architecture · Machine Learning for Systems

Fuji Ren
Professor, University of Electronic Science and Technology of China
Artificial Intelligence · Computer Science · Affective Computing