AI Summary
Controllable video generation suffers from weak semantic consistency and difficulty in responding precisely to fine-grained prompts. To address this, we propose a decoupled two-stage framework augmented with a lightweight Spatial Signal Guided Adapter (SSG-Adapter). In the first stage, spatially aware features extracted from a pre-trained multimodal model form a visual prompt; in the second stage, a dual-branch attention mechanism jointly fuses this spatial prompt with the text condition and injects the result into a frozen video Diffusion Transformer backbone. The SSG-Adapter applies this joint spatial-textual conditioning in a parameter-efficient manner, without fine-tuning the backbone. This design significantly enhances the modeling of spatial relationships and complex semantic details. Extensive experiments demonstrate state-of-the-art performance on multiple VBench metrics, with particular gains in spatial relationship control and overall generation consistency.
Abstract
Controllable video generation aims to synthesize video content that aligns precisely with user-provided conditions, such as text descriptions and initial images. However, a significant challenge persists in this domain: existing models often struggle to maintain strong semantic consistency, frequently generating videos that deviate from the nuanced details specified in the prompts. To address this issue, we propose SSG-DiT (Spatial Signal Guided Diffusion Transformer), a novel and efficient framework for high-fidelity controllable video generation. Our approach introduces a decoupled two-stage process. The first stage, Spatial Signal Prompting, generates a spatially aware visual prompt by leveraging the rich internal representations of a pre-trained multimodal model. This prompt, combined with the original text, forms a joint condition that is then injected into a frozen video DiT backbone via our lightweight and parameter-efficient SSG-Adapter. This design, featuring a dual-branch attention mechanism, allows the model to harness its powerful generative priors while being precisely steered by external spatial signals. Extensive experiments demonstrate that SSG-DiT achieves state-of-the-art performance, outperforming existing models on multiple key metrics in the VBench benchmark, particularly in spatial relationship control and overall consistency.
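The abstract does not spell out the adapter's internals, but the dual-branch idea can be illustrated with a minimal NumPy sketch: the backbone's hidden states cross-attend to text tokens in one branch and to the spatially aware prompt tokens in the other, and the two outputs are fused with a gate and added residually, so the frozen backbone's representation is steered rather than replaced. All names, shapes, the single-head attention, and the scalar gate here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_in, kv_in, Wq, Wk, Wv):
    # Single-head scaled dot-product cross-attention (illustrative).
    Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def ssg_adapter(hidden, text_feats, spatial_feats, params, gate=0.5):
    """Hypothetical dual-branch adapter: one branch attends to text tokens,
    the other to spatially aware visual-prompt tokens; outputs are gated
    and added residually to the frozen backbone's hidden states."""
    text_out = cross_attention(hidden, text_feats, *params["text"])
    spatial_out = cross_attention(hidden, spatial_feats, *params["spatial"])
    # Residual fusion: only the small adapter weights in `params` are
    # trainable, matching the parameter-efficient, frozen-backbone setup.
    return hidden + gate * text_out + (1.0 - gate) * spatial_out

# Toy usage with assumed shapes: 16 video tokens, 10 text tokens,
# 5 spatial-prompt tokens, model width 8.
rng = np.random.default_rng(0)
d = 8
hidden = rng.standard_normal((16, d))
text_feats = rng.standard_normal((10, d))
spatial_feats = rng.standard_normal((5, d))
params = {
    "text": [rng.standard_normal((d, d)) * 0.02 for _ in range(3)],
    "spatial": [rng.standard_normal((d, d)) * 0.02 for _ in range(3)],
}
out = ssg_adapter(hidden, text_feats, spatial_feats, params)
```

Because the adapter is purely residual, initializing its projections to zero leaves the backbone's output untouched, a common trick for stable adapter training.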