ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation methods struggle to support independent, fine-grained editing of human subjects and their environments: approaches that lean on rigid 3D preprocessing gain precise control but sacrifice generative flexibility and practical scalability. This work proposes ONE-SHOT, a parameter-efficient, composable framework for human–environment video generation that decouples human motion from scene signals via a canonical-space injection mechanism and introduces Dynamic-Grounded-RoPE, a positional encoding that establishes spatial correspondences without any 3D alignment. By combining spatially disentangled cross-attention with a hybrid context fusion strategy, the method supports minute-long video synthesis and significantly outperforms current approaches in both structural control accuracy and creative diversity, achieving efficient and flexible co-generation of humans and environments.
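To make the decoupling idea concrete, here is a minimal PyTorch sketch of how video latents might attend to human-motion tokens and environment tokens through separate cross-attention branches, with a shared rotary positional index loosely standing in for the grounded positional correspondence. This is not the paper's implementation; every name, layer size, and design choice below (`DecoupledInjection`, `rope`, the residual sum of the two branches) is an illustrative assumption.

```python
# Illustrative sketch only: spatially decoupled injection of human-motion and
# environment tokens into a video latent stream via two cross-attention branches.
# Names and shapes are assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope(x, positions):
    """Apply a simple rotary position embedding to the last dim of x (batch, seq, dim)."""
    dim = x.shape[-1]
    half = dim // 2
    inv_freq = 1.0 / (10000 ** (torch.arange(half, device=x.device).float() / half))
    angles = positions.float()[..., None] * inv_freq          # (seq, half)
    sin, cos = angles.sin(), angles.cos()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class DecoupledInjection(nn.Module):
    """Video tokens attend separately to human-motion tokens and environment tokens."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv_human = nn.Linear(dim, 2 * dim)
        self.to_kv_env = nn.Linear(dim, 2 * dim)
        self.out = nn.Linear(dim, dim)

    def _attend(self, q, kv, q_pos, kv_pos):
        b = q.shape[0]
        k, v = kv.chunk(2, dim=-1)
        # A shared rotary index places video and context tokens in one coordinate
        # space, loosely mirroring the "grounded" positional-correspondence idea.
        q, k = rope(q, q_pos), rope(k, kv_pos)

        def split(t):
            return t.view(b, -1, self.heads, self.head_dim).transpose(1, 2)

        o = F.scaled_dot_product_attention(split(q), split(k), split(v))
        return o.transpose(1, 2).reshape(b, -1, self.heads * self.head_dim)

    def forward(self, video_tokens, human_tokens, env_tokens,
                video_pos, human_pos, env_pos):
        q = self.to_q(video_tokens)
        # Human motion and environment cues are injected through separate branches
        # so each signal can be edited or swapped independently.
        h = self._attend(q, self.to_kv_human(human_tokens), video_pos, human_pos)
        e = self._attend(q, self.to_kv_env(env_tokens), video_pos, env_pos)
        return video_tokens + self.out(h + e)   # residual injection of both signals


if __name__ == "__main__":
    layer = DecoupledInjection()
    vid, human, env = torch.randn(2, 64, 256), torch.randn(2, 16, 256), torch.randn(2, 32, 256)
    out = layer(vid, human, env, torch.arange(64), torch.arange(16), torch.arange(32))
    print(out.shape)  # torch.Size([2, 64, 256])
```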
📝 Abstract
Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, yet fine-grained and independent editing of subjects and scenes remains a critical challenge. Recent attempts to incorporate richer environment control through rigid 3D geometric compositions often encounter a stark trade-off between precise control and generative flexibility. Furthermore, the heavy 3D pre-processing still limits practical scalability. In this paper, we propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation. Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We also propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes spatial correspondences between disparate spatial domains without any heuristic 3D alignments. To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism to maintain subject and scene consistency across minute-level generations. Experiments demonstrate that our method significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis. Our project page is available at: https://martayang.github.io/ONE-SHOT/.
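The abstract does not detail how the Hybrid Context Integration mechanism maintains consistency over minute-level generations. The sketch below shows one common pattern such a mechanism may resemble: each generated chunk is conditioned on a hybrid context of fixed subject/scene anchor latents plus a rolling window of recently generated latents. All class names, shapes, and the stand-in model are hypothetical assumptions for illustration, not the paper's method.

```python
# Hypothetical sketch of a hybrid context buffer for long-horizon, chunked generation:
# persistent anchor latents (subject/scene identity) + a rolling short-term window.
from collections import deque
import torch


class HybridContext:
    def __init__(self, anchor_latents, window_size=4):
        self.anchor = anchor_latents              # long-term identity/scene references
        self.recent = deque(maxlen=window_size)   # short-term temporal continuity

    def update(self, chunk_latents):
        self.recent.append(chunk_latents)

    def as_condition(self):
        # Concatenate the persistent anchors with the rolling window along the time axis.
        return torch.cat([self.anchor, *self.recent], dim=1) if self.recent else self.anchor


def generate_long_video(model, anchors, num_chunks, chunk_shape):
    """`model` is any callable(noise, context) -> latents; purely illustrative."""
    ctx = HybridContext(anchors)
    chunks = []
    for _ in range(num_chunks):
        noise = torch.randn(chunk_shape)
        chunk = model(noise, ctx.as_condition())   # condition each chunk on hybrid context
        ctx.update(chunk)
        chunks.append(chunk)
    return torch.cat(chunks, dim=1)                # (batch, total_frames, dim)


if __name__ == "__main__":
    dummy_model = lambda noise, context: noise     # stand-in denoiser for the demo
    anchors = torch.randn(1, 2, 64)                # two reference latents
    video = generate_long_video(dummy_model, anchors, num_chunks=5, chunk_shape=(1, 8, 64))
    print(video.shape)  # torch.Size([1, 40, 64])
```

The design intuition behind such a buffer is that the fixed anchors preserve subject and scene identity over the full horizon, while the rolling window keeps motion locally smooth without letting the context grow with video length.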
Problem

Research questions and friction points this paper is trying to address.

compositional video synthesis
human-environment editing
spatial decoupling
generative flexibility
3D pre-processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial-Decoupled Motion Injection
Dynamic-Grounded-RoPE
Hybrid Context Integration
Compositional Video Synthesis
Video Foundation Models