🤖 AI Summary
Existing diffusion models struggle to generate long human videos with stable identity and precise motion, suffering from identity drift and temporal incoherence. This paper proposes a method that generates human videos of arbitrary length from a single reference image and a pose sequence. First, it introduces an in-context LoRA fine-tuning strategy that injects appearance features at the token level and embeds pose conditions at the channel level. Second, it introduces an interleaved segment-wise generation scheme with a shared KV cache to keep segments temporally consistent and allow seamless concatenation. Coherence is further improved through transition-frame optimization and cross-attention control. Trained on only 33 hours of video, the method significantly outperforms state-of-the-art approaches in identity fidelity, pose accuracy, and temporal coherence, and it enables high-fidelity, artifact-free synthesis of human motion videos of arbitrary length.
📝 Abstract
Generating long, temporally coherent videos with precise control over subject identity and motion is a formidable challenge for current diffusion models, which often suffer from identity drift and are limited to short clips. We introduce PoseGen, a novel framework that generates arbitrarily long videos of a specific subject from a single reference image and a driving pose sequence. Our core innovation is an in-context LoRA fine-tuning strategy that injects subject appearance at the token level for identity preservation while conditioning on pose information at the channel level for fine-grained motion control. To overcome duration limits, PoseGen pioneers an interleaved segment generation method that seamlessly stitches video clips together, using a shared KV cache mechanism and a specialized transition process to ensure background consistency and temporal smoothness. Although PoseGen is trained on a remarkably small 33-hour video dataset, extensive experiments show that it significantly outperforms state-of-the-art methods in identity fidelity and pose accuracy, and that it is uniquely able to produce coherent, artifact-free videos of unlimited duration.
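
To make the distinction between token-level and channel-level conditioning concrete, here is a minimal sketch in plain Python. It is not PoseGen's implementation: the function names, the toy shapes, and the zero-padding of reference tokens are all assumptions made for illustration. It only shows the general idea that token-level injection lengthens the sequence, while channel-level conditioning widens each token.

```python
# Illustrative sketch only. All names, shapes, and the zero-padding choice
# are assumptions for this example, not details taken from the paper.

def token_level_inject(video_tokens, ref_tokens):
    """Prepend reference-image tokens along the sequence (token) axis."""
    return ref_tokens + video_tokens


def channel_level_condition(video_tokens, pose_tokens):
    """Concatenate pose features onto each video token along the channel axis."""
    if len(video_tokens) != len(pose_tokens):
        raise ValueError("expected one pose token per video token")
    return [v + p for v, p in zip(video_tokens, pose_tokens)]


def build_model_input(video_tokens, pose_tokens, ref_tokens, pad_value=0.0):
    """Combine both conditioning paths into one token sequence.

    Assumption: reference tokens carry no pose, so their pose channels are
    filled with pad_value to keep every token the same width.
    """
    conditioned = channel_level_condition(video_tokens, pose_tokens)
    pose_dim = len(pose_tokens[0])
    padded_ref = [r + [pad_value] * pose_dim for r in ref_tokens]
    return token_level_inject(conditioned, padded_ref)


if __name__ == "__main__":
    video = [[1.0] * 3 for _ in range(4)]  # 4 video tokens, 3 channels each
    pose = [[2.0] * 2 for _ in range(4)]   # 4 pose tokens, 2 channels each
    ref = [[3.0] * 3 for _ in range(2)]    # 2 reference tokens, 3 channels each
    tokens = build_model_input(video, pose, ref)
    print(len(tokens), len(tokens[0]))
```

In this toy setup the sequence grows by the number of reference tokens and each token grows by the pose channel count; how a real model pads, projects, or attends over these tokens is a separate design choice not covered here.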