PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation

📅 2025-08-07
🤖 AI Summary
Existing diffusion models struggle to generate long human videos with stable identity and precise motion, suffering from identity drift and temporal incoherence. This paper proposes a method that generates human videos of unbounded length from a single reference image and a driving pose sequence. First, we design an in-context LoRA fine-tuning strategy that injects appearance features at the token level and embeds pose conditions at the channel level. Second, we introduce a novel interleaved segment-wise generation scheme with shared KV caching to ensure cross-segment temporal consistency and seamless concatenation, and we further enhance coherence via transition-frame optimization and cross-attention control. Trained on a modest 33-hour dataset, our method significantly outperforms state-of-the-art approaches in identity fidelity, pose accuracy, and temporal coherence, enabling high-fidelity, artifact-free synthesis of human motion videos of arbitrary length.
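The two conditioning paths described above can be illustrated with a minimal NumPy shape sketch: pose is concatenated along the channel axis of the video latent, while reference-image appearance tokens are prepended to the flattened token sequence. All names and shapes here are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

# Hypothetical latent shapes (illustrative only, not from the paper).
B, T, H, W, C = 1, 8, 4, 4, 16     # batch, frames, latent height/width, channels
n_ref = H * W                      # number of reference-image tokens

video_latent = np.random.randn(B, T, H, W, C)
pose_latent = np.random.randn(B, T, H, W, C)   # pose sequence encoded to latent shape

# Channel-level pose conditioning: concatenate along the channel axis,
# so every latent position carries its pose signal.
x = np.concatenate([video_latent, pose_latent], axis=-1)   # (B, T, H, W, 2C)

# Flatten spatio-temporal positions into a token sequence for the backbone.
tokens = x.reshape(B, T * H * W, 2 * C)

# Token-level appearance injection: prepend reference-image tokens
# (assumed here to be projected to the same token width).
ref_tokens = np.random.randn(B, n_ref, 2 * C)
seq = np.concatenate([ref_tokens, tokens], axis=1)

print(seq.shape)   # (1, 144, 32) -> n_ref + T*H*W tokens
```

In this sketch, only the LoRA-adapted backbone would need to learn to read the prepended reference tokens; the base weights stay frozen, which is consistent with the small training budget the summary reports.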

📝 Abstract
Generating long, temporally coherent videos with precise control over subject identity and motion is a formidable challenge for current diffusion models, which often suffer from identity drift and are limited to short clips. We introduce PoseGen, a novel framework that generates arbitrarily long videos of a specific subject from a single reference image and a driving pose sequence. Our core innovation is an in-context LoRA finetuning strategy that injects subject appearance at the token level for identity preservation, while simultaneously conditioning on pose information at the channel level for fine-grained motion control. To overcome duration limits, PoseGen pioneers an interleaved segment generation method that seamlessly stitches video clips together, using a shared KV cache mechanism and a specialized transition process to ensure background consistency and temporal smoothness. Trained on a remarkably small 33-hour video dataset, extensive experiments show that PoseGen significantly outperforms state-of-the-art methods in identity fidelity, pose accuracy, and its unique ability to produce coherent, artifact-free videos of unlimited duration.
Problem

Research questions and friction points this paper is trying to address.

Generating long, identity-preserving human videos from a single reference image
Preventing identity drift while enabling precise motion control in diffusion models
Ensuring temporal coherence in unlimited-length video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-context LoRA finetuning for identity preservation
Interleaved segment generation for long videos
Shared KV cache for seamless video stitching
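The shared-KV-cache idea behind the last two bullets can be sketched as follows: each new segment's queries attend both to its own tokens and to keys/values carried over from earlier segments, so consecutive clips share context at the attention level. This is a minimal single-head NumPy sketch under assumed shapes, not the paper's implementation (projections are omitted for brevity).

```python
import numpy as np

def attention(q, k, v):
    # Standard scaled dot-product attention with a stable softmax.
    w = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

d, seg_len = 8, 4                  # illustrative token width and segment length
rng = np.random.default_rng(0)
kv_cache = {"k": None, "v": None}  # shared across segments

def generate_segment(x, cache):
    # Identity q/k/v projections for brevity (a real model would learn these).
    q, k, v = x, x, x
    if cache["k"] is not None:
        # Shared KV cache: the new segment attends to its own tokens
        # plus keys/values carried over from earlier segments.
        k = np.concatenate([cache["k"], k], axis=0)
        v = np.concatenate([cache["v"], v], axis=0)
    out = attention(q, k, v)
    cache["k"], cache["v"] = k, v
    return out

seg1 = generate_segment(rng.normal(size=(seg_len, d)), kv_cache)
seg2 = generate_segment(rng.normal(size=(seg_len, d)), kv_cache)
print(seg2.shape, kv_cache["k"].shape)  # (4, 8) (8, 8)
```

Because the second segment's output is computed against the first segment's cached keys and values, stitched clips stay consistent without regenerating earlier frames; the abstract's transition process would additionally smooth the boundary frames themselves.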