🤖 AI Summary
Existing methods struggle to deliver low-latency, interactive garment-level human video customization from single-outfit video data alone while maintaining both motion coherence and real-time performance. This work proposes FashionChameleon, a framework that enables real-time outfit switching during autoregressive video generation while preserving motion consistency. Its key innovations are interactive multi-outfit customization without multi-outfit training data, and a training-free KV cache rescheduling mechanism (garment KV refresh, historical KV withdraw, and reference KV disentangle) that preserves motion coherence without retraining. The approach also trains a teacher model with in-context learning and applies streaming distillation to improve generation consistency and efficiency. FashionChameleon supports temporally consistent long-video extrapolation and achieves 23.8 FPS on a single GPU, a 30–180× speedup over existing baselines.
📝 Abstract
Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot support low-latency, interactive garment control, which is crucial for applications such as e-commerce and content creation. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, where users can interactively switch garments during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference and garment images, the model is encouraged to implicitly preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model to interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which includes garment KV refresh, historical KV withdraw, and reference KV disentangle to achieve garment switching while preserving motion coherence. FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU, 30-180$\times$ faster than existing baselines.
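As a rough illustration of what rescheduling a KV cache at a garment switch could look like, the toy sketch below keeps reference, garment, and frame-history entries in separate slots. The abstract only names the three operations, so everything here is an assumption: the class and method names (`ToyKVCache`, `refresh_garment`, `withdraw_history`, `context`) are hypothetical, the entries are string labels rather than tensors, and the exact semantics of each operation in FashionChameleon may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KVEntry:
    """Toy stand-in for one cached key/value block (a real cache holds tensors)."""
    source: str  # "reference", "garment", or "frame"
    tag: str     # hypothetical label, e.g. a garment id or a frame index


@dataclass
class ToyKVCache:
    """Hypothetical cache with one slot per conditioning source."""
    reference: List[KVEntry] = field(default_factory=list)
    garment: List[KVEntry] = field(default_factory=list)
    history: List[KVEntry] = field(default_factory=list)

    def append_frame(self, frame_idx: int) -> None:
        # Autoregressive step: cache the entries of a newly generated frame.
        self.history.append(KVEntry("frame", str(frame_idx)))

    def refresh_garment(self, garment_id: str) -> None:
        # Assumed reading of "garment KV refresh":
        # replace the cached garment entries with those of the new garment.
        self.garment = [KVEntry("garment", garment_id)]

    def withdraw_history(self, n: int) -> None:
        # Assumed reading of "historical KV withdraw":
        # drop the n most recent frame entries, which still encode the old garment.
        if n > 0:
            del self.history[-n:]

    def context(self, use_reference: bool = True) -> List[KVEntry]:
        # Assumed reading of "reference KV disentangle":
        # reference entries live in their own slot, so they can be
        # included or left out independently of garment and history.
        ref = self.reference if use_reference else []
        return ref + self.garment + self.history


# Usage: generate four frames with one garment, then switch mid-stream.
cache = ToyKVCache(reference=[KVEntry("reference", "person")])
cache.refresh_garment("shirt_A")
for i in range(4):
    cache.append_frame(i)

cache.withdraw_history(2)          # discard the two latest frames' entries
cache.refresh_garment("shirt_B")   # swap in the new garment
print([e.tag for e in cache.context()])
```

Keeping the three sources in separate slots is a design choice made for this sketch: it lets a switch touch only the garment and recent-history entries, with no parameter updates, which is the sense in which such a scheme would be training-free.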