Generating Fit Check Videos with a Handheld Camera

📅 2025-05-29

🤖 AI Summary

This paper targets high-fidelity full-body motion video generation from handheld capture on a single smartphone, without mounted cameras, careful framing, or repeated rehearsals. The proposed video diffusion-based framework takes two static mirror selfies (front and back views of the subject) and an IMU motion reference recorded while the subject holds the phone, and synthesizes a video of the subject performing a similar target motion. A parameter-free frame generation strategy and a multi-reference attention mechanism integrate appearance information from both selfies into the diffusion model. An image-based fine-tuning strategy sharpens frames and improves the generated shadows and reflections, and the subject can be rendered into a new scene with consistent illumination and shadows. According to the reported experiments, the method achieves state-of-the-art results in pose coherence, dynamic shadow modeling, and specular reflection synthesis under unconstrained mobile capture conditions.

📝 Abstract

Self-captured full-body videos are popular, but most deployments require mounted cameras, carefully framed shots, and repeated practice. We propose a more convenient solution that enables full-body video capture using handheld mobile devices. Our approach takes as input two static photos (front and back) of you in a mirror, along with an IMU motion reference that you perform while holding your mobile phone, and synthesizes a realistic video of you performing a similar target motion. We also enable rendering into a new scene, with consistent illumination and shadows. We propose a novel video diffusion-based model to achieve this. Specifically, we propose a parameter-free frame generation strategy, as well as a multi-reference attention mechanism, which together integrate appearance information from both the front and back selfies into the video diffusion model. Additionally, we introduce an image-based fine-tuning strategy to enhance frame sharpness and improve the generation of shadows and reflections, achieving a more realistic human-scene composition.
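
The abstract does not spell out how the multi-reference attention mechanism is built. One common way to let a generated frame attend to several reference images is to append the reference tokens to the keys and values of an attention layer. The NumPy sketch below illustrates only that general idea; the function name, the token shapes, and the absence of learned projections are all assumptions made for illustration, not the paper's architecture.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_reference_attention(frame_tokens, reference_tokens):
    """Hypothetical sketch: attend from frame tokens to frame + reference tokens.

    frame_tokens:     (n, d) array of tokens for the frame being generated.
    reference_tokens: list of (m_i, d) arrays, e.g. front and back selfie tokens.
    Learned query/key/value projections are omitted (an assumption for brevity).
    """
    d = frame_tokens.shape[-1]
    # Keys and values: the frame's own tokens plus every reference's tokens.
    context = np.concatenate([frame_tokens, *reference_tokens], axis=0)
    # Scaled dot-product scores between each frame token and each context token.
    scores = frame_tokens @ context.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # Each output token is a weighted mix of frame and reference tokens.
    return weights @ context, weights


rng = np.random.default_rng(0)
frame = rng.normal(size=(4, 8))   # 4 frame tokens, dimension 8
front = rng.normal(size=(6, 8))   # 6 tokens from the front selfie
back = rng.normal(size=(6, 8))    # 6 tokens from the back selfie
out, weights = multi_reference_attention(frame, [front, back])
print(out.shape, weights.shape)
```

In this sketch each frame token distributes its attention over 4 + 6 + 6 = 16 context tokens, so appearance details from either selfie can flow into the frame without adding any new parameters to the attention layer.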

Problem

Research questions and friction points this paper is trying to address.

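Capturing full-body video with a handheld mobile device, without mounted cameras, careful framing, or repeated practice

Synthesizing a realistic video of a target motion from two static mirror photos and an IMU motion reference

Rendering the subject into a new scene with consistent illumination, shadows, and reflections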

Innovation

Methods, ideas, or system contributions that make the work stand out.

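Parameter-free frame generation strategy for integrating front and back selfies into a video diffusion model

Multi-reference attention mechanism that combines appearance information from both selfies

Image-based fine-tuning that sharpens frames and improves shadows and reflections for realistic human-scene composition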
