AI Summary
Existing video translation methods exhibit significantly weaker local structural preservation and temporal consistency compared to image-based models, frequently suffering from motion jitter and structural distortion. This paper proposes a diffusion-based framework for temporally consistent human action video translation. Its core innovation is the first introduction of Query Warping: a mechanism that explicitly aligns cross-frame query tokens via appearance flow and constrains self-attention outputs during denoising to ensure motion coherence. The approach jointly preserves local geometric fidelity and global temporal stability. Evaluated on multiple human action video translation benchmarks, the method achieves new state-of-the-art performance, with substantial improvements in quantitative metrics (e.g., FVD, LPIPS) and qualitative visual quality. It effectively mitigates motion jitter and structural artifacts, demonstrating superior spatiotemporal consistency.
Abstract
In this paper, we present QueryWarp, a novel framework for temporally coherent human motion video translation. Existing diffusion-based video editing approaches rely solely on key and value tokens to ensure temporal consistency, which sacrifices the preservation of local and structural regions. In contrast, we aim to exploit complementary query priors by constructing temporal correlations among query tokens from different frames. Initially, we extract appearance flows from source poses to capture continuous human foreground motion. Subsequently, during the denoising process of the diffusion model, we employ appearance flows to warp the previous frame's query tokens, aligning them with the current frame's queries. This query warping imposes explicit constraints on the outputs of self-attention layers, effectively guaranteeing temporally coherent translation. We perform experiments on various human motion video translation tasks, and the results demonstrate that our QueryWarp framework surpasses state-of-the-art methods both qualitatively and quantitatively.
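The warping step described above can be sketched in a few lines. This is a minimal, hedged illustration, not the paper's implementation: the function names `warp_queries` and `fuse_queries`, the nearest-neighbor backward warp, the foreground mask, and the blending weight `alpha` are all assumptions made here for clarity. In the actual method the flow is predicted from source poses and the fusion happens inside the self-attention layers of a latent diffusion model.

```python
import numpy as np

def warp_queries(prev_q, flow):
    """Backward-warp the previous frame's query tokens along a per-token
    appearance flow (nearest-neighbor lookup for simplicity; the real
    method would use a differentiable, sub-pixel warp)."""
    h, w, _ = prev_q.shape
    warped = np.zeros_like(prev_q)
    for y in range(h):
        for x in range(w):
            # flow[y, x] points to where token (y, x) came from in the previous frame
            dx, dy = flow[y, x]
            sx = int(np.clip(x + dx, 0, w - 1))
            sy = int(np.clip(y + dy, 0, h - 1))
            warped[y, x] = prev_q[sy, sx]
    return warped

def fuse_queries(cur_q, warped_q, fg_mask, alpha=0.5):
    """Blend current queries with flow-aligned previous queries inside the
    human-foreground mask; background queries are left untouched.
    `alpha` is a hypothetical blending weight, not from the paper."""
    m = fg_mask[..., None].astype(cur_q.dtype)
    return m * (alpha * cur_q + (1.0 - alpha) * warped_q) + (1.0 - m) * cur_q
```

With a zero flow the warp is the identity, and with an empty foreground mask the current queries pass through unchanged, which makes the constraint easy to sanity-check in isolation.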