YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing DiT-based video object removal methods compute densely over all spatiotemporal tokens, so inference latency stays high and does not shrink when the masked region is small. This work proposes YOSE, an efficient fine-tuning framework that uses the mask to dynamically select only the essential tokens. YOSE introduces a differentiable Batch Variable-length Indexing (BVI) operator, which enables mask-aware variable-length token processing across samples, and a Diffusion Process Simulator (DiffSim), which approximates the influence of unmasked tokens within DiT self-attention to preserve semantic consistency for the masked tokens. YOSE achieves up to 2.5× speedup in 70% of test cases while maintaining visual quality comparable to the baseline.
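The index bookkeeping behind mask-guided, variable-length token selection can be sketched in a few lines of plain Python. This is an illustration only, not the paper's BVI operator: it is not differentiable, it works on Python lists rather than tensors, and the names `pack_masked_tokens`, `scatter_back`, and `cu_seqlens` are hypothetical choices made for this sketch (cumulative sequence lengths are a common convention in variable-length attention kernels).

```python
from itertools import accumulate


def pack_masked_tokens(tokens, masks):
    """Gather the masked tokens of every sample into one flat list.

    tokens: list of samples, each a list of token values
    masks:  list of samples, each a list of 0/1 flags (1 = token is selected)
    Returns (packed, indices, cu_seqlens), where cu_seqlens[b]:cu_seqlens[b + 1]
    is the slice of `packed` that belongs to sample b.
    """
    packed, indices, lengths = [], [], []
    for sample_tokens, sample_mask in zip(tokens, masks):
        if len(sample_tokens) != len(sample_mask):
            raise ValueError("tokens and mask must have the same length")
        keep = [i for i, flag in enumerate(sample_mask) if flag]
        indices.append(keep)
        lengths.append(len(keep))
        packed.extend(sample_tokens[i] for i in keep)
    cu_seqlens = [0] + list(accumulate(lengths))
    return packed, indices, cu_seqlens


def scatter_back(original, processed, indices, cu_seqlens):
    """Write processed tokens back to their original positions (returns a copy)."""
    out = [list(sample) for sample in original]
    for b, keep in enumerate(indices):
        start = cu_seqlens[b]
        for offset, position in enumerate(keep):
            out[b][position] = processed[start + offset]
    return out


# Two samples with different numbers of masked tokens (2 and 1).
tokens = [[10, 11, 12, 13], [20, 21, 22, 23]]
masks = [[0, 1, 1, 0], [1, 0, 0, 0]]

packed, indices, cu_seqlens = pack_masked_tokens(tokens, masks)
processed = [t * 2 for t in packed]  # stand-in for the expensive per-token work
restored = scatter_back(tokens, processed, indices, cu_seqlens)
print(packed, cu_seqlens, restored)
```

Only the selected tokens pass through the expensive step, so the work per batch follows the total number of masked tokens rather than the full token count. Unselected tokens are left untouched here; in YOSE their influence on the masked tokens is approximated by DiffSim.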

📝 Abstract

Recent advances in Diffusion Transformer (DiT)-based video generation technologies have shown impressive results for video object removal. However, these methods still suffer from substantial inference latency. For instance, although MiniMax Remover achieves state-of-the-art visual quality, it operates at only around 10 FPS, primarily due to dense computations over the entire spatiotemporal token space, even when only a small masked region actually requires processing. In this paper, we present YOSE (You Only Select Essential Tokens), an efficient fine-tuning framework. YOSE introduces two key components: Batch Variable-length Indexing (BVI) and a Diffusion Process Simulator (DiffSim) module. BVI is a differentiable dynamic indexing operator that adaptively selects essential tokens based on mask information, enabling variable-length token processing across samples. DiffSim provides a diffusion process approximation mechanism for unmasked tokens, which simulates the influence of unmasked regions within DiT self-attention to maintain semantic consistency for masked tokens. With these designs, YOSE achieves mask-aware acceleration, where the inference time scales approximately linearly with the size of the masked region, in contrast to full-token diffusion methods whose computation remains constant regardless of the mask size. Extensive experiments demonstrate that YOSE achieves up to 2.5× speedup in 70% of cases while maintaining visual quality comparable to the baseline. Code is available at: https://github.com/Wucy0519/YOSE-CVPR26.
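To make the contrast between mask-proportional and constant cost concrete, here is a toy cost model. It is an assumption made purely for illustration and is not taken from the paper: the dense baseline is normalised to a cost of 1.0, and the selective method is assumed to cost a hypothetical `fixed_overhead` plus a term proportional to the masked fraction of tokens. The function name `estimated_speedup` and both parameters are invented for this sketch, and its outputs are not measured results.

```python
def estimated_speedup(mask_ratio, fixed_overhead=0.0):
    """Speedup over a dense baseline under a toy linear cost model.

    mask_ratio:     fraction of tokens inside the mask, in (0, 1]
    fixed_overhead: hypothetical mask-independent cost, as a fraction of the
                    dense cost (for example, the work done for unmasked tokens)
    """
    if not 0.0 < mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in (0, 1]")
    if fixed_overhead < 0.0:
        raise ValueError("fixed_overhead must be non-negative")
    dense_cost = 1.0
    selective_cost = fixed_overhead + mask_ratio
    return dense_cost / selective_cost


# Illustrative values only: smaller masks give larger speedups,
# and the fixed overhead caps the achievable gain.
for ratio in (0.1, 0.25, 0.5, 1.0):
    print(ratio, round(estimated_speedup(ratio, fixed_overhead=0.1), 2))
```

Under this model the dense method costs the same for every mask, while the selective cost grows linearly with `mask_ratio`. Once any fixed overhead is included, a full-frame mask yields no gain at all, which is one plausible reason a speedup figure would be reported for a share of test cases rather than for all of them.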

Problem

Research questions and friction points this paper is trying to address.

video object removal

Diffusion Transformer

inference latency

spatiotemporal tokens

computational efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

essential token selection

mask-aware acceleration

Diffusion Transformer