Semantic Frame Interpolation

📅 2025-07-07
🤖 AI Summary
Traditional frame interpolation methods produce a fixed number of output frames, offer little text control, and implicitly assume small motion between the input frames; the setting also lacks a unified problem formulation and a standardized benchmark. To address these limitations, the paper proposes Semantic Frame Interpolation (SFI): given a start frame, an end frame, and a text prompt, generate temporally coherent intermediate video of varying length, with inference supported at multiple frame rates. The authors introduce SFI-300K, described as the first general-purpose dataset and benchmark for SFI, and SemFi, a model built on Wan2.1 that uses a Mixture-of-LoRA module to keep generated content consistent with the control conditions across different frame lengths. Evaluation covers multiple dimensions, spanning image and video as well as consistency and diversity. Experiments on SFI-300K indicate that the method is well suited to the requirements of the SFI task.

📝 Abstract
Generating intermediate video content of varying lengths from given first and last frames, together with a text prompt, offers significant research and application potential. However, traditional frame interpolation tasks focus on scenarios with a small number of frames, no text control, and minimal differences between the first and last frames. Recently, community developers have adapted large video models, represented by Wan, to provide first-and-last-frame generation capabilities. These models can only generate a fixed number of frames and often fail to produce satisfactory results at certain frame lengths, and the setting itself lacks a clear official definition and a well-established benchmark. In this paper, we first formally define a new, practical Semantic Frame Interpolation (SFI) task, which covers both of the above settings and supports inference at multiple frame rates. To this end, we propose SemFi, a novel model built upon Wan2.1 that incorporates a Mixture-of-LoRA module to generate highly consistent content that aligns with the control conditions under various frame-length constraints. Furthermore, we propose SFI-300K, the first general-purpose dataset and benchmark specifically designed for SFI. To build it, we collect and process data from the perspective of SFI and carefully design evaluation metrics and methods that assess model performance across multiple dimensions, covering both image and video and aspects such as consistency and diversity. Through extensive experiments on SFI-300K, we demonstrate that our method is particularly well suited to the requirements of the SFI task.
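
The abstract names a Mixture-of-LoRA module but does not describe its internals. As a rough illustration of the general idea only, the sketch below adds a gated sum of low-rank (LoRA) updates to a frozen base linear layer. Everything here is an assumption for illustration: the function names, the softmax gating, and the idea that gate logits might depend on the requested frame length are hypothetical and not taken from the paper or from Wan2.1.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_lora(
    x: Vector,
    base: Matrix,
    experts: List[Tuple[Matrix, Matrix]],
    gate_logits: Vector,
    scale: float = 1.0,
) -> Vector:
    """Hypothetical mixture layer: y = W x + scale * sum_k g_k * B_k (A_k x).

    `base` is the frozen weight W; each expert is a (down, up) pair,
    i.e. the low-rank factors A_k and B_k. The gate logits could, for
    example, be derived from the target frame count (an assumption).
    """
    gates = softmax(gate_logits)
    out = matvec(base, x)
    for gate, (down, up) in zip(gates, experts):
        delta = matvec(up, matvec(down, x))  # low-rank update B_k (A_k x)
        out = [o + scale * gate * d for o, d in zip(out, delta)]
    return out


# Toy example: identity base weight and two rank-1 experts.
base_weight = [[1.0, 0.0], [0.0, 1.0]]
lora_experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),  # expert 0 acts on the first coordinate
    ([[0.0, 1.0]], [[0.0], [1.0]]),  # expert 1 acts on the second coordinate
]
result = mixture_of_lora([1.0, 2.0], base_weight, lora_experts, [0.0, 0.0])
print(result)  # [1.5, 3.0] with equal gates of 0.5 each
```

With equal gate logits, each expert contributes half of its update; pushing one logit far above the others makes the layer behave like a single LoRA adapter, which is one plausible way a model could specialize per frame length.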
Problem

Research questions and friction points this paper is trying to address.

Generating intermediate video frames with text control
Overcoming fixed frame count limitations in interpolation
Establishing a benchmark for semantic frame interpolation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Frame Interpolation with Mixture-of-LoRA
SFI-300K dataset for multi-frame benchmarks
Multi-frame-rate inference with high consistency
Authors
Yijia Hong
Shanghai Jiao Tong University
Jiangning Zhang
Zhejiang University, Tencent YouTu Lab
Ran Yi
Associate Professor, Shanghai Jiao Tong University
Computer Vision, Computer Graphics
Yuji Wang
Shanghai Jiao Tong University
Weijian Cao
Tencent
CV, CG
Xiaobin Hu
Tencent YouTu Lab; Technische Universität München (TUM)
Deep learning, Computer vision, VLM, Agents
Zhucun Xue
Zhejiang University
Yabiao Wang
Tencent YouTu Lab
Chengjie Wang
Tencent YouTu Lab
Lizhuang Ma
Shanghai Jiao Tong University