HECTOR: Hybrid Editable Compositional Object References for Video Generation

📅 2026-03-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video generation models struggle to achieve fine-grained, explicit spatiotemporal compositional control over multiple objects. To address this limitation, this work proposes the HECTOR framework, which introduces, for the first time, a hybrid reference guidance mechanism that combines static images and dynamic videos, along with an explicit object trajectory control module. By leveraging trajectory-driven spatiotemporal alignment and a composable generative architecture, HECTOR enables independent and precise manipulation of each object’s position, scale, and velocity. The authors report that the method outperforms current state-of-the-art approaches in visual quality, reference fidelity, and motion controllability, generating high-fidelity, temporally coherent videos that satisfy complex spatiotemporal semantic constraints.

📝 Abstract

Real-world videos naturally portray complex interactions among distinct physical objects, effectively forming dynamic compositions of visual elements. However, most current video generation models synthesize scenes holistically and therefore lack mechanisms for explicit compositional manipulation. To address this limitation, we propose HECTOR, a generative pipeline that enables fine-grained compositional control. In contrast to prior methods, HECTOR supports hybrid reference conditioning, allowing generation to be simultaneously guided by static images and/or dynamic videos. Moreover, users can explicitly specify the trajectory of each referenced element, precisely controlling its location, scale, and speed (see Figure 1). This design allows the model to synthesize coherent videos that satisfy complex spatiotemporal constraints while preserving high-fidelity adherence to references. Extensive experiments demonstrate that HECTOR achieves superior visual quality, stronger reference preservation, and improved motion controllability compared with existing approaches.
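
To make the idea of a per-object trajectory concrete, the sketch below shows one way a user-specified trajectory could be represented: a few keyframes giving an object's location and scale, linearly interpolated into one entry per frame. This is an illustration only, not HECTOR's interface; the `Keyframe` class, the `interpolate_trajectory` function, the normalized coordinates, and the use of linear interpolation are all assumptions introduced here. Under this representation, speed follows from keyframe spacing: the same displacement over fewer frames means faster motion.

```python
from dataclasses import dataclass


@dataclass
class Keyframe:
    """Hypothetical keyframe: where one referenced object sits at a given frame."""
    frame: int    # frame index
    cx: float     # box centre x, normalized to [0, 1] (assumed convention)
    cy: float     # box centre y, normalized to [0, 1] (assumed convention)
    scale: float  # box size relative to the frame (assumed convention)


def interpolate_trajectory(keyframes, num_frames):
    """Linearly interpolate keyframes into one (cx, cy, scale) tuple per frame.

    Frames before the first keyframe or after the last one hold the nearest
    keyframe's values.
    """
    kfs = sorted(keyframes, key=lambda k: k.frame)
    track = []
    for t in range(num_frames):
        if t <= kfs[0].frame:
            k = kfs[0]
            track.append((k.cx, k.cy, k.scale))
        elif t >= kfs[-1].frame:
            k = kfs[-1]
            track.append((k.cx, k.cy, k.scale))
        else:
            # Find the pair of keyframes that brackets frame t.
            for a, b in zip(kfs, kfs[1:]):
                if a.frame <= t < b.frame:
                    w = (t - a.frame) / (b.frame - a.frame)
                    track.append((
                        a.cx + w * (b.cx - a.cx),
                        a.cy + w * (b.cy - a.cy),
                        a.scale + w * (b.scale - a.scale),
                    ))
                    break
    return track


# One object moving left to right while growing, over a 5-frame clip.
track = interpolate_trajectory(
    [Keyframe(0, 0.2, 0.5, 0.1), Keyframe(4, 0.6, 0.5, 0.2)],
    num_frames=5,
)
```

Each referenced object would get its own track of this kind, which is what makes independent control of location, scale, and speed per object possible in principle.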

Problem

Research questions and friction points this paper is trying to address.

video generation

compositional control

object manipulation

spatiotemporal constraints

reference conditioning

Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional control

hybrid reference conditioning

trajectory specification