Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

📅 2026-03-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work tackles two obstacles that keep video understanding models from reasoning about dynamic change: sparse temporal signals in real-world datasets and systematic temporal errors in synthetically generated training data. The authors propose a paradigm that uses procedurally generated synthetic videos, with simple geometric shapes as carriers, to explicitly teach models transferable temporal primitives such as direction, velocity, and state tracking. Temporal understanding is decoupled into short-term perception and long-term cognition, and code-based video generation produces 7.7K chain-of-thought examples and 7K reinforcement-learning samples, each with precise frame-level annotations, for post-training. Remarkably, with only the 7.7K synthetic samples, the method significantly outperforms Video-R1, which is trained on 165K real-world samples, across 15 video understanding benchmarks, demonstrating that fundamental temporal skills learned from abstract synthetic data transfer efficiently to real videos.

📝 Abstract
The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives, constructing 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Remarkably, our 7.7K synthetic CoT samples outperform Video-R1 with 165K real-world samples. We attribute this to fundamental temporal skills, such as tracking frame-by-frame changes and comparing velocities, that transfer effectively from abstract synthetic patterns to complex real-world scenarios. This establishes a new paradigm for video post-training: video temporal learning through carefully designed synthetic data provides a more cost-efficient scaling path.
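To make the abstract's "code-based video generation" idea concrete, here is a minimal sketch of how a procedurally generated clip with exact frame-level temporal annotations might look. This is an illustrative toy, not the paper's actual pipeline: the function name `generate_motion_clip`, the frame size, and the annotation schema are all assumptions. The key property it demonstrates is that because frames are rendered from code, direction and speed labels are correct by construction.

```python
import numpy as np

def generate_motion_clip(num_frames=16, size=64, speed=3, direction=(1, 0), seed=0):
    """Render a toy clip of an 8x8 square moving at constant velocity.

    Hypothetical sketch of code-based synthetic video generation:
    every frame is drawn programmatically, so per-frame position and
    velocity annotations are exact ground truth (no labeling noise).
    """
    rng = np.random.default_rng(seed)
    # Random start position, kept away from the edges.
    x, y = (int(v) for v in rng.integers(8, size - 16, size=2))
    dx, dy = direction
    frames, annotations = [], []
    for t in range(num_frames):
        frame = np.zeros((size, size), dtype=np.uint8)
        frame[y:y + 8, x:x + 8] = 255  # draw the white square
        frames.append(frame)
        annotations.append({"t": t, "pos": (x, y),
                            "velocity": (dx * speed, dy * speed)})
        # Advance with wrap-around so the square stays fully in frame.
        x = (x + dx * speed) % (size - 8)
        y = (y + dy * speed) % (size - 8)
    return np.stack(frames), annotations

clip, labels = generate_motion_clip()
# QA pairs for post-training can then be derived mechanically, e.g.
# "Which direction does the square move?" -> read off labels[0]["velocity"].
```

Question-answer pairs about direction, speed comparison, or state tracking can be emitted from `labels` rather than annotated by a model, which is what removes the systematic perception errors the abstract attributes to proprietary-model-generated data.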
Problem

Research questions and friction points this paper is trying to address.

temporal reasoning
video understanding
vision-language models
synthetic data
temporal dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal primitives
synthetic video generation
video reasoning
vision-language models
post-training