Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

📅 2025-09-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video large language models (Video LLMs) struggle with fine-grained spatiotemporal referring and reasoning in dynamic, real-world scenarios, particularly when user queries anchor to time-based events or use gestural cues to disambiguate objects and positions. To address this, the paper proposes Strefer, a framework built around a synthetic instruction data engine that requires no human annotation. The engine pseudo-annotates raw videos with temporally dense, fine-grained metadata, capturing subjects, objects, their locations as masklets, and their action descriptions and timelines in a structured form, which is then converted into diverse instruction-tuning data. Experiments show that models trained on Strefer-generated data outperform baselines on tasks requiring spatial and temporal disambiguation, without relying on proprietary models or large-scale manual annotation, and exhibit stronger space-time-aware reasoning. Strefer thus offers a scalable, low-cost path to spatiotemporally aware video understanding.
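The summary describes the engine's output only at a high level. As a minimal sketch of how such temporally dense pseudo-labels might be organized, the Python dataclasses below are illustrative assumptions; the class and field names are hypothetical, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema: names and structure are illustrative assumptions,
# not the data format used by Strefer.

@dataclass
class Masklet:
    """Per-frame segmentation masks for one entity."""
    entity_id: str            # e.g. "person_1", "cup_2"
    frame_indices: List[int]  # frames where the entity is visible
    rle_masks: List[str]      # run-length-encoded masks, one per listed frame

@dataclass
class ActionSegment:
    """A temporally localized action description for one subject."""
    subject_id: str           # entity performing the action
    object_ids: List[str]     # entities the action involves, if any
    description: str          # e.g. "picks up the red cup"
    start_sec: float
    end_sec: float

@dataclass
class VideoMetadata:
    """Temporally dense pseudo-labels for a single raw video."""
    video_id: str
    duration_sec: float
    masklets: List[Masklet] = field(default_factory=list)
    actions: List[ActionSegment] = field(default_factory=list)
```

In this reading, masklets supply the spatial anchors (where an entity is, frame by frame) and action segments supply the temporal anchors (what happens, and when), which together are what the generated instructions can refer to.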

📝 Abstract
Next-generation AI companions must go beyond general video understanding to resolve spatial and temporal references in dynamic, real-world environments. Existing Video Large Language Models (Video LLMs), while capable of coarse-level comprehension, struggle with fine-grained, spatiotemporal reasoning, especially when user queries rely on time-based event references for temporal anchoring, or gestural cues for spatial anchoring to clarify object references and positions. To bridge this critical gap, we introduce Strefer, a synthetic instruction data generation framework designed to equip Video LLMs with spatiotemporal referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata, capturing rich spatial and temporal information in a structured manner, including subjects, objects, their locations as masklets, and their action descriptions and timelines. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions. Experimental evaluations show that, without using proprietary models, costly human annotation, or annotating large volumes of new videos, models trained with data produced by Strefer outperform baselines on tasks requiring spatial and temporal disambiguation. Additionally, these models exhibit enhanced space-time-aware reasoning, establishing a new foundation for perceptually grounded, instruction-tuned Video LLMs.
Problem

Research questions and friction points this paper is trying to address.

Resolving spatial and temporal references in dynamic environments
Addressing fine-grained spatiotemporal reasoning in Video LLMs
Enhancing interpretation of time-based events and gestural cues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic instruction data generation framework
Pseudo-annotates dense spatiotemporal video metadata (see the sketch after this list)
Enhances spatiotemporal reasoning without human annotation
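As a rough illustration of the idea behind the data engine, the sketch below turns one pseudo-annotated record (the hypothetical VideoMetadata from the earlier sketch) into temporally and spatially anchored question-answer pairs. The templates and answer construction are assumptions for illustration, not the paper's actual generation procedure.

```python
import random
from typing import Dict, List

# Illustrative only: the templates below are assumptions about how structured
# pseudo-labels could seed instruction-tuning data, not Strefer's procedure.

def build_referring_examples(meta: "VideoMetadata") -> List[Dict[str, str]]:
    """Turn action segments into QA pairs that anchor a question to a time
    span (temporal referring) or to a referenced activity (spatial referring)."""
    examples: List[Dict[str, str]] = []
    for act in meta.actions:
        # Temporal referring: the time span is given, the activity is asked.
        examples.append({
            "video_id": meta.video_id,
            "question": f"What is {act.subject_id} doing between "
                        f"{act.start_sec:.1f}s and {act.end_sec:.1f}s?",
            "answer": act.description,
        })
        # Spatial/relational referring: the activity is given, the entities
        # involved are asked (falls back to the subject if none are labeled).
        examples.append({
            "video_id": meta.video_id,
            "question": f"While someone {act.description}, "
                        f"which entities are involved?",
            "answer": ", ".join(act.object_ids) if act.object_ids else act.subject_id,
        })
    random.shuffle(examples)  # vary ordering across the generated set
    return examples
```

In the actual framework, references would presumably be grounded in the masklets themselves (e.g. pointing at a region or gesture) rather than string IDs, but the mapping from structured metadata to instruction pairs follows the same shape.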