Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the tendency of video large language models to hallucinate in dynamic scenes, which the authors attribute to a failure to persistently track object identities, states, and relations over time. To mitigate this, they propose STEMO-Track, an object-centric framework that explicitly constructs structured object trajectories and reasons over them using chunk-wise state extraction and temporal aggregation. They also introduce STEMO-Bench, a benchmark of human-verified object-centric facts that decomposes each query into sub-questions, so that intermediate reasoning is evaluated rather than only the final answer. According to the authors, experiments show that the approach significantly reduces hallucinated answers and improves spatio-temporal reasoning consistency over state-of-the-art multimodal large language models.
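The summary names chunk-wise state extraction followed by temporal aggregation, but it gives no implementation details. The sketch below is only an illustration of what aggregating per-chunk object states into trajectories could look like. The `ChunkState` schema, the `aggregate` and `state_at` helpers, and the rule of merging consecutive identical states are all assumptions made for this example, not the authors' method.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ChunkState:
    """State of one object observed in one video chunk (hypothetical schema)."""
    object_id: str
    chunk_index: int
    state: str


@dataclass
class Trajectory:
    """Ordered state history for a single object."""
    object_id: str
    # (first_chunk, last_chunk, state) segments in temporal order
    segments: list = field(default_factory=list)


def aggregate(chunk_states):
    """Merge per-chunk observations into one trajectory per object.

    Consecutive observations of the same object that report the same state
    are collapsed into a single segment (an assumed aggregation rule).
    """
    by_object = defaultdict(list)
    for obs in chunk_states:
        by_object[obs.object_id].append(obs)

    trajectories = {}
    for object_id, observations in by_object.items():
        observations.sort(key=lambda o: o.chunk_index)
        traj = Trajectory(object_id)
        for obs in observations:
            if traj.segments and traj.segments[-1][2] == obs.state:
                # Same state as the previous observation: extend the segment.
                first, _, state = traj.segments[-1]
                traj.segments[-1] = (first, obs.chunk_index, state)
            else:
                traj.segments.append((obs.chunk_index, obs.chunk_index, obs.state))
        trajectories[object_id] = traj
    return trajectories


def state_at(trajectories, object_id, chunk_index):
    """Return an object's state at a chunk, or None if no segment covers it."""
    traj = trajectories.get(object_id)
    if traj is None:
        return None
    for first, last, state in traj.segments:
        if first <= chunk_index <= last:
            return state
    return None
```

In a setup like this, a question such as "where was the cup when the ball started rolling?" would be answered by looking up the stored trajectory instead of re-reading the raw frames, which is the kind of persistent tracking the summary describes.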

📝 Abstract

While multimodal large language models (MLLMs) have advanced video understanding, they remain highly prone to hallucinations in dynamic scenes. We argue this stems from a failure in spatio-temporal monitoring, the ability to persistently track object identities, states, and relations over time. Existing benchmarks obscure this deficit by relying on single final-answer evaluations for queries that can often be resolved via local visual cues or statistical priors. To rigorously diagnose this, we introduce STEMO-Bench (Spatio-TEmporal MOnitoring), a benchmark of human-verified object-centric facts that evaluates intermediate reasoning by decomposing queries into sub-questions, distinguishing genuine temporal understanding from coincidental correctness. To address failure modes exposed by STEMO, we propose STEMO-Track, a novel object-centric framework that explicitly constructs and reasons over structured object trajectories via chunk-wise state extraction and temporal aggregation. Extensive experiments demonstrate that our object-centric framework significantly reduces hallucinated answers and improves spatio-temporal reasoning consistency over state-of-the-art MLLMs.
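The abstract contrasts single final-answer evaluation with scoring that also checks sub-questions. As a rough illustration only, the sketch below separates answers that are correct and supported by every sub-question from answers that are correct by coincidence. The function names, the three labels, and the "all sub-questions must be correct" rule are assumptions for this example. The abstract does not specify the benchmark's actual metric.

```python
def score_query(final_correct, sub_correct):
    """Classify one query from its final answer and its sub-question results.

    final_correct: bool, whether the final answer matched the reference.
    sub_correct: list of bools, one per sub-question (hypothetical format).
    """
    grounded = all(sub_correct)  # assumed rule: every sub-question must be right
    if final_correct and grounded:
        return "consistent"
    if final_correct:
        return "coincidental"  # right final answer, faulty intermediate facts
    return "incorrect"


def summarize(results):
    """Return the fraction of queries that fall under each label."""
    counts = {"consistent": 0, "coincidental": 0, "incorrect": 0}
    for final_correct, sub_correct in results:
        counts[score_query(final_correct, sub_correct)] += 1
    total = len(results)
    if total == 0:
        return counts
    return {label: n / total for label, n in counts.items()}
```

Under final-answer-only evaluation, the "consistent" and "coincidental" groups would be indistinguishable. Reporting them separately is one way to expose the deficit the abstract says existing benchmarks obscure.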

Problem

Research questions and friction points this paper is trying to address.

spatio-temporal monitoring

video understanding

hallucination

object tracking

multimodal large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

object-centric reasoning

spatio-temporal monitoring

video large language models