Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a basic limitation of video large language models (Video-LLMs): near-chance accuracy on simple motion direction recognition, which the authors term “directional motion blindness.” Direction information remains linearly decodable from the vision encoder, projector, and LLM hidden states, but the models fail to bind it to the correct verbal answer option. To close this gap, the authors propose DeltaDirect, a projector-level training objective that predicts normalized 2-D motion vectors from adjacent-frame feature differences. They also introduce the MoDirect dataset family for instruction tuning and evaluation. Instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4% on MoDirect-SynBench and yields a 21.9-point gain over the vanilla baseline on MoDirect-RealBench without real-world tuning data, while preserving standard video-understanding performance.
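
The idea of predicting a normalized 2-D motion vector from inter-frame feature differences can be illustrated with a small NumPy sketch. This is not the authors' implementation: the linear head (`head_w`, `head_b`), the per-frame pooled features, and the squared-error loss are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def normalize_motion(displacement, eps=1e-8):
    """Scale 2-D displacement vectors to unit length (zero vectors stay zero)."""
    norm = np.linalg.norm(displacement, axis=-1, keepdims=True)
    return displacement / np.maximum(norm, eps)

def delta_direction_loss(frame_feats, head_w, head_b, displacement):
    """Auxiliary loss on adjacent-frame feature deltas (illustrative only).

    frame_feats:  (T, D) one pooled projector feature per frame (assumption)
    head_w:       (D, 2) weights of a hypothetical linear prediction head
    head_b:       (2,)   bias of that head
    displacement: (T-1, 2) ground-truth object displacement between frames
    """
    deltas = frame_feats[1:] - frame_feats[:-1]   # (T-1, D) feature differences
    pred = deltas @ head_w + head_b               # (T-1, 2) predicted vectors
    target = normalize_motion(displacement)       # unit-length direction targets
    # Mean squared error is an assumed loss choice, not taken from the paper.
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))
```

In a training setup along these lines, such a term would presumably be added to the usual instruction-tuning loss so that the projector output keeps an explicit, signed direction signal.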

📝 Abstract

Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, right, up, or down, most Video-LLMs perform near chance, with above-chance cases largely attributable to prediction biases rather than genuine direction understanding. We call this failure directional motion blindness. We localize the failure by tracing motion direction information through the Video-LLM pipeline. Motion direction remains linearly accessible from the vision encoder, projector, and LLM hidden states, but the readout fails to bind this signal to the correct verbal answer option, revealing a direction binding gap. Although synthetic motion direction instruction tuning reduces this gap on the source domain, motion direction concept vector analysis shows that visual complexity weakens the signal magnitude and limits out-of-domain generalization. We introduce MoDirect, a dataset family for motion direction instruction tuning and evaluation, and DeltaDirect, a diagnosis-driven, projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. On MoDirect-SynBench, instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4%. On MoDirect-RealBench, DeltaDirect improves real-world motion direction accuracy by 21.9 points over the vanilla baseline without real-world tuning data, while preserving standard video-understanding performance. Code: https://github.com/KHU-VLL/DeltaDirect
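
The claim that motion direction is "linearly accessible" from intermediate features is the kind of statement usually checked with a linear probe. The sketch below shows one way such a probe could look, using ridge-regularized least squares on one-hot direction labels. Everything here is an assumption for illustration: the synthetic features stand in for real encoder, projector, or LLM hidden states, and the helper names are hypothetical rather than taken from the paper or its code.

```python
import numpy as np

DIRECTIONS = ["left", "right", "up", "down"]

def _with_bias(feats):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([feats, np.ones((feats.shape[0], 1))])

def fit_linear_probe(feats, labels, n_classes=4, ridge=1e-3):
    """Fit a ridge-regularized least-squares probe on one-hot labels."""
    X = _with_bias(feats)
    Y = np.eye(n_classes)[labels]
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)            # (D+1, n_classes)

def probe_accuracy(W, feats, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    scores = _with_bias(feats) @ W
    return float(np.mean(np.argmax(scores, axis=1) == labels))

def make_synthetic_states(n, dim=16, noise=0.1, seed=0):
    """Toy stand-in for hidden states: one prototype per direction plus noise."""
    rng = np.random.default_rng(seed)
    prototypes = rng.normal(size=(len(DIRECTIONS), dim))
    labels = rng.integers(0, len(DIRECTIONS), size=n)
    feats = prototypes[labels] + noise * rng.normal(size=(n, dim))
    return feats, labels

# Usage: fit on one split, evaluate on a held-out split, compare with chance (0.25).
feats, labels = make_synthetic_states(400)
W = fit_linear_probe(feats[:300], labels[:300])
held_out_acc = probe_accuracy(W, feats[300:], labels[300:])
```

A probe accuracy well above the four-way chance level of 25%, combined with near-chance accuracy in the model's verbal answers, is the pattern the abstract describes as a direction binding gap.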

Problem

Research questions and friction points this paper is trying to address.

directional motion blindness

Video-LLMs

motion direction

temporal video understanding

visual perception

Innovation

Methods, ideas, or system contributions that make the work stand out.

directional motion blindness

DeltaDirect

motion direction binding