LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) exhibit significant limitations in 3D spatial understanding and long-video comprehension. To address this, we propose LAST, a unified framework that jointly enhances both capabilities for the first time. LAST introduces learnable spatio-temporal visual reasoning trajectories—enabling deep 3D structural and temporal dynamic inference from 2D image inputs alone—without requiring specialized architectures. It supports both zero-shot transfer and trajectory-supervised fine-tuning. The trajectories are constructed via 3D geometric modeling and hierarchical temporal modeling, ensuring compatibility with general-purpose VLMs and large multimodal models (e.g., GPT-4o, Qwen2.5-VL) for collaborative reasoning. Experiments demonstrate consistent performance gains across three spatial, four video, and three image benchmarks: zero-shot GPT-4o achieves +15.8% on EgoSchema, and Qwen2.5-VL-7B improves by +8.3% on VSI-Bench.

📝 Abstract
Humans can perceive and understand 3D space and long videos from sequential visual observations. But can vision-language models (VLMs) do the same? Recent work demonstrates that even state-of-the-art VLMs still struggle to understand 3D space and long videos, although they are powerful in typical vision-language tasks. Current methods often rely on specialized architectural designs to improve performance on 3D tasks and video understanding tasks separately. In contrast, we propose LAST, short for LeArn to Think in Space and Time, to jointly improve 3D spatial and long video understanding for general VLMs with only a set of 2D images as inputs. LAST makes VLMs think in space and time rather than only with text before giving the final answer, building visual thinking trajectories in 3D space and the temporal dimension. We demonstrate the effectiveness of LAST in two scenarios: 1) zero-shot, where we directly prompt proprietary models; and 2) fine-tuning general VLMs with data that include thinking trajectories in 3D space and time. We show that LAST brings substantial gains on various benchmarks, including 3 spatial understanding, 4 video understanding, and 3 image understanding tasks. Notably, it achieves 15.8% gains on EgoSchema with GPT-4o in a zero-shot manner and 8.3% gains on VSI-Bench compared with Qwen2.5-VL-7B.
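
As a rough illustration of the zero-shot scenario described in the abstract, the sketch below prompts GPT-4o through the OpenAI Python SDK with a few sampled 2D frames and an instruction to reason in space and time before answering. The instruction wording, frame paths, and question are assumptions made for illustration; the paper's actual prompt is not reproduced here.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_frame(path: str) -> str:
    """Base64-encode a sampled video frame for the chat API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# A handful of 2D frames sampled from a long video (placeholder paths).
frame_paths = ["frame_000.jpg", "frame_050.jpg", "frame_100.jpg", "frame_150.jpg"]

# Illustrative spatio-temporal "thinking" instruction; the paper's exact prompt may differ.
instruction = (
    "Before answering, think step by step in space and time: "
    "1) reconstruct the rough 3D layout of the scene from the frames, "
    "2) track how objects and the camera move across frames, "
    "3) then give the final answer."
)

question = "Which room did the person enter after leaving the kitchen?"

# Build a multimodal message: the instruction and question as text, the frames as images.
content = [{"type": "text", "text": instruction + "\n\nQuestion: " + question}]
for path in frame_paths:
    content.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{encode_frame(path)}"},
    })

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```

The same pattern would apply to other general-purpose VLMs mentioned in the summary (e.g., Qwen2.5-VL), with the API swapped for that model's interface.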
Problem

Research questions and friction points this paper is trying to address.

Improving 3D spatial understanding in vision-language models using 2D images
Enhancing long video comprehension for generalist vision-language models
Building visual thinking trajectories in spatial and temporal dimensions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enables 3D and video understanding with 2D images
Builds visual thinking trajectories in space-time
Improves VLMs via zero-shot prompting and fine-tuning
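
For the fine-tuning route, a trajectory-supervised training sample might be structured roughly as in the sketch below. The field names and the wording of the thinking trajectory are hypothetical; the summary does not spell out LAST's actual data format.

```python
import json

# Hypothetical structure of one trajectory-supervised training sample:
# sampled 2D frames, the question, a spatio-temporal thinking trajectory,
# and the final answer the VLM is fine-tuned to produce.
sample = {
    "frames": [
        "clip_0213/frame_000.jpg",
        "clip_0213/frame_060.jpg",
        "clip_0213/frame_120.jpg",
    ],
    "question": "How many chairs are between the table and the window?",
    "thinking_trajectory": (
        "<think>"
        "Spatial: the table sits near the room center and the window is on the far wall; "
        "two chairs are visible in the gap between them. "
        "Temporal: the camera pans left to right, so both chairs appear by frame 2 "
        "and remain in place in frame 3."
        "</think>"
    ),
    "answer": "2",
}

# The supervision target concatenates the trajectory and the answer,
# so the model learns to think in space and time before responding.
target = sample["thinking_trajectory"] + " " + sample["answer"]
print(json.dumps(sample, indent=2))
```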