TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding

📅 2026-02-23
🤖 AI Summary
Existing large vision-language models struggle to emulate human visual attention trajectories, which leaves generated descriptions weakly aligned with image regions and limits interpretability. To address this, the work proposes an end-to-end trajectory-aware vision-language model built on a novel Trajectory-aware Visual Perception (TVP) module, geometrically simplified keypoint extraction, and a three-stage training strategy. The approach is further extended to trajectory-guided segmentation and video temporal understanding. In addition, the newly constructed RILN dataset strengthens the model's logical reasoning. The proposed method achieves state-of-the-art performance on trajectory-guided captioning, text-guided trajectory prediction, and region understanding with segmentation, laying a foundation for human-like spatial comprehension and interpretable visual interaction.

📝 Abstract
Recent Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in image understanding and natural language generation. However, current approaches focus predominantly on global image understanding, struggling to simulate human visual attention trajectories and explain associations between descriptions and specific regions. We propose TraceVision, a unified vision-language model integrating trajectory-aware spatial understanding in an end-to-end framework. TraceVision employs a Trajectory-aware Visual Perception (TVP) module for bidirectional fusion of visual features and trajectory information. We design geometric simplification to extract semantic keypoints from raw trajectories and propose a three-stage training pipeline where trajectories guide description generation and region localization. We extend TraceVision to trajectory-guided segmentation and video scene understanding, enabling cross-frame tracking and temporal attention analysis. We construct the Reasoning-based Interactive Localized Narratives (RILN) dataset to enhance logical reasoning and interpretability. Extensive experiments on trajectory-guided captioning, text-guided trajectory prediction, understanding, and segmentation demonstrate that TraceVision achieves state-of-the-art performance, establishing a foundation for intuitive spatial interaction and interpretable visual understanding.
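
The abstract mentions geometric simplification of raw trajectories into semantic keypoints but does not name the algorithm. As a rough illustration only, the sketch below uses Ramer-Douglas-Peucker simplification, a standard polyline-reduction technique swapped in here as an assumption and not confirmed as the authors' method. The function names, the `epsilon` tolerance, and the sample trace are all hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify_trajectory(points, epsilon):
    """Ramer-Douglas-Peucker: keep only points that deviate more than
    epsilon from the chord joining the endpoints of their sub-path."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    max_dist, max_idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > max_dist:
            max_dist, max_idx = d, i
    if max_dist <= epsilon:
        # Every interior point is close to the chord: drop them all.
        return [first, last]
    # Otherwise split at the farthest point and recurse on both halves.
    left = simplify_trajectory(points[: max_idx + 1], epsilon)
    right = simplify_trajectory(points[max_idx:], epsilon)
    return left[:-1] + right  # avoid duplicating the split point


# Hypothetical mouse trace: a slightly noisy L-shaped path.
trace = [(0.0, 0.0), (1.0, 0.01), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
keypoints = simplify_trajectory(trace, epsilon=0.1)
```

With this trace and tolerance, the jitter near `(1.0, 0.01)` and the collinear midpoint `(2.0, 1.0)` are dropped, leaving the two endpoints and the corner. A larger `epsilon` would keep fewer keypoints; how the paper selects or post-processes keypoints is not stated in the abstract.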
Problem

Research questions and friction points this paper is trying to address.

visual attention trajectory
spatial understanding
vision-language model
region-description association
human-like perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

trajectory-aware
vision-language model
visual attention trajectory
geometric simplification
interpretable visual understanding
Fan Yang
CASIA&PCL
Domain adaption, Object detection
Shurong Zheng
Foundation Model Research Center, Institute of Automation; Peng Cheng Laboratory, Shenzhen, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Hongyin Zhao
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Yufei Zhan
Institute of Automation, Chinese Academy of Sciences
Computer Vision, Large Multimodal Models, Grounding and Detection
Xin Li
Pengcheng Laboratory
Computer Vision, Machine Learning
Yousong Zhu
Associate Professor, Chinese Academy of Sciences, Institute of Automation
Multimodal Large Language Models, Self-supervised Learning, Object Detection
Chaoyang Zhao
Institute of Automation, Chinese Academy of Sciences
computer vision
Ming Tang
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Jinqiao Wang
Foundation Model Research Center, Institute of Automation; Peng Cheng Laboratory, Shenzhen, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Wuhan AI Research, Wuhan, China