Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of current large vision-language models (LVLMs) in fine-grained spatial localization and explicit spatial reasoning. The authors propose an approach that integrates explicit 2D and 3D spatial tokens into the autoregressive LVLM inference pipeline: the model first generates semantic segmentation masks (based on SAM2) and depth tokens (derived from a VQ-VAE) before answering questions. This framework establishes a spatial chain-of-thought through a composite depth-token objective, a soft fusion-based reconstruction mechanism, multi-task collaborative training, and a newly designed depth-aware loss function. Experimental results demonstrate consistent improvements across multiple benchmarks: cIoU scores increase by 0.8, 1.4, and 1.1 on RefCOCO, RefCOCO+, and RefCOCOg, respectively; spatial understanding accuracy on HardBLINK improves by 10.3%; and overall performance on MMBench rises by 1.0%.
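The "spatial tokens first, answer second" ordering described above can be sketched as a sequence-layout helper. This is a minimal illustration assuming bracketing marker tokens such as `<SEG_START>`; the paper's actual vocabulary and token names are not specified here.

```python
# Hypothetical sketch of the spatial chain-of-thought sequence layout:
# the LVLM first emits segmentation tokens and depth tokens, then the
# textual answer. Marker token names are illustrative assumptions.

def build_spatial_cot_sequence(seg_tokens, depth_tokens, answer_tokens):
    """Place spatial tokens before the answer, bracketed by markers."""
    return (
        ["<SEG_START>"] + list(seg_tokens) + ["<SEG_END>"]
        + ["<DEPTH_START>"] + list(depth_tokens) + ["<DEPTH_END>"]
        + list(answer_tokens)
    )

seq = build_spatial_cot_sequence(
    seg_tokens=["m1", "m2"],          # segmentation mask tokens
    depth_tokens=["d1", "d2", "d3"],  # VQ-VAE depth tokens
    answer_tokens=["The", "cup"],     # final textual answer
)
```

Because the answer is conditioned on the already-emitted spatial tokens, the autoregressive decoder can attend to an explicit spatial interpretation rather than inferring geometry implicitly.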

📝 Abstract
Large Vision-Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth-token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
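The VQ-VAE depth tokenization and the soft-merging reconstruction can be sketched in a few lines of numpy. This is a hedged illustration under assumed shapes and an assumed softmax-over-distances weighting; the paper's actual codebook size, embedding dimension, and merging rule are not given in this summary.

```python
import numpy as np

# Illustrative sketch: quantize depth-patch embeddings against a learned
# codebook (VQ-VAE style) to obtain discrete depth tokens, and reconstruct
# via a softmax-weighted ("soft-merged") mixture of codebook entries so the
# reconstruction stays differentiable. All shapes are assumptions.

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))   # 16 codes, 8-dim embeddings
patches = rng.normal(size=(5, 8))     # 5 depth-patch embeddings

# Squared distances between every patch and every codebook entry: (5, 16).
d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)

# Hard tokenization: nearest codebook entry per patch -> discrete tokens.
tokens = d2.argmin(axis=1)

# Soft merging: softmax over negative distances gives differentiable
# mixture weights; temperature tau is an illustrative hyperparameter.
tau = 0.5
w = np.exp(-d2 / tau)
w /= w.sum(axis=1, keepdims=True)
soft_recon = w @ codebook             # (5, 8) soft reconstruction
```

At low temperature the soft reconstruction approaches the hard nearest-code assignment, while remaining differentiable for end-to-end training.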
Problem

Research questions and friction points this paper is trying to address.

spatial grounding
vision language models
fine-grained perception
spatial reasoning
semantic understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial token generation
perception-enhanced LVLM
VQ-VAE depth tokenization
semantic segmentation tokens
composite depth-token objectives
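The composite depth-token objective listed above (marker, token, and count losses) can be sketched as a weighted sum. The weights, the exact count penalty, and the function signature here are all illustrative assumptions, not the paper's formulation.

```python
import numpy as np

# Hedged sketch of a composite depth-token objective: token-level
# cross-entropy on depth tokens, cross-entropy on the bracketing marker
# tokens, and an absolute-difference penalty on the number of emitted
# depth tokens. Weights are illustrative.

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of target indices under softmax(logits)."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def composite_depth_loss(tok_logits, tok_targets,
                         mark_logits, mark_targets,
                         n_pred, n_true,
                         w_tok=1.0, w_mark=0.5, w_count=0.1):
    l_tok = cross_entropy(tok_logits, tok_targets)     # token loss
    l_mark = cross_entropy(mark_logits, mark_targets)  # marker loss
    l_count = abs(n_pred - n_true)                     # count loss
    return w_tok * l_tok + w_mark * l_mark + w_count * l_count
```

The count term penalizes emitting too many or too few depth tokens, which is one plausible way such an objective could stabilize variable-length depth-token generation.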