Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models (VLMs) report improved accuracy after incorporating continuous (latent) thought tokens, but it remains unclear whether these tokens genuinely support reasoning or whether the gains come from confounds such as increased context length. To resolve this ambiguity, the work introduces the Ablate-to-Validate diagnostic principle and instantiates it as a standardized Token Replacement Test (TRT). TRT replaces the content of intermediate tokens with zero, random, first-repeat, or oracle alternatives while holding the prompt, image, token budget, and decoding fixed, thereby distinguishing the mere presence of a latent reasoning channel from its actual use. Experiments with LLaVA-13B and Qwen2.5-VL-3B, frozen visual encoders (SigLIP2, CLIP, DINOv2), and three off-the-shelf visual-thinking systems (Mirage, Mull-Tokens, CoVT) evaluated on BLINK, VSP, and CV-Bench show that most of the accuracy gain persists even when token content is corrupted or replaced, so accuracy gains alone are a misleading proxy for latent-token reasoning.
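The replacement modes named above can be pictured with a small sketch. This is not the authors' implementation: the function name `replace_latent_tokens`, the list-of-lists token representation, the reading of "repetition" as repeating the first token of the span, and the use of unit Gaussian noise for the random mode are all assumptions made for illustration. The one property taken from the summary is that every mode returns the same number of tokens with the same dimensionality, so only content changes.

```python
import random
from typing import List, Optional

Vector = List[float]


def replace_latent_tokens(
    tokens: List[Vector],
    mode: str,
    oracle: Optional[List[Vector]] = None,
    seed: int = 0,
) -> List[Vector]:
    """Return a replacement span with the same length and width as `tokens`.

    Hypothetical helper: the modes mirror the zero / random / repeat / oracle
    ablations described in the text; details are illustrative assumptions.
    """
    n = len(tokens)
    dim = len(tokens[0]) if n else 0

    if mode == "original":
        # No ablation: copy the predicted latent tokens unchanged.
        return [list(t) for t in tokens]
    if mode == "zero":
        # Remove all content but keep the token slots.
        return [[0.0] * dim for _ in range(n)]
    if mode == "random":
        # Assumption: unit Gaussian noise; the source does not name a distribution.
        rng = random.Random(seed)
        return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    if mode == "first_repeat":
        # Assumption: every slot is filled with a copy of the first token.
        return [list(tokens[0]) for _ in range(n)]
    if mode == "oracle":
        # Ground-truth content supplied by the caller, checked for matching shape.
        if oracle is None or len(oracle) != n or any(len(v) != dim for v in oracle):
            raise ValueError("oracle must match the shape of the original span")
        return [list(v) for v in oracle]
    raise ValueError(f"unknown mode: {mode}")
```

In an evaluation loop, the same prompt, image, and decoding settings would be reused for each mode, with only the returned span swapped in before the model produces its answer.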

📝 Abstract

Vision-language models (VLMs) are increasingly augmented with continuous or latent non-textual tokens intended to support "visual thinking." Improved task accuracy alone, however, does not show that models actually use these tokens for reasoning: gains may arise from confounds such as added context length, special-token anchoring, or training-time regularization. We formalize a diagnostic principle, Ablate-to-Validate, for testing whether latent-token content is genuinely utilized, and instantiate it as the Token Replacement Test (TRT), a standardized suite of content-replacement ablations. TRT holds the prompt, image, token budget, and decoding fixed while replacing intermediate tokens with zero, random, first-repeat, or oracle alternatives, isolating whether performance depends on token content or merely on token presence. As a controlled testbed, we study relative depth reasoning with LLaVA-13B and Qwen2.5-VL-3B, training models to predict and consume continuous or discrete depth spans across multiple frozen encoders (SigLIP2, CLIP, DINOv2) and token budgets. We additionally apply TRT to three off-the-shelf visual-thinking systems (Mirage, Mull-Tokens, CoVT) on BLINK, VSP, and CV-Bench. Across all settings, accuracy gains are a misleading proxy for latent-token reasoning: VLMs retain most of their improvement even when token content is corrupted or replaced, revealing a persistent gap between having a latent channel and using it as an information bottleneck. We recommend TRT as a standard diagnostic alongside accuracy for any method introducing continuous thought tokens.
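One way to quantify "retaining most of the improvement" is the fraction of the accuracy gain that survives an ablation. The sketch below is an assumed formulation, not a metric taken from the paper: `gain_retained` and its argument names are hypothetical, and the numbers in the example call are made up purely to show the arithmetic.

```python
def gain_retained(baseline_acc: float, full_acc: float, ablated_acc: float) -> float:
    """Fraction of the latent-token accuracy gain that survives content replacement.

    baseline_acc: accuracy of the model without latent tokens
    full_acc:     accuracy with the model's own predicted latent tokens
    ablated_acc:  accuracy with token content replaced (zero, random, ...)

    A value near 1.0 means the gain does not depend on token content;
    a value near 0.0 means the gain disappears once content is corrupted.
    """
    gain = full_acc - baseline_acc
    if gain == 0:
        raise ValueError("no accuracy gain to attribute")
    return (ablated_acc - baseline_acc) / gain


# Made-up accuracies for illustration only, not results from the paper.
example = gain_retained(baseline_acc=0.60, full_acc=0.70, ablated_acc=0.68)
print(f"fraction of gain retained: {example:.2f}")
```

Under this reading, a method whose gain truly flows through the latent channel should score close to zero for the zero and random modes, while an oracle substitution should match or exceed the unablated accuracy.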

Problem

Research questions and friction points this paper is trying to address.

vision-language models

continuous thought tokens

latent tokens

reasoning

model interpretation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Ablate-to-Validate

Token Replacement Test

vision-language models