Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?

📅 2026-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) report improved accuracy after incorporating continuous or latent thought tokens, but it remains unclear whether these tokens genuinely support reasoning or whether the gains come from confounds such as added context length, special-token anchoring, or training-time regularization. To resolve this ambiguity, the work introduces the Ablate-to-Validate diagnostic principle and a standardized Token Replacement Test (TRT). TRT replaces the content of intermediate tokens (with zero, random, first-repeat, or oracle alternatives) while holding the prompt, image, token budget, and decoding fixed, thereby separating the mere presence of a latent channel from its actual use. Experiments on relative depth reasoning with LLaVA-13B and Qwen2.5-VL-3B across frozen encoders (SigLIP2, CLIP, DINOv2), and on three off-the-shelf visual-thinking systems (Mirage, Mull-Tokens, CoVT) evaluated on BLINK, VSP, and CV-Bench, show that models retain most of their improvement even when token content is corrupted or replaced, so accuracy gains are a misleading proxy for latent-token reasoning.
📝 Abstract
Vision-language models (VLMs) are increasingly augmented with continuous or latent non-textual tokens intended to support "visual thinking." Improved task accuracy alone, however, does not show that models actually use these tokens for reasoning: gains may arise from confounds such as added context length, special-token anchoring, or training-time regularization. We formalize a diagnostic principle, Ablate-to-Validate, for testing whether latent-token content is genuinely utilized, and instantiate it as the Token Replacement Test (TRT), a standardized suite of content-replacement ablations. TRT holds the prompt, image, token budget, and decoding fixed while replacing intermediate tokens with zero, random, first-repeat, or oracle alternatives, isolating whether performance depends on token content or merely on token presence. As a controlled testbed, we study relative depth reasoning with LLaVA-13B and Qwen2.5-VL-3B, training models to predict and consume continuous or discrete depth spans across multiple frozen encoders (SigLIP2, CLIP, DINOv2) and token budgets. We additionally apply TRT to three off-the-shelf visual-thinking systems (Mirage, Mull-Tokens, CoVT) on BLINK, VSP, and CV-Bench. Across all settings, accuracy gains are a misleading proxy for latent-token reasoning: VLMs retain most improvement even when token content is corrupted or replaced, revealing a persistent gap between having a latent channel and using it as an information bottleneck. We recommend TRT as a standard diagnostic alongside accuracy for any method introducing continuous thought tokens.
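The replacement conditions named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `replace_tokens` and `retained_gain`, the list-of-lists token representation, the standard-normal noise for the random condition, and the reading of "first-repeat" as copying the first token into every position are all assumptions made for illustration. The one property taken from the abstract is that every condition keeps the token count and dimensionality fixed, so only the content changes.

```python
import random
from typing import List, Optional

Tokens = List[List[float]]  # one embedding vector per latent token (assumed layout)


def replace_tokens(latents: Tokens, mode: str,
                   oracle: Optional[Tokens] = None, seed: int = 0) -> Tokens:
    """Return a same-shape replacement for `latents` under one ablation condition.

    Hypothetical helper: mode names and exact semantics are assumptions,
    not taken from the paper's code.
    """
    n = len(latents)
    d = len(latents[0]) if n else 0
    if mode == "original":
        # Unmodified tokens, copied so callers cannot mutate the input.
        return [row[:] for row in latents]
    if mode == "zero":
        # All content removed; token count is preserved.
        return [[0.0] * d for _ in range(n)]
    if mode == "random":
        # Assumption: i.i.d. standard-normal noise; a real test might
        # instead match the statistics of the original tokens.
        rng = random.Random(seed)
        return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    if mode == "first_repeat":
        # Assumption: the first token is repeated at every position.
        return [latents[0][:] for _ in range(n)]
    if mode == "oracle":
        # Ground-truth tokens supplied by the caller, same shape required.
        if oracle is None or len(oracle) != n or any(len(r) != d for r in oracle):
            raise ValueError("oracle tokens must match the latent shape")
        return [row[:] for row in oracle]
    raise ValueError(f"unknown mode: {mode}")


def retained_gain(acc_baseline: float, acc_original: float,
                  acc_replaced: float) -> float:
    """Fraction of the accuracy gain that survives content replacement.

    Hypothetical summary statistic: a value near 1 means the gain does not
    depend on token content; a value near 0 means it does.
    """
    gain = acc_original - acc_baseline
    if gain == 0:
        raise ValueError("no gain over baseline to attribute")
    return (acc_replaced - acc_baseline) / gain
```

In use, each replaced sequence would be fed back to the model at the same positions as the original latent tokens, with the prompt, image, and decoding settings unchanged, and accuracy would be compared across conditions.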
Problem

Research questions and friction points this paper is trying to address.

vision-language models
continuous thought tokens
latent tokens
reasoning
model interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ablate-to-Validate
Token Replacement Test
vision-language models
continuous thought tokens
latent token reasoning