🤖 AI Summary
This work addresses the underuse of visual latents in current latent visual reasoning methods for vision-language models: although these latents become semantically rich during training, their contribution to the final answer is systematically suppressed. The study names this phenomenon “Silenced Visual Latents” and introduces an inference-time optimization mechanism that requires no updates to the backbone parameters. Stage I warms up the latents with query-guided contrastive latent-visual alignment, improving their semantic quality and preventing latent collapse; Stage II applies a confidence-progression reward that routes predictions through the latent reasoning rather than around it. Evaluated on eight benchmarks and four model backbones, the approach unlocks the previously suppressed reasoning capacity of visual latents.
📝 Abstract
Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in existing latent visual reasoning methods: although visual latents become semantically enriched during training, their contribution to final answer prediction is systematically suppressed. Within the shared parameter space, the autoregressive objective favors shortcut reliance on direct visual input, driving latent tokens toward transition-like states rather than informative reasoning content. We term this phenomenon Silenced Visual Latents. To address it, we disentangle the two conflicting objectives by directly optimizing the latent reasoning at inference time, keeping backbone parameters frozen. In Stage I, visual latents are warmed up via query-guided contrastive latent-visual alignment, improving semantic quality while preventing latent collapse. In Stage II, the latent reasoning is further optimized via a confidence-progression reward, which incentivizes predicted token distributions along the latent span to become progressively more concentrated, routing predictions through the latent reasoning rather than bypassing it. Experiments across eight benchmarks and four model backbones show that inference-time latent optimization, without any parameter updates, effectively unleashes the suppressed reasoning capacity of visual latents.
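The two stages can be illustrated with a minimal, dependency-free sketch. The abstract does not give the exact objectives, so everything below is an assumption made for illustration: Stage I is approximated by an InfoNCE-style loss in which the positive is a visual feature assumed to be selected by the query, and Stage II by the mean entropy drop of the predicted token distributions along the latent span. The function names and the temperature value are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a more concentrated distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cosine(a, b):
    """Cosine similarity, with a small epsilon guarding against zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def contrastive_alignment_loss(latent, positive, negatives, temperature=0.1):
    """Stage I stand-in (assumed form): InfoNCE between one latent and visual features.

    `positive` is the visual feature assumed to be relevant to the query;
    `negatives` are the remaining visual features.
    """
    sims = [cosine(latent, positive)] + [cosine(latent, n) for n in negatives]
    probs = softmax([s / temperature for s in sims])
    return -math.log(probs[0])


def confidence_progression_reward(step_logits):
    """Stage II stand-in (assumed form): mean entropy drop between
    consecutive positions of the latent span.

    Positive when the predicted distributions grow more concentrated overall.
    """
    ents = [entropy(softmax(logits)) for logits in step_logits]
    if len(ents) < 2:
        return 0.0
    drops = [ents[i] - ents[i + 1] for i in range(len(ents) - 1)]
    return sum(drops) / len(drops)


if __name__ == "__main__":
    # A latent aligned with the query-relevant feature gets a lower loss.
    print(contrastive_alignment_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]]))
    # Logits that sharpen along the latent span earn a positive reward.
    print(confidence_progression_reward([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]]))
```

In an inference-time setting such as the one described, quantities like these would be computed from a frozen backbone's outputs and used to update only the latent states. The update rule itself is not specified in the abstract and is left out here.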