🤖 AI Summary
Existing latent-space visual reasoning methods rely largely on visual supervision, so their latent representations lack semantic richness and struggle to support diverse region-level reasoning tasks. To address this, the work proposes SLVR, a two-stage framework. The first stage learns semantically enriched, region-centric latent representations under fine-grained attribute supervision. The second stage uses Multi-query Group Relative Policy Optimization (M-GRPO) to align the latent representations across multiple queries grounded in the same region. The authors also construct SLV-Set, a dataset of approximately 400K region-level attribute annotations and 800K multi-query question answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. In the reported experiments, SLVR improves semantic consistency and reasoning robustness over existing baselines.
📝 Abstract
Multimodal latent-space reasoning aims to replace explicit thinking with images by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this framework, we construct SLV-Set, comprising approximately 400K region-level attribute annotations and 800K multi-query question answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments demonstrate that SLVR improves the robustness and semantic consistency of latent visual reasoning compared to existing baselines.
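The abstract does not spell out how M-GRPO forms its groups or rewards, so the following is only a minimal sketch of the general idea, not the authors' method. It shows the standard group-relative advantage used in GRPO-style training (each reward normalized by the mean and standard deviation of its group), plus one hypothetical way to extend it to several queries about the same region by pooling their rewards into a single group. The function names, the pooling choice, the query keys, and the reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def multi_query_advantages(rewards_by_query, eps=1e-8):
    """Hypothetical multi-query variant (an assumption, not from the paper).

    Rewards for sampled responses to every query about the same region
    are pooled into one group, so each response is scored relative to
    all queries on that region rather than only its own query.
    """
    pooled = [r for rs in rewards_by_query.values() for r in rs]
    mu = mean(pooled)
    sigma = pstdev(pooled)
    return {
        query: [(r - mu) / (sigma + eps) for r in rs]
        for query, rs in rewards_by_query.items()
    }


# Illustrative rewards for two queries about one region (made-up values).
example = {"color": [1.0, 0.0], "shape": [1.0, 1.0]}
advantages = multi_query_advantages(example)
```

With pooled normalization, a response that fails on one query receives a negative advantage relative to successful responses on the other queries for the same region, which is one plausible way to encourage a latent that serves all of them. Per-query normalization with `group_relative_advantages` would be the alternative design.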