Semantic-Enriched Latent Visual Reasoning

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing latent-space visual reasoning methods rely largely on visual supervision, so their latent representations are semantically impoverished and support diverse region-level reasoning tasks poorly. To address this, the work proposes SLVR, a two-stage framework. The first stage learns semantically rich, region-centric latent representations under fine-grained attribute supervision. The second stage uses Multi-query Group Relative Policy Optimization (M-GRPO) to align the latent representations produced for multiple queries grounded in the same region. The work integrates attribute-level semantics into latent-space reasoning and contributes a multi-query alignment mechanism, the SLV-Set dataset (approximately 400K region-level attribute annotations and 800K multi-query question answering samples), and the SV-QA benchmark, which evaluates latent reasoning under semantic variation. Experiments show that SLVR improves semantic consistency and reasoning robustness over existing baselines.

📝 Abstract

Multimodal latent-space reasoning aims to replace explicit thinking with images by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this framework, we construct SLV-Set, comprising approximately 400K region-level attribute annotations and 800K multi-query question answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments demonstrate that SLVR improves the robustness and semantic consistency of latent visual reasoning compared to existing baselines.
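The abstract does not spell out how M-GRPO computes its update, so the following is only a minimal sketch of the group-relative normalization that GRPO-style methods generally use, extended under an assumption to the multi-query setting. It assumes that each candidate latent for a region is scored on every query about that region, and that the per-query rewards are averaged before being normalized within the group. The function name `multi_query_advantages`, the averaging step, and the reward values are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def multi_query_advantages(rewards, eps=1e-8):
    """Group-relative advantages for candidates scored on several queries.

    rewards[i][j] is the reward of candidate latent i on query j, where all
    queries refer to the same image region (an assumption of this sketch).
    """
    # Assumption: a candidate's score is its mean reward across the queries.
    scores = [mean(row) for row in rewards]
    # Normalize within the group: subtract the group mean, divide by the
    # group standard deviation (eps avoids division by zero).
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


# Three candidate latents for one region, each scored on three queries.
rewards = [
    [1.0, 1.0, 0.0],  # answers two of three queries correctly
    [0.0, 0.0, 0.0],  # answers none correctly
    [1.0, 1.0, 1.0],  # answers all correctly
]
advantages = multi_query_advantages(rewards)
```

Under this assumption, a latent that supports only some of the queries receives a lower advantage than one that supports all of them, which is one plausible way to encourage consistency across queries about the same region.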

Problem

Research questions and friction points this paper is trying to address.

latent visual reasoning

semantic enrichment

region-level reasoning

multimodal representation

visual semantics

Innovation

Methods, ideas, or system contributions that make the work stand out.

latent visual reasoning

semantic enrichment

region-level reasoning