Semantic-Enriched Latent Visual Reasoning

📅 2026-05-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing latent-space visual reasoning methods rely largely on visual supervision, so their latent representations are semantically impoverished and struggle to support diverse region-level reasoning tasks. This work proposes SLVR, a two-stage framework: the first stage learns semantically rich, region-centric latent representations under fine-grained attribute supervision, and the second stage uses Multi-query Group Relative Policy Optimization (M-GRPO) to align the latent representations produced by multiple queries grounded in the same region. The work integrates attribute-level semantics into latent-space reasoning and contributes the SLV-Set dataset (approximately 400K region-level attribute annotations and 800K multi-query question answering samples) along with SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments show that SLVR improves semantic consistency and reasoning robustness over existing baselines.
📝 Abstract
Multimodal latent-space reasoning aims to replace explicit thinking with images by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this framework, we construct SLV-Set, comprising approximately 400K region-level attribute annotations and 800K multi-query question answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments demonstrate that SLVR improves the robustness and semantic consistency of latent visual reasoning compared to existing baselines.
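As a rough illustration of the group-relative idea that M-GRPO builds on, the sketch below computes standard GRPO-style advantages, normalizing each sampled response's reward by the mean and standard deviation of its own group. The abstract does not specify how M-GRPO forms or combines groups across queries, so treating each query about the same region as its own group is an assumption made here, and the query strings, reward values, and function name are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards: one group of sampled responses per query,
# with every query grounded in the same image region (an assumption,
# not the paper's stated M-GRPO formulation).
rewards_by_query = {
    "what color is the object?": [1.0, 0.0, 1.0, 0.0],
    "what material is the object?": [0.0, 0.0, 1.0, 0.0],
}

advantages_by_query = {
    query: group_relative_advantages(rewards)
    for query, rewards in rewards_by_query.items()
}
```

Within each group the advantages sum to roughly zero, so a response is rewarded only for doing better than its siblings on the same query. How the per-query signals are then aligned across queries is the part specific to M-GRPO and is not reproduced here.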
Problem

Research questions and friction points this paper is trying to address.

latent visual reasoning
semantic enrichment
region-level reasoning
multimodal representation
visual semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent visual reasoning
semantic enrichment
region-level reasoning
multi-query alignment
attribute supervision
Tianrun Xu
Department of Automation, Tsinghua University, Beijing, China; Zhongguancun Academy, Beijing, China; WeChat Vision, Tencent Inc, Beijing, China
Yue Sun
China Agricultural University, Beijing, China
Qixun Wang
Peking University, Beijing, China
Jingyi Lu
Hong Kong University of Science and Technology
process control, optimization, model predictive control, iterative learning control
Yuan Wang
WeChat Vision, Tencent Inc, Beijing, China; Department of Electronic Engineering, Tsinghua University, Beijing, China
Tianren Zhang
Tsinghua University
Representation learning, Generalization, Learning theory, Reinforcement learning, Machine learning
Longteng Guo
Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Fengyun Rao
WeChat Vision, Tencent Inc, Beijing, China
Jing Lyu
Shanghai Jiao Tong University
Power electronics, stability, renewable energy grid integration, high-voltage dc transmission
Feng Chen
Southwest University, Chongqing, China
signal processing, distributed estimation, distributed signal processing, Point Cloud
Jing Liu
Institute of Theoretical Physics, Chinese Academy of Sciences
Statistical physics, machine learning