UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency and fragmentation of existing visual latent reasoning methods, which interleave explicit textual chain-of-thought with visual latent tokens. The authors propose UniVLR, a unified visual latent reasoning framework that drops the separate textual reasoning path at inference time. Textual reasoning traces are rendered together with auxiliary images into a shared visual workspace, and the model learns to compress that workspace into compact visual latent tokens, from which it decodes the final answer directly. Reasoning therefore requires neither external tool calls nor verbose text generation. On real-world perception and visual reasoning tasks, UniVLR outperforms prior visual latent reasoning methods while generating substantially fewer reasoning tokens.

📝 Abstract

Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reasoning fragmented across separate text and vision channels. We propose UniVLR, a unified visual latent reasoning framework that treats textual reasoning and auxiliary visual evidence as a shared visual workspace. Instead of preserving text CoT as an independent inference-time path, UniVLR renders reasoning traces together with auxiliary images and learns to compress this unified representation into compact visual latent tokens. At inference time, the model reasons only through visual latents and directly decodes the final answer, avoiding both external tool calls and verbose text reasoning. Experiments on real-world perception and visual reasoning tasks show that UniVLR outperforms prior visual latent reasoning methods while using substantially fewer generated reasoning tokens, suggesting a more unified and efficient paradigm for visual thinking in MLLMs.
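The data flow described in the abstract (render the reasoning trace next to the auxiliary image, then compress the combined canvas into a few latent tokens) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `stack_workspace`, `patchify`, and `compress` are hypothetical, images are plain nested lists of grayscale values, and simple average pooling stands in for whatever learned compressor UniVLR actually trains.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def stack_workspace(rendered_text: Image, auxiliary: Image) -> Image:
    """Stack the rendered reasoning trace above the auxiliary image.

    Assumption: both inputs are already rasterized to the same width.
    """
    if len(rendered_text[0]) != len(auxiliary[0]):
        raise ValueError("images must share the same width")
    return rendered_text + auxiliary


def patchify(canvas: Image, patch: int) -> List[List[float]]:
    """Split the canvas into non-overlapping patch x patch tiles.

    Tiles are visited row-major and each tile is flattened to one vector.
    """
    height, width = len(canvas), len(canvas[0])
    if height % patch or width % patch:
        raise ValueError("canvas size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append(
                [canvas[r][c]
                 for r in range(top, top + patch)
                 for c in range(left, left + patch)]
            )
    return patches


def compress(patches: List[List[float]], num_latents: int) -> List[List[float]]:
    """Average contiguous groups of patches into num_latents vectors.

    Assumption: average pooling is only a placeholder for a learned module.
    """
    if num_latents <= 0 or len(patches) % num_latents:
        raise ValueError("patch count must be a positive multiple of num_latents")
    group = len(patches) // num_latents
    dim = len(patches[0])
    latents = []
    for start in range(0, len(patches), group):
        chunk = patches[start:start + group]
        latents.append([sum(p[d] for p in chunk) / group for d in range(dim)])
    return latents


# Toy inputs: a 4x8 "rendered text" strip (all ones) and a 4x8 auxiliary image (all zeros).
rendered_text = [[1.0] * 8 for _ in range(4)]
auxiliary = [[0.0] * 8 for _ in range(4)]

workspace = stack_workspace(rendered_text, auxiliary)  # 8x8 shared canvas
patches = patchify(workspace, patch=4)                 # 4 patches of 16 values each
latents = compress(patches, num_latents=2)             # 2 compact latent vectors
print(len(patches), "patches ->", len(latents), "latents")
```

In this toy setup the four canvas patches collapse into two latent vectors, which mirrors the efficiency argument in the abstract only loosely: the reasoning content lives in a small, fixed number of latents rather than in a long generated text sequence.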

Problem

Research questions and friction points this paper is trying to address.

visual latent reasoning

multimodal LLMs

chain-of-thought

text-vision integration

reasoning efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual latent reasoning

multimodal LLMs

unified reasoning