🤖 AI Summary
Existing tool-augmented multimodal reasoning approaches suffer from high inference overhead, reliance on specialized supervision, and susceptibility to erroneous tool invocation. This work proposes Pearl, a framework inspired by JEPA that learns predictive embeddings of expert tool-use trajectories directly in latent space, eliminating the need for explicit tool calls during inference and thereby preserving the standard vision-language generation pipeline. Pearl eschews reconstructive latent reasoning, supports multi-step tool usage, avoids train-inference mismatch, and is both model-agnostic and training-efficient. Experiments demonstrate that Pearl matches or surpasses current supervised fine-tuning and reconstruction-based methods across multiple perception benchmarks, further revealing that reconstruction-based approaches effectively learn embeddings rather than performing genuine image editing.
📝 Abstract
Tool-augmented multimodal reasoning enables vision-language models (VLMs) to improve perception by interacting with external tools (e.g., cropping, depth estimation). However, such approaches incur substantial inference overhead, require specialized supervision, and are prone to erroneous tool calls. We propose Pearl (Predictive Embedding Alignment for Reasoning in Latent space), a JEPA-inspired framework that learns from expert tool-use trajectories entirely in the latent space, eliminating the need for explicit tool invocation at inference time. Unlike reconstruction-based latent reasoning methods, which autoregressively generate latent tokens and suffer from training-inference mismatch and limited support for multi-step tool use, Pearl directly learns predictive embeddings from multimodal trajectories while preserving the standard vision-language generation pipeline: it is model-agnostic, simple to train, and naturally supports trajectories with multiple tool calls. Experiments across multiple perception benchmarks show that Pearl matches or outperforms standard supervised fine-tuning and reconstruction-based latent reasoning approaches. Furthermore, we provide empirical evidence that reconstruction-based methods primarily learn embeddings rather than image edits in latent space, motivating predictive embedding learning as a more principled alternative.
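The abstract does not spell out Pearl's training objective, but the core idea of a JEPA-style predictive embedding alignment can be sketched minimally: a predictor maps the model's latent context to a predicted embedding, which is aligned with the embedding of the expert tool-use trajectory. Everything below (the linear predictor `W`, the toy embeddings, the cosine-alignment loss) is an illustrative assumption, not the paper's actual implementation.

```python
import numpy as np

# Toy setup: d-dimensional latent embeddings (illustrative, not the paper's).
rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d)) / np.sqrt(d)   # hypothetical linear predictor
h_context = rng.normal(size=d)             # latent state after reading the query
z_target = rng.normal(size=d)              # embedding of the expert trajectory

def predict(W, h):
    """Predict the expert-trajectory embedding from the latent context."""
    return W @ h

def alignment_loss(z_pred, z_tgt):
    """1 - cosine similarity: minimized when the embeddings align in direction."""
    z_pred = z_pred / np.linalg.norm(z_pred)
    z_tgt = z_tgt / np.linalg.norm(z_tgt)
    return 1.0 - float(z_pred @ z_tgt)

loss = alignment_loss(predict(W, h_context), z_target)
```

Because no image is reconstructed, no latent tokens are generated autoregressively, and no tool is ever called, an objective of this shape leaves the standard generation pipeline untouched at inference time, which is the property the abstract emphasizes.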