Mull-Tokens: Modality-Agnostic Latent Thinking

📅 2025-12-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing multimodal reasoning models rely on tool invocation, image generation, or manually constructed reasoning trajectories, resulting in rigid modality switching and poor scalability. This work proposes modality-agnostic latent thinking units, "Mull-Tokens," that hold intermediate reasoning states from either the textual or visual modality in a shared latent space, enabling flexible cross-modal reasoning. The tokens are first pre-trained with supervision from interleaved vision-language reasoning traces, then fine-tuned end-to-end using only final-answer signals, so inference requires no external tools, image generation, or annotated reasoning paths. Evaluated on four spatial reasoning benchmarks, the approach achieves a +3% average improvement and up to +16% on a reasoning-heavy puzzle-solving split, outperforming both text-only and interleaved vision-language baselines. To the authors' knowledge, this is the first lightweight, general-purpose, end-to-end trainable framework for modality-agnostic multimodal reasoning.

📝 Abstract
Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.
Problem

Research questions and friction points this paper is trying to address.

Enhances multimodal reasoning beyond language limitations
Improves spatial reasoning in puzzles and perspective tasks
Trains modality-agnostic tokens for flexible intermediate thinking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-agnostic latent tokens for flexible intermediate reasoning
Pre-trained with supervised interleaved text-image traces
Fine-tuned with answer-only signals, requiring no annotated reasoning traces
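The mechanism the bullets describe, a shared bank of latent embeddings spliced into the input sequence as a free-form scratchpad, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names (`mull_bank`, `build_input`), the sizes, and the splicing position are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 8   # hidden size (illustrative)
N_MULL = 4    # number of latent thinking tokens (illustrative)

# Shared, learnable latent bank: the same tokens are used whether the
# intermediate "thought" would have been textual or visual, which is what
# makes the scratchpad modality-agnostic.
mull_bank = rng.normal(scale=0.02, size=(N_MULL, D_MODEL))

def build_input(prompt_embeds: np.ndarray) -> np.ndarray:
    """Splice the Mull-Token bank after the prompt embeddings.

    prompt_embeds: (T, D_MODEL) already-embedded text and image tokens.
    Returns a (T + N_MULL, D_MODEL) sequence; a transformer would attend
    over the latent slots before emitting the final answer, and gradients
    from the answer loss update mull_bank directly.
    """
    return np.concatenate([prompt_embeds, mull_bank], axis=0)

prompt = rng.normal(size=(5, D_MODEL))  # e.g. 5 fused text/image tokens
seq = build_input(prompt)
print(seq.shape)  # (9, 8)
```

Because the latent slots live in embedding space rather than in a text or image vocabulary, the answer-only fine-tuning stage can shape them without any decoded intermediate supervision.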