Thinking in Text and Images: Interleaved Vision-Language Reasoning Traces for Long-Horizon Robot Manipulation

📅 2026-05-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses long-horizon robotic manipulation, which requires plans that are both semantically coherent and geometrically grounded, a balance that existing vision-language-action methods struggle to achieve. The authors propose Interleaved Vision-Language Reasoning (IVLR), a framework built around an explicit intermediate representation that alternates textual subgoals with visual keyframes, so that semantic intent and spatial detail are planned together. At test time, a single native multimodal Transformer generates this trace from the initial observation and instruction, and a closed-loop action decoder is conditioned on it. Because standard robot datasets lack such traces, training uses pseudo-supervision built by temporally segmenting demonstrations and captioning each stage with a vision-language model. On the LIBERO benchmark, IVLR achieves an average success rate of 95.5% (92.4% on LIBERO-Long), well above the text-only (62.0%) and vision-only (68.4%) trace variants on LIBERO-Long.
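The pseudo-supervision step described above (segment a demonstration into stages, then describe each stage) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `TraceStep`, `build_trace`, and `caption_stage` are hypothetical names, the stage boundaries are assumed to be given, the captioner is a stand-in callable for a vision-language model, and taking the last frame of each stage as its keyframe is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence

Frame = Any  # placeholder for an image (e.g. an array); hypothetical type alias


@dataclass
class TraceStep:
    subgoal: str     # textual subgoal describing one stage
    keyframe: Frame  # visual keyframe paired with that subgoal


def build_trace(
    frames: Sequence[Frame],
    boundaries: Sequence[int],
    caption_stage: Callable[[Sequence[Frame]], str],
) -> List[TraceStep]:
    """Pair each demonstration stage with a caption and a keyframe.

    `boundaries` holds the exclusive end index of each stage, in increasing
    order. `caption_stage` stands in for a vision-language model captioner.
    """
    trace: List[TraceStep] = []
    start = 0
    for end in boundaries:
        if not (start < end <= len(frames)):
            raise ValueError(f"invalid stage boundary {end}")
        stage = frames[start:end]
        # Assumption: the final frame of a stage serves as its keyframe.
        trace.append(TraceStep(subgoal=caption_stage(stage), keyframe=stage[-1]))
        start = end
    return trace
```

The resulting list alternates text and image per stage, which is the shape of supervision an interleaved trace would need; how stages are actually detected is not specified here.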
📝 Abstract
Long-horizon robotic manipulation requires plans that are both logically coherent and geometrically grounded. Existing Vision-Language-Action policies usually hide planning in latent states or expose only one modality: text-only chain-of-thought encodes causal order but misses spatial constraints, while visual prediction provides geometric cues but often remains local and semantically underconstrained. We introduce Interleaved Vision-Language Reasoning (IVLR), a policy framework built around an interleaved reasoning trace: an explicit intermediate representation that alternates textual subgoals with visual keyframes over the full task horizon. At test time, a single native multimodal transformer self-generates this global semantic-geometric trace from the initial observation and instruction, caches it, and conditions a closed-loop action decoder on the trace, original instruction, and current observation. Because standard robot datasets lack such traces, we construct pseudo-supervision by temporally segmenting demonstrations and captioning each stage with a vision-language model. Across simulated benchmarks for long-horizon manipulation and visual distribution shift, IVLR reaches 95.5% average success on LIBERO, including 92.4% on LIBERO-Long, and 59.4% overall success on SimplerEnv-WidowX. Ablations show that both modalities are necessary: without traces, LIBERO-Long success drops to 37.7%; text-only and vision-only traces reach 62.0% and 68.4%, while the full interleaved trace reaches 92.4%. Stress tests with execution perturbations and masked trace content show moderate degradation, suggesting that the trace can tolerate local corruption and moderate execution drift, but remains limited under stale or incorrect global plans.
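The test-time procedure in the abstract (generate the trace once, cache it, then decode actions in a closed loop) can be sketched as below. This is only an illustrative control-flow skeleton under stated assumptions: `run_episode`, `generate_trace`, `decode_action`, and `step_env` are hypothetical callables, and the step budget and termination signal are not taken from the paper.

```python
from typing import Any, Callable, Tuple


def run_episode(
    instruction: str,
    first_obs: Any,
    generate_trace: Callable[[Any, str], Any],
    decode_action: Callable[[Any, str, Any], Any],
    step_env: Callable[[Any], Tuple[Any, bool]],
    max_steps: int = 200,  # hypothetical step budget
) -> int:
    """Run one episode and return the number of environment steps taken."""
    # The trace is produced once from the initial observation and the
    # instruction, then reused (cached) for the rest of the episode.
    trace = generate_trace(first_obs, instruction)
    obs = first_obs
    for t in range(max_steps):
        # Closed loop: each action is conditioned on the cached trace,
        # the original instruction, and the current observation.
        action = decode_action(trace, instruction, obs)
        obs, done = step_env(action)
        if done:
            return t + 1
    return max_steps
```

Because the trace is fixed after the first step in this sketch, an incorrect or outdated plan is never revised, which is consistent with the limitation the abstract notes for stale or incorrect global plans.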
Problem

Research questions and friction points this paper is trying to address.

long-horizon robot manipulation
vision-language reasoning
planning coherence
geometric grounding
multimodal representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaved Vision-Language Reasoning
multimodal planning trace
long-horizon robot manipulation
semantic-geometric grounding
vision-language-action policy