Task Reconstruction and Extrapolation for $\pi_0$ using Text Latent

📅 2025-05-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language-action (VLA) models exhibit limited out-of-distribution task generalization: they struggle to flexibly recombine skills learned from different tasks into novel task configurations. Method: We propose a text-latent temporal interpolation mechanism for task reconfiguration, the first explicit manipulation of internal language representations in VLAs to support cross-task behavioral composition. The approach combines latent-space editing, token-level averaging and interpolation, and reverse decoding of textual latents. Contribution/Results: We uncover an inherent spatial-overfitting bias in VLAs and show that highly functional yet semantically uninterpretable "private instructions" can be decoded from their latent space. On LIBERO-OOD, the method boosts state-of-the-art VLA success rates from <15% to 83%. Remarkably, these uninterpretable prompts achieve a 70% success rate on standard LIBERO, establishing a new paradigm for enhancing VLA generalization and enabling novel safety-aware analysis.

📝 Abstract
Vision-language-action models (VLAs) often achieve high performance on demonstrated tasks but struggle significantly when required to extrapolate, i.e., to combine skills learned from different tasks in novel ways. For instance, VLAs might successfully "put the cream cheese in the bowl" and "put the bowl on top of the cabinet", yet still fail to "put the cream cheese on top of the cabinet". In this work, we demonstrate that behaviors from distinct tasks can be effectively recombined by manipulating the VLA's internal representations at inference time. Concretely, we identify the text latent by averaging the text tokens' hidden states across all demonstrated trajectories for a specific base task. To execute an extrapolated task, we temporally interpolate the text latents of the two base tasks and add the result back to the text hidden states, so that sub-behaviors from the two tasks are activated sequentially. We evaluate this approach on the newly created LIBERO-OOD benchmark, featuring 20 tasks extrapolated from the standard LIBERO suites. The results on LIBERO-OOD show that all SOTA VLAs achieve <15% success rates, while $\pi_0$ with text latent interpolation reaches an 83% success rate. Further qualitative analysis reveals a tendency for VLAs to exhibit spatial overfitting, mapping object names to demonstrated locations rather than achieving genuine object and goal understanding. Additionally, we find that decoding the text latent yields human-unreadable prompts that can nevertheless instruct the VLA to achieve a 70% success rate on the standard LIBERO suites, enabling private instructions or backdoor attacks.
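The core mechanics described above (averaging text-token hidden states into a per-task latent, then temporally interpolating two base-task latents and adding the result back at inference time) can be sketched as follows. This is a minimal NumPy illustration of the general idea, not the paper's implementation; the function names, the linear interpolation schedule, and the `scale` parameter are assumptions for exposition.

```python
import numpy as np

def text_latent(hidden_states_per_step):
    """Average text-token hidden states over all demonstrated
    trajectory steps of one base task.

    hidden_states_per_step: list of (num_text_tokens, d) arrays,
    one per recorded step. Returns a (num_text_tokens, d) latent.
    """
    return np.mean(np.stack(hidden_states_per_step), axis=0)

def interpolated_latent(latent_a, latent_b, t, T):
    """Temporally interpolate between two base-task text latents.
    Early in the episode the first task's latent dominates; later
    the second's, so sub-behaviors are activated sequentially.
    """
    alpha = t / max(T - 1, 1)  # ramps 0 -> 1 over the episode
    return (1.0 - alpha) * latent_a + alpha * latent_b

def steer_text_hidden_states(h_text, latent, scale=1.0):
    """Add the (interpolated) text latent back onto the current
    text hidden states at inference time."""
    return h_text + scale * latent
```

In a real VLA, `hidden_states_per_step` would be captured from a chosen transformer layer (e.g., via forward hooks) while replaying the base task's demonstrations, and `steer_text_hidden_states` would be applied at that same layer during rollout of the extrapolated task.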
Problem

Research questions and friction points this paper is trying to address.

VLAs struggle to combine skills for novel tasks
Spatial overfitting limits genuine object understanding
Text latent interpolation improves task extrapolation success
Innovation

Methods, ideas, or system contributions that make the work stand out.

Manipulate VLA internal representations for recombination
Temporally interpolate text latents for sequential activation
Decode text latents into effective private instructions