LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses cascading failures in general-purpose robots performing long-horizon, multi-stage manipulation tasks in unstructured environments, where complex skill composition and environmental sensitivity often lead to unrecoverable errors. To this end, the authors propose LiLo-VLA, a modular framework that decouples tasks into global transport (a Reaching Module) and local object-centric interaction (an Object-Centric VLA Interaction Module), enabling zero-shot generalization to novel long-horizon tasks. The framework integrates Vision-Language-Action models, object-centric policies, and a dynamic replanning mechanism, significantly improving robustness to task-irrelevant visual features, invariance to spatial configurations, and failure recovery. Evaluated across 21 simulated tasks, LiLo-VLA achieves an average success rate of 69%, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%; it further demonstrates strong real-world performance with an 85% success rate on eight long-horizon physical tasks.

📝 Abstract
General-purpose robots must master long-horizon manipulation, defined as tasks involving multiple kinematic structure changes (e.g., attaching or detaching objects) in unstructured environments. While Vision-Language-Action (VLA) models offer the potential to master diverse atomic skills, they struggle with the combinatorial complexity of sequencing them and are prone to cascading failures due to environmental sensitivity. To address these challenges, we propose LiLo-VLA (Linked Local VLA), a modular framework capable of zero-shot generalization to novel long-horizon tasks without ever being trained on them. Our approach decouples transport from interaction: a Reaching Module handles global motion, while an Interaction Module employs an object-centric VLA to process isolated objects of interest, ensuring robustness against irrelevant visual features and invariance to spatial configurations. Crucially, this modularity facilitates robust failure recovery through dynamic replanning and skill reuse, effectively mitigating the cascading errors common in end-to-end approaches. We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long. In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%. Furthermore, real-world evaluations across 8 long-horizon tasks demonstrate an average success rate of 85%. Project page: https://yy-gx.github.io/LiLo-VLA/.
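The abstract's decoupling of global transport (Reaching Module) from local object-centric interaction (Interaction Module), with dynamic replanning and skill reuse on failure, can be sketched as a simple control loop. This is a minimal illustration of the idea only, not the paper's actual API: the `Skill`, `execute_plan`, `reach`, `interact`, and `replan` names are all hypothetical.

```python
# Hedged sketch of the control structure described in the abstract:
# a Reaching Module transports the arm to the object of interest, an
# object-centric Interaction Module performs the local manipulation,
# and a replanner reinserts recovery skills when an interaction fails.
# All names here are illustrative assumptions, not LiLo-VLA's code.
from dataclasses import dataclass

@dataclass
class Skill:
    name: str    # atomic skill, e.g. "open_drawer" (illustrative)
    target: str  # object of interest isolated for the object-centric VLA

def execute_plan(plan, reach, interact, replan, max_retries=3):
    """Run skills in order; on a failed interaction, ask the replanner
    for a recovery prefix (skill reuse) and retry the failed skill."""
    queue = list(plan)
    done = []
    retries = 0
    while queue:
        skill = queue.pop(0)
        reach(skill.target)      # global transport to the object
        if interact(skill):      # local object-centric interaction
            done.append(skill.name)
            retries = 0
        else:
            if retries >= max_retries:
                return done, False          # unrecoverable: report progress
            retries += 1
            # Dynamic replanning: prepend recovery skills, then retry.
            queue = replan(skill) + [skill] + queue
    return done, True
```

Because failures are caught at the skill boundary rather than propagated through an end-to-end policy, a single failed interaction triggers local recovery instead of derailing the whole task — which is the mechanism the abstract credits for mitigating cascading errors.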
Problem

Research questions and friction points this paper is trying to address.

long-horizon manipulation
Vision-Language-Action models
combinatorial complexity
cascading failures
kinematic structure changes
Innovation

Methods, ideas, or system contributions that make the work stand out.

modular VLA
object-centric policy
long-horizon manipulation
zero-shot generalization
failure recovery