🤖 AI Summary
To address the challenges of low-shot (10–20 demonstrations) robotic manipulation adaptation in real-world settings—namely, the large sim-to-real gap and the limited generalization and scalability of existing pipelines—this paper proposes a lightweight, controllable Vision-Language-Action (VLA) model adaptation framework. Methodologically, it introduces a ControlNet-style zero-initialized projection layer that conditions the frozen pre-trained VLA backbone on object-centric representations, thereby decoupling knowledge retention from task-specific modeling. The framework enables cross-object, cross-background, and long-horizon task transfer without fine-tuning the backbone. Evaluated on six real-world manipulation tasks—including pouring cubes and folding clothes—it achieves a 76.7% average success rate, substantially outperforming conventional methods that require hundreds of demonstrations. This work establishes a new paradigm for data-efficient, highly generalizable, end-to-end VLA deployment in realistic robotic settings.
📝 Abstract
Learning real-world robotic manipulation is challenging, particularly when limited demonstrations are available. Existing methods for few-shot manipulation often rely on simulation-augmented data or pre-built modules like grasping and pose estimation, which struggle with sim-to-real gaps and lack extensibility. While large-scale imitation pre-training shows promise, adapting these general-purpose policies to specific tasks in data-scarce settings remains unexplored. To this end, we propose ControlVLA, a novel framework that bridges pre-trained VLA models with object-centric representations via a ControlNet-style architecture for efficient fine-tuning. Specifically, to introduce object-centric conditions without overwriting prior knowledge, ControlVLA zero-initializes a set of projection layers, allowing them to gradually adapt the pre-trained manipulation policies. In real-world experiments across 6 diverse tasks, including pouring cubes and folding clothes, our method achieves a 76.7% success rate while requiring only 10-20 demonstrations -- a significant improvement over traditional approaches that require more than 100 demonstrations to achieve comparable success. Additional experiments highlight ControlVLA's extensibility to long-horizon tasks and robustness to unseen objects and backgrounds.
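The zero-initialization idea can be sketched in a few lines: because the projection layer's weights start at zero, the object-centric branch contributes nothing at the first fine-tuning step, so the frozen backbone's behavior is exactly preserved and the conditioning signal fades in as training updates the weights. The following is a minimal illustrative sketch, not the paper's code; all names and dimensions here are hypothetical.

```python
import numpy as np

class ZeroInitProjection:
    """Linear layer whose weights and bias start at zero, so its
    output is zero at initialization (ControlNet-style)."""
    def __init__(self, in_dim, out_dim):
        self.W = np.zeros((in_dim, out_dim))
        self.b = np.zeros(out_dim)

    def __call__(self, x):
        return x @ self.W + self.b

rng = np.random.default_rng(42)
feat_dim, obj_dim = 8, 4
backbone_features = rng.normal(size=feat_dim)  # stand-in for frozen VLA backbone output
object_features = rng.normal(size=obj_dim)     # stand-in for object-centric condition

proj = ZeroInitProjection(obj_dim, feat_dim)

# At initialization the projection contributes nothing, so the
# pre-trained policy's output is exactly preserved.
conditioned = backbone_features + proj(object_features)
assert np.allclose(conditioned, backbone_features)

# After fine-tuning updates, the weights become non-zero and the
# object-centric condition gradually influences the policy.
proj.W = rng.normal(size=proj.W.shape) * 0.01
conditioned = backbone_features + proj(object_features)
assert not np.allclose(conditioned, backbone_features)
```

The design choice mirrors ControlNet: new conditioning pathways are grafted on without perturbing the pre-trained network at step zero, which is what lets fine-tuning proceed from 10-20 demonstrations without catastrophic forgetting.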