ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models

📅 2025-06-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of adapting robotic manipulation policies from only a few (10–20) demonstrations in real-world settings, where sim-to-real gaps and limited generalization and scalability hamper existing methods, this paper proposes a lightweight, controllable adaptation framework for pre-trained Vision-Language-Action (VLA) models. Methodologically, it introduces ControlNet-style zero-initialized projection layers that condition the frozen pre-trained VLA backbone on object-centric representations, decoupling knowledge retention from task-specific adaptation. The framework supports cross-object, cross-background, and long-horizon task transfer without fine-tuning the backbone. Evaluated on six real-world manipulation tasks, including pouring cubes and folding clothes, it achieves a 76.7% average success rate from 10–20 demonstrations per task, whereas conventional approaches need more than 100 demonstrations to reach comparable success. The result points to a data-efficient, generalizable route to end-to-end VLA deployment in realistic robotic settings.
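A minimal sketch of the ControlNet-style conditioning described above, assuming a transformer-style VLA backbone whose hidden tokens can be additively modulated; the class name `ZeroInitProjection` and all dimensions are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class ZeroInitProjection(nn.Module):
    """Projects object-centric features into the backbone's hidden space.

    The output projection starts at exactly zero, so at step 0 the adapted
    policy behaves identically to the frozen pre-trained policy; the condition
    is blended in only as training moves these weights away from zero.
    """
    def __init__(self, obj_dim: int, hidden_dim: int):
        super().__init__()
        self.in_proj = nn.Linear(obj_dim, hidden_dim)
        self.out_proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.out_proj.weight)   # zero-initialized, ControlNet-style
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, hidden: torch.Tensor, obj_feat: torch.Tensor) -> torch.Tensor:
        # hidden:   (B, T, hidden_dim) tokens from the frozen VLA backbone
        # obj_feat: (B, K, obj_dim)    object-centric condition tokens
        cond = self.out_proj(torch.relu(self.in_proj(obj_feat)))
        # broadcast-add the (initially zero) condition onto the backbone tokens
        return hidden + cond.mean(dim=1, keepdim=True)
```

Because the output projection starts at zero, the new conditioning pathway cannot overwrite the pre-trained policy at initialization, which is what allows the model to be fine-tuned on only a handful of demonstrations without losing prior knowledge.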

📝 Abstract
Learning real-world robotic manipulation is challenging, particularly when limited demonstrations are available. Existing methods for few-shot manipulation often rely on simulation-augmented data or pre-built modules like grasping and pose estimation, which struggle with sim-to-real gaps and lack extensibility. While large-scale imitation pre-training shows promise, adapting these general-purpose policies to specific tasks in data-scarce settings remains unexplored. To achieve this, we propose ControlVLA, a novel framework that bridges pre-trained VLA models with object-centric representations via a ControlNet-style architecture for efficient fine-tuning. Specifically, to introduce object-centric conditions without overwriting prior knowledge, ControlVLA zero-initializes a set of projection layers, allowing them to gradually adapt the pre-trained manipulation policies. In real-world experiments across 6 diverse tasks, including pouring cubes and folding clothes, our method achieves a 76.7% success rate while requiring only 10-20 demonstrations -- a significant improvement over traditional approaches that require more than 100 demonstrations to achieve comparable success. Additional experiments highlight ControlVLA's extensibility to long-horizon tasks and robustness to unseen objects and backgrounds.
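The abstract implies a fine-tuning recipe in which the pre-trained backbone stays frozen and only the new zero-initialized layers are trained on the 10-20 demonstrations. The loop below is a hedged sketch of that setup using toy stand-ins (a small MLP in place of the VLA backbone, random tensors in place of demonstrations, and the `ZeroInitProjection` module sketched above); none of these names come from the paper.

```python
import torch
import torch.nn as nn

# Toy stand-ins for illustration only; the real backbone is a large pre-trained
# VLA transformer, not this MLP.
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
action_head = nn.Linear(64, 7)                           # e.g. a 7-DoF action output
adapter = ZeroInitProjection(obj_dim=16, hidden_dim=64)  # from the sketch above

backbone.requires_grad_(False)       # keep pre-trained weights intact
action_head.requires_grad_(False)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

for step in range(100):                       # few demos, many optimization passes
    obs = torch.randn(8, 32)                  # placeholder observations
    obj_feat = torch.randn(8, 4, 16)          # placeholder object-centric tokens
    target = torch.randn(8, 7)                # placeholder demonstrated actions
    hidden = backbone(obs).unsqueeze(1)       # (B, 1, hidden_dim), frozen forward pass
    hidden = adapter(hidden, obj_feat)        # zero-initialized conditioning branch
    loss = nn.functional.mse_loss(action_head(hidden.squeeze(1)), target)
    optimizer.zero_grad()
    loss.backward()                           # gradients reach only the adapter
    optimizer.step()
```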
Problem

Research questions and friction points this paper is trying to address.

Adapting pre-trained VLA models for few-shot robotic manipulation
Overcoming the sim-to-real gap without relying on extensive demonstrations
Enabling object-centric task adaptation with minimal data
Innovation

Methods, ideas, or system contributions that make the work stand out.

ControlNet-style architecture for efficient fine-tuning of a frozen VLA backbone
Zero-initialized projection layers that introduce new conditions gradually, without overwriting pre-trained knowledge
Integration of object-centric representations as task-specific conditions (see the sketch after this list)
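The page does not specify how the object-centric representations are computed; a common recipe, used here purely as an assumption for illustration, is to pool visual features inside per-object segmentation masks so that each tracked object yields one condition token for the zero-initialized projection. The function name and shapes below are hypothetical.

```python
import torch

def mask_pooled_object_tokens(feat_map: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Hypothetical object-centric conditioning: one pooled feature token per object.

    feat_map: (B, C, H, W) feature map from any image encoder
    masks:    (B, K, H, W) binary masks for K tracked objects
    returns:  (B, K, C)    per-object condition tokens
    """
    masks = masks.float()
    area = masks.sum(dim=(-2, -1)).clamp(min=1.0)              # pixels per mask, (B, K)
    pooled = torch.einsum("bchw,bkhw->bkc", feat_map, masks)   # sum features inside each mask
    return pooled / area.unsqueeze(-1)                         # average over mask area
```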
Puhao Li
Ph.D. Student, Tsinghua University
Computer Vision, Robotics, Machine Learning
Yingying Wu
Tsinghua University
Ziheng Xi
Undergraduate Student, Department of Automation, Tsinghua University
Machine Learning, Deep Learning, Pattern Recognition
Wanlin Li
State Key Lab of General Artificial Intelligence, BIGAI
Yuzhe Huang
State Key Lab of General Artificial Intelligence, BIGAI
Zhiyuan Zhang
State Key Lab of General Artificial Intelligence, BIGAI
Yinghan Chen
State Key Lab of General Artificial Intelligence, BIGAI
Jianan Wang
Astribot / IDEA / DeepMind / Oxford
Computer Vision, Generative AI, Robotics, Learning Theory
Song-Chun Zhu
Tsinghua University
Tengyu Liu
Beijing Institute for General Artificial Intelligence
Computer Vision, Human-Object Interaction, Human Motion Generation, Grasping
Siyuan Huang
State Key Lab of General Artificial Intelligence, BIGAI