FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies

📅 2025-09-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion-based vision-language-action (VLA) policies rely on multi-billion-parameter architectures and massive datasets, incurring prohibitive computational costs that hinder real-world robot deployment. To address this, we propose an efficient VLA model featuring an intermediate modality fusion mechanism and action-specific Global-AdaLN conditioning modules, enabling significant parameter reduction. We further integrate LLM layer pruning with a modular diffusion architecture, enabling full pretraining within only 200 H100 GPU-hours. Our model contains just 950 million parameters yet achieves a state-of-the-art score of 4.53 on the CALVIN ABC benchmark. Moreover, it matches or exceeds the performance of substantially larger models across 190 diverse tasks spanning both simulation and real-world robotic platforms. This work demonstrates a lightweight, high-performance, and strongly generalizable universal robot controller.
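The summary mentions "Global-AdaLN" conditioning as the main parameter-saving device in the diffusion head. The paper's own implementation is not reproduced here, but the idea behind a *global* adaptive layer norm can be sketched: instead of each transformer block learning its own conditioning-to-(scale, shift) projection, one shared projection is reused by all blocks. The sketch below is a minimal NumPy illustration under that assumption; all names and sizes are hypothetical.

```python
import numpy as np

# Hypothetical sketch: "Global-AdaLN" shares one modulation projection
# across all blocks, vs. standard per-block AdaLN (as in DiT-style heads),
# which learns a separate projection for every block.
rng = np.random.default_rng(0)
d_model, d_cond, n_blocks = 64, 32, 12

# Per-block AdaLN: n_blocks independent projections cond -> (scale, shift)
per_block_params = n_blocks * (d_cond * 2 * d_model)

# Global-AdaLN: one shared projection, reused by every block
W_global = rng.normal(0.0, 0.02, (d_cond, 2 * d_model))
global_params = W_global.size

def layer_norm(x, eps=1e-5):
    # Normalize features to zero mean / unit variance per token.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def global_adaln(x, cond):
    # Modulate normalized features with conditioning-derived scale/shift.
    scale, shift = np.split(cond @ W_global, 2, axis=-1)
    return layer_norm(x) * (1.0 + scale) + shift

x = rng.normal(size=(1, d_model))     # token features inside one block
cond = rng.normal(size=(1, d_cond))   # action/timestep conditioning
y = global_adaln(x, cond)
print(y.shape, per_block_params // global_params)  # (1, 64) 12
```

Under this toy sizing, the shared projection uses 12x fewer modulation parameters than per-block AdaLN, which is the kind of saving the summary attributes to the action-specific conditioning modules.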

📝 Abstract
Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to 50% of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by 20% through modular adaptation. We integrate these advances into a novel 950M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers competitive performance with bigger VLAs across 190 tasks spanning ten simulation and real-world benchmarks and demonstrates robustness across diverse robotic embodiments. In addition, FLOWER achieves a new SoTA of 4.53 on the CALVIN ABC benchmark. Demos, code and pretrained weights are available at https://intuitive-robots.github.io/flower_vla/.
Problem

Research questions and friction points this paper is trying to address.

Developing efficient vision-language-action policies for robotics
Reducing computational costs and resource requirements
Achieving competitive performance with smaller models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prunes up to 50% of LLM layers for efficiency
Uses Global-AdaLN conditioning to cut parameters by 20%
Integrates these advances into FLOWER, a 950M-parameter VLA model
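The first innovation, dropping half of the LLM's transformer layers, can be illustrated with a toy sketch. The class and method names below are invented for illustration; the actual pruning criterion (which layers to drop, and how capacity is reallocated to the diffusion head) is the paper's, not shown here. A common simple strategy, assumed here, is keeping the early layers and truncating the stack.

```python
# Hypothetical illustration of truncating a transformer stack to a
# keep_ratio of its layers. Names and structure are invented.
class Block:
    """Stand-in for one transformer block (attention + MLP)."""
    def __call__(self, x):
        return x + 1  # placeholder computation

class TinyLM:
    def __init__(self, n_layers=24):
        self.blocks = [Block() for _ in range(n_layers)]

    def prune(self, keep_ratio=0.5):
        # Keep only the earliest fraction of layers.
        keep = int(len(self.blocks) * keep_ratio)
        self.blocks = self.blocks[:keep]

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

lm = TinyLM(n_layers=24)
lm.prune(keep_ratio=0.5)
print(len(lm.blocks), lm.forward(0))  # 12 12
```

The point of the sketch is the accounting, not the modeling: halving the stack halves the LLM's depth-dependent compute and parameters, which is the budget the summary says gets reallocated to the diffusion head.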