Dexbotic: Open-Source Vision-Language-Action Toolbox

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenges of poor reproducibility and fragmented development in embodied intelligence research, this paper introduces an open-source, PyTorch-native Vision-Language-Action (VLA) model toolbox. Designed with an experiment-driven architecture, it uniformly supports major VLA paradigms—including BC-Z, RT-2, and OpenVLA—enabling one-click reproduction and modular extension. The toolbox integrates standardized interfaces to pre-trained vision-language foundation models, facilitating seamless incorporation of state-of-the-art backbones. It further provides a unified evaluation protocol and plug-and-play action decoding modules. We successfully reproduce six SOTA methods and validate the toolbox on benchmarks such as BridgeData v2: fine-tuning with stronger foundation models yields consistent performance gains, improving average task success rate by +4.2%. All code, trained models, and experimental scripts are publicly released.
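The summary mentions "plug-and-play action decoding modules." The sketch below illustrates what such a decoder registry could look like in plain Python; the class and registry names are assumptions for illustration, not the actual Dexbotic API. The example decoder implements RT-2-style discretization, where continuous actions are quantized into token bins and decoded back to bin centers.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class ActionDecoder(ABC):
    """Hypothetical plug-and-play decoder interface (illustrative, not Dexbotic's API)."""

    @abstractmethod
    def decode(self, bin_indices: List[int]) -> List[float]:
        """Map model outputs to continuous action values."""


class DiscreteBinDecoder(ActionDecoder):
    """RT-2-style decoding: each action dimension is a bin index over [low, high]."""

    def __init__(self, n_bins: int = 256, low: float = -1.0, high: float = 1.0):
        self.n_bins, self.low, self.high = n_bins, low, high

    def decode(self, bin_indices: List[int]) -> List[float]:
        width = (self.high - self.low) / self.n_bins
        # Return the center of each bin as the continuous action value.
        return [self.low + (i + 0.5) * width for i in bin_indices]


# A registry makes decoders swappable per experiment without touching policy code.
DECODERS: Dict[str, Type[ActionDecoder]] = {"discrete_bins": DiscreteBinDecoder}

decoder = DECODERS["discrete_bins"](n_bins=4, low=0.0, high=1.0)
print(decoder.decode([0, 3]))  # → [0.125, 0.875]
```

A new decoding scheme (e.g. a diffusion or flow-matching head) would subclass `ActionDecoder` and register itself under a new key, which is the usual pattern behind "plug-and-play" modules.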

📝 Abstract
In this paper, we present Dexbotic, an open-source Vision-Language-Action (VLA) model toolbox based on PyTorch. It aims to provide a one-stop VLA research service for professionals in the field of embodied intelligence. It offers a codebase that supports multiple mainstream VLA policies simultaneously, allowing users to reproduce various VLA methods with just a single environment setup. The toolbox is experiment-centric, where the users can quickly develop new VLA experiments by simply modifying the Exp script. Moreover, we provide much stronger pretrained models to achieve great performance improvements for state-of-the-art VLA policies. Dexbotic will continuously update to include more of the latest pre-trained foundation models and cutting-edge VLA models in the industry.
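The abstract describes an experiment-centric workflow in which a new experiment is defined by "simply modifying the Exp script." A minimal sketch of that pattern, assuming a dataclass-style config (all field names here are hypothetical, not the real Dexbotic interface):

```python
from dataclasses import dataclass, field


@dataclass
class Exp:
    """Hypothetical experiment config in the spirit of an 'Exp script'.

    Field names (policy, backbone, dataset, ...) are illustrative assumptions.
    """
    policy: str = "openvla"
    backbone: str = "qwen2-vl-7b"
    dataset: str = "bridgedata_v2"
    lr: float = 2e-5
    batch_size: int = 64
    extra: dict = field(default_factory=dict)

    def with_overrides(self, **kwargs) -> "Exp":
        # Derive a variant experiment by overriding only the fields that change.
        return Exp(**{**self.__dict__, **kwargs})


base = Exp()
ablation = base.with_overrides(backbone="paligemma-3b", lr=1e-5)
print(ablation.policy, ablation.backbone)  # → openvla paligemma-3b
```

The appeal of this design is that a swap of policy, backbone, or dataset is a one-line override rather than a new codebase and environment, which is what enables single-environment reproduction across methods.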
Problem

Research questions and friction points this paper is trying to address.

Embodied-intelligence research suffers from poor reproducibility and fragmented development across VLA codebases
Reproducing different VLA methods typically requires a separate environment and setup per method
State-of-the-art VLA policies leave performance on the table by not using stronger pre-trained foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Experiment-centric, PyTorch-native open-source toolbox covering mainstream VLA policies
Single environment setup with Exp scripts for one-click reproduction and rapid extension
Stronger pre-trained foundation models that deliver consistent performance gains for state-of-the-art VLA policies