VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the fragmentation of open-source vision-language-action (VLA) training pipelines: most efforts focus on the action-training stage and stitch together incompatible pretraining pipelines. The authors present VLA Foundry, an open-source framework that unifies large language model (LLM), vision-language model (VLM), and VLA training in a single codebase, with end-to-end control from language pretraining to action-expert fine-tuning. The framework supports both from-scratch training and pretrained Hugging Face backbones such as Qwen3-VL. The authors also contribute usability improvements to the LBM Eval simulator and the STEP analysis tools. In closed-loop evaluation on LBM Eval under the nominal setting, the fully open from-scratch model is on par with the authors' prior closed-source work, and the Qwen3-VL-based model yields a strong multi-task tabletop manipulation policy that outperforms the authors' baseline by a wide margin.

📝 Abstract

We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize in the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM→VLM→VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully open from-scratch model is on par with our prior closed-source work, and substituting in the Qwen3-VL backbone leads to a strong multi-task tabletop manipulation policy that outperforms our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released at https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website https://tri-ml.github.io/vla_foundry.
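
The two training routes described in the abstract can be pictured as a short list of stages, each initialized from the previous one. The sketch below is only an illustration of that idea in plain Python: the `Stage` class, the `build_pipeline` function, and the placeholder backbone name are hypothetical and are not taken from the VLA Foundry codebase. It also assumes that a pretrained VLM backbone replaces the LLM and VLM stages, which the abstract implies but does not spell out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    """One training stage (hypothetical structure, for illustration only)."""
    name: str                 # "llm", "vlm", or "vla"
    objective: str            # what this stage trains for
    init_from: Optional[str]  # earlier stage or backbone name; None = random init


def build_pipeline(pretrained_vlm: Optional[str] = None) -> List[Stage]:
    """Return the ordered stages for one of the two routes in the abstract.

    Assumption: when a pretrained VLM backbone is supplied, only the
    action stage is trained, starting from that backbone.
    """
    vla_objective = "action-expert fine-tuning"
    if pretrained_vlm is not None:
        return [Stage("vla", vla_objective, init_from=pretrained_vlm)]
    return [
        Stage("llm", "language pretraining", init_from=None),
        Stage("vlm", "vision-language training", init_from="llm"),
        Stage("vla", vla_objective, init_from="vlm"),
    ]


if __name__ == "__main__":
    # From-scratch route: LLM -> VLM -> VLA.
    for stage in build_pipeline():
        print(stage)
    # Pretrained-backbone route; the string is a placeholder, not a real model ID.
    for stage in build_pipeline("pretrained-vlm-placeholder"):
        print(stage)
```

Keeping both routes behind one function mirrors the abstract's point about a shared training stack: the action stage is defined once, and only its initialization differs between the from-scratch and pretrained-backbone models.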

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

Unified Training Framework

End-to-End Control

Open-Source VLA

Pretraining Pipeline Compatibility

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action

Unified Training Framework

End-to-End VLA Pipeline