Hybrid Training for Vision-Language-Action Models

📅 2025-10-01
🤖 AI Summary
While chain-of-thought (CoT) reasoning enhances decision-making in vision-language-action (VLA) models, it incurs substantial inference latency due to mandatory sequential CoT generation.
Method: We propose HyT, a hybrid training framework that incorporates CoT supervision during training to improve reasoning capability, yet enables on-demand, mode-switchable inference (outputting actions, intermediate thoughts, or high-level instructions) without requiring real-time full-CoT generation.
Contribution/Results: HyT introduces the first "train-with, discard-at-inference" CoT paradigm, integrating conditional generation modeling with a unified multimodal architecture. Evaluated across multiple simulated benchmarks and real-world robotic manipulation tasks, HyT maintains or improves task performance while significantly reducing end-to-end response latency, demonstrating both practical deployability and inference flexibility.

📝 Abstract
Using Large Language Models to produce intermediate thoughts, a.k.a. chain-of-thought (CoT), before providing an answer has been a successful recipe for solving complex language tasks. In robotics, similar embodied CoT strategies, which generate thoughts before actions, have also been shown to improve performance when using Vision-Language-Action models (VLAs). Because these techniques lengthen the model's generated outputs to include the thoughts, inference time suffers. Delaying an agent's actions in real-world executions, as in robotic manipulation settings, strongly limits a method's usability, since tasks require long sequences of actions. However, is generating long chains of thought a strict prerequisite for these performance improvements? In this work, we explore Hybrid Training (HyT), a framework that enables VLAs to learn from thoughts and benefit from the associated performance gains, while allowing CoT generation to be left out at inference. Furthermore, by learning to conditionally predict a diverse set of outputs, HyT supports flexibility at inference time, enabling the model to predict actions directly, generate thoughts, or follow instructions. We evaluate the proposed method on a series of simulated benchmarks and in real-world experiments.
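The conditional-prediction idea in the abstract can be sketched as mode-conditioned training data: the same observation/task pair maps to different targets depending on a requested output mode, so CoT supervision is mixed in during training but need not be generated at inference. This is a minimal illustrative sketch; the mode tokens, function names, and data format below are assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of HyT-style mode-conditioned training examples.
# MODE_TOKENS and build_example are illustrative names, not from the paper.

MODE_TOKENS = {
    "action": "<act>",        # predict low-level actions directly
    "thought": "<think>",     # generate intermediate CoT reasoning
    "instruction": "<inst>",  # emit a high-level instruction
}

def build_example(observation: str, task: str, mode: str, target: str) -> dict:
    """Prefix the prompt with a mode token so the model learns conditional
    generation: the requested mode selects which kind of output to produce."""
    if mode not in MODE_TOKENS:
        raise ValueError(f"unknown mode: {mode}")
    return {
        "prompt": f"{MODE_TOKENS[mode]} task: {task} obs: {observation}",
        "target": target,
    }

# Hybrid training mixes all modes in one dataset; at inference, requesting
# only the action mode skips CoT generation entirely.
batch = [
    build_example("cube on table", "pick up the cube", "thought",
                  "The cube is within reach; move the gripper above it."),
    build_example("cube on table", "pick up the cube", "action",
                  "move_to(0.4, 0.1, 0.2); close_gripper()"),
]
```

The design choice here is that the thoughts are a training-time signal, not an inference-time dependency: dropping the "thought" mode at deployment changes the prompt prefix, not the model.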
Problem

Research questions and friction points this paper is trying to address.

Reducing inference time delays in vision-language-action models
Enabling performance gains without requiring chain-of-thought generation
Providing flexible inference options for robotic manipulation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Training enables thought-free VLA inference
Conditional prediction supports diverse output modes
Framework maintains performance while accelerating actions
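The inference-time flexibility listed above can be sketched as a mode switch in front of a trained generator: the caller picks which output mode to request, and requesting actions directly avoids generating any chain of thought. The function name, mode tokens, and prompt format are hypothetical, chosen only to illustrate the mechanism.

```python
# Illustrative inference-time mode switch (hypothetical API; the paper's
# actual interface is not specified here).

def infer(model_generate, observation: str, task: str, mode: str = "action") -> str:
    """Request one output mode from a mode-conditioned model.
    Choosing 'action' skips CoT generation, reducing response latency."""
    tokens = {"action": "<act>", "thought": "<think>", "instruction": "<inst>"}
    if mode not in tokens:
        raise ValueError(f"unknown mode: {mode}")
    prompt = f"{tokens[mode]} task: {task} obs: {observation}"
    return model_generate(prompt)

# A stub generator standing in for the trained VLA, echoing the mode token:
out = infer(lambda p: f"generated for {p.split(' ', 1)[0]}",
            "cube on table", "pick up the cube")
```

Switching to mode="thought" on the same model would yield the reasoning trace instead, which is useful for debugging without paying its latency cost during routine execution.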