Training-Time Action Conditioning for Efficient Real-Time Chunking

📅 2025-12-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Real-time chunking (RTC) for vision-language-action (VLA) models conditions each new action chunk on previously committed actions through inference-time inpainting, which adds computation and increases inference latency. This work introduces *training-time action prefix conditioning*: inference delay is simulated during training, and the previously committed actions are supplied to the model as a conditioning prefix, so no inpainting is needed at run time. The method requires no changes to the model architecture or the robot runtime, only a few additional lines of code. In simulation, training-time RTC outperforms inference-time RTC at higher inference delays. In real-robot box building and espresso making with the $π_{0.6}$ VLA, it matches inference-time RTC in task performance and speed while being computationally cheaper.

📝 Abstract

Real-time chunking (RTC) enables vision-language-action models (VLAs) to generate smooth, reactive robot trajectories by asynchronously predicting action chunks and conditioning on previously committed actions via inference-time inpainting. However, this inpainting method introduces computational overhead that increases inference latency. In this work, we propose a simple alternative: simulating inference delay at training time and conditioning on action prefixes directly, eliminating any inference-time overhead. Our method requires no modifications to the model architecture or robot runtime, and can be implemented with only a few additional lines of code. In simulated experiments, we find that training-time RTC outperforms inference-time RTC at higher inference delays. In real-world experiments on box building and espresso making tasks with the $π_{0.6}$ VLA, we demonstrate that training-time RTC maintains both task performance and speed parity with inference-time RTC while being computationally cheaper. Our results suggest that training-time action conditioning is a practical drop-in replacement for inference-time inpainting in real-time robot control.
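The training-side idea in the abstract (simulate an inference delay, then condition on the action prefix) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the uniform delay distribution, the linear noise interpolation, and the choice to exclude prefix steps from the loss are all assumptions made for the example.

```python
import numpy as np


def sample_prefix_mask(batch_size, horizon, max_delay, rng):
    """Sample one simulated delay per example and mark the first `delay` steps as prefix."""
    # Assumption: delays are uniform over 0..max_delay (inclusive), in control steps.
    delays = rng.integers(0, max_delay + 1, size=batch_size)
    steps = np.arange(horizon)[None, :]        # shape (1, H)
    prefix_mask = steps < delays[:, None]      # shape (B, H), True on prefix steps
    return prefix_mask, delays


def build_model_input(actions, noise, t, prefix_mask):
    """Keep ground-truth actions on the prefix; noise only the remaining steps."""
    t = t[:, None, None]                       # shape (B, 1, 1)
    # Assumed interpolation convention: t = 0 is pure noise, t = 1 is clean actions.
    noisy = (1.0 - t) * noise + t * actions
    return np.where(prefix_mask[:, :, None], actions, noisy)


def postfix_mse(pred, target, prefix_mask):
    """Mean squared error computed over non-prefix steps only."""
    weights = (~prefix_mask)[:, :, None].astype(pred.dtype)   # shape (B, H, 1)
    count = max(float(weights.sum()) * pred.shape[-1], 1.0)
    return float((((pred - target) ** 2) * weights).sum() / count)
```

In this sketch the model sees clean actions wherever the mask is true, so at run time the same input slots can be filled with actions that are already committed, with no inpainting step.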

Problem

Research questions and friction points this paper is trying to address.

Reduces computational overhead in real-time robot trajectory generation

Removes the inference-time overhead of inpainting by simulating inference delay during training

Maintains task performance without modifying the model architecture or robot runtime

Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulating inference delay at training time

Conditioning on action prefixes directly

Eliminating inference-time computational overhead
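On the runtime side, a prefix-conditioned policy only needs the actions that will execute while the next chunk is being computed. The sketch below is a hypothetical illustration of that bookkeeping: `delay_in_steps`, `committed_prefix`, and the millisecond-based conversion are assumptions for the example and are not taken from the paper.

```python
import numpy as np


def delay_in_steps(latency_ms, control_period_ms):
    """Round the inference latency up to a whole number of control steps."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-latency_ms // control_period_ms)


def committed_prefix(prev_chunk, start_step, delay_steps):
    """Return the actions from the previous chunk that run during inference."""
    prefix = prev_chunk[start_step:start_step + delay_steps]
    if len(prefix) < delay_steps:
        raise ValueError("previous chunk is too short to cover the inference delay")
    return prefix


# Example: 110 ms latency at a 20 ms control period means 6 committed steps.
prev_chunk = np.arange(10, dtype=float)[:, None]   # 10 steps, 1 action dimension
d = delay_in_steps(110, 20)
prefix = committed_prefix(prev_chunk, start_step=2, delay_steps=d)
```

The returned prefix would be passed to the policy as conditioning input for the next chunk, which is the role inpainting plays in inference-time RTC.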
