🤖 AI Summary
Real-time chunking (RTC) for vision-language-action (VLA) models conditions each new action chunk on previously committed actions through inference-time inpainting, which adds computational overhead and increases inference latency. This work introduces *training-time action prefix conditioning*: inference delay is simulated during training, and the action prefix (the actions already committed for execution) is given to the model as a conditioning input, so no action inpainting is needed at runtime. The approach moves action conditioning from inference time to training time and requires no changes to the model architecture or robot runtime, only a few additional lines of code. In simulation, training-time RTC outperforms inference-time RTC at higher inference delays. In real-robot experiments on box building and espresso making, it matches inference-time RTC in both task performance and speed while being computationally cheaper.
📝 Abstract
Real-time chunking (RTC) enables vision-language-action models (VLAs) to generate smooth, reactive robot trajectories by asynchronously predicting action chunks and conditioning on previously committed actions via inference-time inpainting. However, this inpainting method introduces computational overhead that increases inference latency. In this work, we propose a simple alternative: simulating inference delay at training time and conditioning on action prefixes directly, eliminating any inference-time overhead. Our method requires no modifications to the model architecture or robot runtime, and can be implemented with only a few additional lines of code. In simulated experiments, we find that training-time RTC outperforms inference-time RTC at higher inference delays. In real-world experiments on box building and espresso making tasks with the $π_{0.6}$ VLA, we demonstrate that training-time RTC maintains both task performance and speed parity with inference-time RTC while being computationally cheaper. Our results suggest that training-time action conditioning is a practical drop-in replacement for inference-time inpainting in real-time robot control.
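The abstract gives no implementation details, so the following is only a minimal sketch of what "simulating inference delay at training time and conditioning on action prefixes" could look like for a flow-matching action head. Everything in it is an assumption made for illustration: the function names `make_prefix_conditioned_example` and `masked_mse`, the uniform delay distribution, the interpolation convention (`tau = 1` is clean data), and the choice to exclude prefix steps from the loss. It is not the authors' code.

```python
import random


def make_prefix_conditioned_example(chunk, max_delay, rng):
    """Build one training example with a simulated inference delay.

    chunk: list of actions, each a list of floats (ground-truth action chunk).
    max_delay: largest simulated delay, in control steps (hypothetical knob).
    rng: a random.Random instance.
    """
    horizon = len(chunk)
    # Assumption: delay sampled uniformly; keep at least one step to predict.
    delay = rng.randint(0, min(max_delay, horizon - 1))
    # Assumption: flow-matching time in [0, 1), where tau = 1 means clean data.
    tau = rng.random()

    model_input, target, loss_mask = [], [], []
    for step, action in enumerate(chunk):
        if step < delay:
            # Prefix: actions treated as already committed, so they are fed
            # to the model clean and contribute nothing to the loss.
            model_input.append(list(action))
            target.append([0.0] * len(action))
            loss_mask.append(0.0)
        else:
            # Postfix: standard flow-matching corruption and velocity target.
            noise = [rng.gauss(0.0, 1.0) for _ in action]
            model_input.append(
                [tau * a + (1.0 - tau) * n for a, n in zip(action, noise)]
            )
            target.append([a - n for a, n in zip(action, noise)])
            loss_mask.append(1.0)

    return {
        "delay": delay,
        "tau": tau,
        "model_input": model_input,
        "target": target,
        "loss_mask": loss_mask,
    }


def masked_mse(prediction, target, loss_mask):
    """Mean squared error over postfix steps only (mask == 1)."""
    total, count = 0.0, 0
    for pred_row, target_row, mask in zip(prediction, target, loss_mask):
        if mask == 0.0:
            continue
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    rng = random.Random(0)
    chunk = [[float(i), -float(i)] for i in range(8)]  # toy 8-step, 2-D chunk
    example = make_prefix_conditioned_example(chunk, max_delay=4, rng=rng)
    print("simulated delay:", example["delay"])
    print("loss mask:", example["loss_mask"])
```

Under these assumptions, inference would fill the prefix slots with the actions from the previous chunk that will execute while the model is computing, then denoise only the remaining steps with the ordinary sampling loop, which is why no separate inpainting pass is required.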