Training-Time Action Conditioning for Efficient Real-Time Chunking

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inference-time overhead of real-time chunking (RTC) in vision-language-action (VLA) models—where conditioning on previously committed actions is performed via computationally expensive inpainting during execution—this work introduces *training-time action prefix conditioning*. Inference delay is simulated during training, and the sequence of already-executed actions is provided as a conditioning input, eliminating the inference-time inpainting step entirely. The approach shifts action conditioning from inference time to training time and requires no architectural or system-level changes—only a few additional lines of code. Evaluated in simulation and on real-robot tasks, including box building and espresso making with the $π_{0.6}$ VLA, the method matches inference-time RTC in task performance and speed while being computationally cheaper, and outperforms it in simulation at higher inference delays.

📝 Abstract
Real-time chunking (RTC) enables vision-language-action models (VLAs) to generate smooth, reactive robot trajectories by asynchronously predicting action chunks and conditioning on previously committed actions via inference-time inpainting. However, this inpainting method introduces computational overhead that increases inference latency. In this work, we propose a simple alternative: simulating inference delay at training time and conditioning on action prefixes directly, eliminating any inference-time overhead. Our method requires no modifications to the model architecture or robot runtime, and can be implemented with only a few additional lines of code. In simulated experiments, we find that training-time RTC outperforms inference-time RTC at higher inference delays. In real-world experiments on box building and espresso making tasks with the $π_{0.6}$ VLA, we demonstrate that training-time RTC maintains both task performance and speed parity with inference-time RTC while being computationally cheaper. Our results suggest that training-time action conditioning is a practical drop-in replacement for inference-time inpainting in real-time robot control.
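The core idea—simulating inference delay at training time and conditioning on the executed action prefix—can be sketched as a data-preparation step. This is a minimal illustration, not the paper's released code: the names (`CHUNK_LEN`, `MAX_DELAY`, `make_training_example`) and the fixed-length padded-prefix representation are assumptions.

```python
import numpy as np

CHUNK_LEN = 16   # actions per predicted chunk (illustrative value)
MAX_DELAY = 4    # worst-case inference delay, in control steps (illustrative)

def make_training_example(obs, action_chunk, rng):
    """Simulate inference delay and expose the executed prefix.

    At deployment, by the time a new chunk is ready, the first d actions
    of the previous chunk have already been executed. We mimic that at
    training time: sample d, present the first d actions as a
    conditioning prefix, and train the model to predict the full chunk
    given (obs, prefix), so the new chunk agrees with what already ran.
    """
    d = int(rng.integers(0, MAX_DELAY + 1))     # sampled inference delay
    # Pad the prefix to a fixed length with a mask so batch shapes are static.
    padded = np.zeros_like(action_chunk[:MAX_DELAY])
    mask = np.zeros(MAX_DELAY, dtype=bool)
    padded[:d] = action_chunk[:d]               # actions "already committed"
    mask[:d] = True
    # The target stays the full chunk: the model learns to reproduce the
    # prefix and continue it smoothly, removing the need for any
    # inference-time inpainting pass.
    return {"obs": obs, "prefix": padded, "prefix_mask": mask,
            "target": action_chunk}

rng = np.random.default_rng(0)
example = make_training_example(
    obs=np.zeros(8),
    action_chunk=np.arange(CHUNK_LEN * 2.0).reshape(CHUNK_LEN, 2),
    rng=rng)
```

In a real training loop this example would be batched and the prefix (with its mask) fed to the model alongside the observation, e.g. as extra conditioning tokens.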
Problem

Research questions and friction points this paper is trying to address.

How to remove the computational overhead that inference-time inpainting adds to real-time robot trajectory generation
How to condition on previously committed actions without increasing inference latency
How to preserve task performance without modifying the model architecture or robot runtime
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulating inference delay at training time
Conditioning on action prefixes directly
Eliminating inference-time computational overhead
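At deployment, these contributions mean the controller can request a new chunk with the soon-to-be-executed actions as its prefix, then splice the result in directly. The sketch below assumes a hypothetical `model.predict(obs, prefix)` interface; `DummyVLA` is a stand-in, not the paper's API.

```python
from collections import deque

class DummyVLA:
    """Stand-in for a prefix-conditioned chunk predictor."""
    def predict(self, obs, prefix, chunk_len=8):
        # A real model would reproduce the prefix and continue it
        # coherently; here we fake that behavior for illustration.
        return list(prefix) + [0.0] * (chunk_len - len(prefix))

def control_step(model, obs, queue, delay):
    """One replanning step with training-time action conditioning.

    The next `delay` queued actions will run while inference is in
    flight, so they are passed as the conditioning prefix. The returned
    chunk already agrees with them, so the old chunk's tail is simply
    replaced -- no inference-time inpainting pass is needed.
    """
    prefix = list(queue)[:delay]            # actions committed during inference
    new_chunk = model.predict(obs, prefix)  # single forward pass
    queue.clear()
    queue.extend(new_chunk)                 # keep prefix, swap in new tail
    return queue.popleft()                  # next action to execute

queue = deque([1.0, 2.0, 3.0])
action = control_step(DummyVLA(), obs=None, queue=queue, delay=2)
```

Contrast with inference-time RTC, where the freshly sampled chunk must be reconciled with the committed prefix by an extra inpainting procedure at every replanning step.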