Leave No Observation Behind: Real-time Correction for VLA Action Chunks

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the loss of reactivity in Vision-Language-Action (VLA) models caused by action chunking under inference latency, this paper proposes the Asynchronous Action Chunk Correction (A2C2) framework. A2C2 requires no retraining of the base model; instead, it introduces a lightweight correction head that refines each step of a predicted action chunk by fusing the latest visual observation, the base chunk prediction, a positional encoding of the step's index within the chunk, and features from the base policy, producing a time-aware per-step correction. The framework is compatible with asynchronous execution schemes such as Real-Time Chunking (RTC), significantly improving closed-loop responsiveness. On the Kinetix and LIBERO Spatial benchmarks, A2C2 improves success rates by +23 and +7 percentage points over RTC, respectively. It also improves robustness on long-horizon tasks even without injected latency, while adding negligible computational overhead.

📝 Abstract
To improve efficiency and temporal coherence, Vision-Language-Action (VLA) models often predict action chunks; however, action chunking harms reactivity under inference delay and long horizons. We introduce Asynchronous Action Chunk Correction (A2C2), a lightweight real-time correction head that runs at every control step and adds a time-aware correction to any off-the-shelf VLA's action chunk. The module combines the latest observation, the predicted action from the VLA (the base action), a positional feature encoding the index of the base action within the chunk, and features from the base policy, and outputs a per-step correction. This preserves the base model's competence while restoring closed-loop responsiveness. The approach requires no retraining of the base policy and is orthogonal to asynchronous execution schemes such as Real-Time Chunking (RTC). On the dynamic Kinetix task suite (12 tasks) and LIBERO Spatial, our method yields consistent success-rate improvements across increasing delays and execution horizons (+23 and +7 percentage points, respectively, compared to RTC), and also improves robustness over long horizons even with zero injected delay. Since the correction head is small and fast, its overhead is minimal compared to the inference cost of large VLA models. These results indicate that A2C2 is an effective plug-in mechanism for deploying high-capacity chunking policies in real-time control.
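The abstract's description of the correction head can be sketched as a small residual MLP. This is an illustrative reconstruction, not the paper's implementation: all dimensions, the sinusoidal positional encoding, and the random (untrained) weights are assumptions; the point is only the input fusion and the residual output.

```python
import numpy as np

# Sketch of an A2C2-style correction head (sizes and encoding are assumed).
# It fuses the latest observation, the base action for chunk index k, a
# positional encoding of k, and base-policy features, then outputs a delta.

OBS_DIM, ACT_DIM, FEAT_DIM, CHUNK_LEN, HIDDEN = 16, 4, 8, 10, 32
rng = np.random.default_rng(0)

def positional_encoding(k: int, dim: int = 8) -> np.ndarray:
    """Sinusoidal encoding of the action's index k within the chunk."""
    i = np.arange(dim // 2)
    freqs = k / (10_000 ** (2 * i / dim))
    return np.concatenate([np.sin(freqs), np.cos(freqs)])

# Randomly initialized two-layer MLP; a trained head would learn these weights.
W1 = rng.normal(0, 0.1, (OBS_DIM + ACT_DIM + 8 + FEAT_DIM, HIDDEN))
W2 = rng.normal(0, 0.1, (HIDDEN, ACT_DIM))

def correction_head(obs, base_action, k, policy_feats):
    """Return a corrected action for chunk index k given the latest observation."""
    x = np.concatenate([obs, base_action, positional_encoding(k), policy_feats])
    delta = np.tanh(x @ W1) @ W2      # per-step, time-aware correction
    return base_action + delta        # residual form preserves base competence

obs = rng.normal(size=OBS_DIM)
chunk = rng.normal(size=(CHUNK_LEN, ACT_DIM))   # base VLA action chunk
feats = rng.normal(size=FEAT_DIM)
corrected = np.stack([correction_head(obs, chunk[k], k, feats)
                      for k in range((CHUNK_LEN))])
print(corrected.shape)  # (10, 4)
```

The residual structure matters: with a near-zero correction the system falls back to the base chunk, which is why the head can be small and trained without touching the base policy.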
Problem

Research questions and friction points this paper is trying to address.

Mitigates the reactivity loss caused by action chunking in Vision-Language-Action models during real-time control
Improves reactivity and robustness under inference delays and long horizons
Provides plug-in correction without retraining base policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time correction head for action chunks
Uses latest observation and positional features
No retraining required for base policy
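The innovation bullets above amount to a specific placement in the control loop: the slow VLA re-plans a chunk only occasionally, while the cheap correction head runs at every step on the latest observation. The following is a hypothetical toy loop illustrating that placement; `vla_predict_chunk` and `correct` are stand-in stubs, not the paper's components.

```python
import numpy as np

rng = np.random.default_rng(1)
ACT_DIM, CHUNK_LEN, STEPS = 4, 10, 30

def vla_predict_chunk(obs):
    """Stand-in for a slow VLA call that returns an open-loop action chunk."""
    return np.tile(obs[:ACT_DIM], (CHUNK_LEN, 1))

def correct(obs, base_action, k):
    """Stand-in correction head: nudge the stale base action toward the
    latest observation, more strongly the deeper we are into the chunk."""
    return base_action + 0.1 * (k / CHUNK_LEN) * (obs[:ACT_DIM] - base_action)

obs = rng.normal(size=8)
executed = []
for t in range(STEPS):
    k = t % CHUNK_LEN
    if k == 0:                                   # re-plan once per chunk (slow)
        chunk = vla_predict_chunk(obs)
    executed.append(correct(obs, chunk[k], k))   # cheap, runs every step
    obs = obs + rng.normal(scale=0.05, size=8)   # environment keeps drifting

print(len(executed))  # 30
```

Because the base chunk grows stale as the environment drifts, later chunk indices benefit most from correction, which is consistent with the paper's use of a positional feature for the index within the chunk.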