DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies

📅 2026-05-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
Asynchronous inference severely degrades vision-language-action (VLA) policies because each action chunk is conditioned on an observation that is already several control steps old by the time it executes. This work proposes an offline post-training method that converts inference latency into an unsupervised preference signal: counterfactual fresh/stale action pairs are constructed from a frozen reference policy and scored with a flow-matching likelihood-ratio surrogate, so the policy is fine-tuned without human annotations, reward models, or online interaction. The approach improves VLA robustness in the high-latency regime (5-7 control steps), with a +6.4 success-rate gain in simulation, +4.6 when transferred to a real-scale VLA at the longest delay, and consistent gains on two real-robot tasks.
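To make the state misalignment concrete, here is a toy 1-D sketch. It is not the paper's Kinetix setup: the constant-speed target, the `tracking_error` helper, and the assumption that a late chunk is executed from its first action are all illustrative choices.

```python
import numpy as np

def tracking_error(delay_steps: int, chunk_len: int = 8, target_speed: float = 1.0) -> float:
    """Mean absolute error when a chunk planned at step 0 starts executing
    delay_steps control steps later (hypothetical toy model)."""
    steps = np.arange(chunk_len)
    # Positions the chunk was planned to hit, based on the observation at step 0.
    planned = target_speed * steps
    # Positions the target actually occupies once execution begins late.
    actual = target_speed * (steps + delay_steps)
    return float(np.mean(np.abs(planned - actual)))

for delay in (0, 5, 7):
    print(delay, tracking_error(delay))
```

In this toy model the error grows linearly with the delay, which is the kind of drift a delay-robust policy has to absorb.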
📝 Abstract
Vision-Language-Action (VLA) policies are typically deployed with asynchronous inference: the robot executes a previously predicted action chunk while the model computes the next one. This creates a prediction-execution misalignment: the chunk is conditioned on the observation taken before inference began, but executes in a physical state that has already drifted forward by several control steps; naive asynchronous rollover collapses from 89% to under 1% on Kinetix as the inference cycle covers up to seven control steps. We introduce DEFLECT, a fully offline post-training refinement that applies as a near drop-in upgrade to existing async-VLA stacks by converting latency itself into a label-free preference signal: counterfactual fresh/stale action pairs are constructed from a frozen reference policy and scored under the deployment-time conditioning via an implicit flow-matching likelihood-ratio surrogate, with no human labels, reward models, or online rollouts. DEFLECT substantially extends the usable delay envelope of async VLA control, with +6.4 success-rate gain in the high-latency regime (5-7 control steps), +4.6 when transferred to a real-scale VLA at the longest delay, and consistent improvements on two real-robot tasks (a bimanual conveyor pick-and-place and a reactive whack-a-mole).
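The abstract does not spell out the training objective. As a rough illustration only, the sketch below shows one common way to turn flow-matching regression errors into an implicit likelihood-ratio preference loss, in the style of DPO objectives for diffusion models. The names `fm_error` and `preference_loss`, the `velocity_fn(x, tau, obs)` signature, the straight-line noise-to-action path, and `beta` are assumptions, not the authors' implementation.

```python
import numpy as np

def fm_error(velocity_fn, obs, action, noise, tau):
    """Flow-matching regression error for one action chunk (assumed linear path)."""
    x_tau = (1.0 - tau) * noise + tau * action   # point on the noise -> action path
    target = action - noise                      # velocity of that straight path
    pred = velocity_fn(x_tau, tau, obs)
    return float(np.sum((pred - target) ** 2))

def preference_loss(policy_fn, ref_fn, obs, fresh, stale, noise, tau, beta=1.0):
    """DPO-style loss that prefers the fresh chunk over the stale one.

    A lower flow-matching error stands in for a higher likelihood, so the
    error gap between policy and frozen reference acts as a log-ratio surrogate.
    """
    gap_fresh = fm_error(policy_fn, obs, fresh, noise, tau) - fm_error(ref_fn, obs, fresh, noise, tau)
    gap_stale = fm_error(policy_fn, obs, stale, noise, tau) - fm_error(ref_fn, obs, stale, noise, tau)
    margin = -beta * (gap_fresh - gap_stale)
    return float(np.logaddexp(0.0, -margin))     # equals -log(sigmoid(margin))

# Hypothetical usage: a reference that predicts zero velocity, and a policy
# that already fits the fresh chunk exactly.
noise = np.zeros(2)
fresh = np.array([1.0, 0.0])
stale = np.array([0.0, 1.0])
ref_fn = lambda x, tau, obs: np.zeros_like(x)
policy_fn = lambda x, tau, obs: fresh - noise
print(preference_loss(policy_fn, ref_fn, None, fresh, stale, noise, tau=0.5))
```

When the policy equals the reference the loss is log 2, and it falls below that once the policy fits the fresh chunk better than the stale one, relative to the reference.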
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
asynchronous inference
prediction-execution misalignment
control delay
robotic policy
Innovation

Methods, ideas, or system contributions that make the work stand out.

delay-robustness
flow-matching
counterfactual tuning
asynchronous VLA
label-free preference