🤖 AI Summary
This work addresses a limitation of existing vision-language-action (VLA) models: their flow-matching action decoders are typically run with a fixed number of sampling steps, so compute cannot adapt to the current state and earlier solutions are not reused across control cycles. The authors propose replacing the flow-matching decoder in π₀ with an Equilibrium Matching (EqM) decoder, introducing an energy-based action generation mechanism while leaving the upstream VLA architecture unchanged. This design enables flexible iterative inference. Threshold scans reveal a task-dependent, non-monotonic relationship between residual and task success, termed the "stationarity-executability gap", which suggests that inference depth should be treated as part of policy design. Under a matched 300-step inference budget, the method improves average success on RoboTwin from 40.4% to 50.2% across 19 tasks and remains competitive on LIBERO, with its clearest gain on LIBERO-10 (87.0%).
📝 Abstract
Vision-Language-Action (VLA) models have become the most widely adopted paradigm for robotic manipulation because of their potential for task generalization. However, most generative flow-matching action decoders for VLA control are deployed with fixed sampling horizons, which limits state-dependent compute and temporal reuse across control cycles. We present $π_0$-EqM, which replaces the flow-matching expert in $π_0$ with an Equilibrium Matching (EqM) decoder while leaving the upstream VLA stack unchanged. Under a matched 300-step budget, $π_0$-EqM improves RoboTwin average success from 40.4% to 50.2% across 19 tasks and remains competitive on LIBERO, with its clearest gain on LIBERO-10 (87.0%). Two threshold scans reveal a task-dependent, non-monotonic relation between residual and success, which we term the stationarity--executability gap. The results suggest that inference depth in iterative VLA control is part of policy design, and they introduce an energy-based VLA perspective that may inform future work on composable action generation across tasks and embodiments.
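To make the idea of state-dependent inference depth concrete, the following is a minimal sketch and not the authors' implementation. It assumes the decoder exposes a gradient field over actions, that sampling is plain gradient descent on that field, and that iteration stops once the gradient norm (the residual) falls below a threshold or the step budget runs out. The names `descend_to_equilibrium` and `make_toy_grad`, the step size, the tolerance, and the quadratic toy energy are all hypothetical stand-ins for a learned model.

```python
import numpy as np


def make_toy_grad(target):
    """Hypothetical stand-in for a learned gradient field.

    Uses the quadratic energy E(a) = 0.5 * ||a - target||^2,
    whose gradient is simply (a - target).
    """
    def grad_field(action):
        return action - target
    return grad_field


def descend_to_equilibrium(grad_field, action, step_size=0.1,
                           max_steps=300, residual_tol=1e-3):
    """Gradient descent with a residual-based early stop.

    Returns the final action, the number of update steps taken,
    and the final residual (gradient norm).
    """
    for step in range(max_steps):
        grad = grad_field(action)
        residual = float(np.linalg.norm(grad))
        if residual < residual_tol:
            return action, step, residual
        action = action - step_size * grad
    final_residual = float(np.linalg.norm(grad_field(action)))
    return action, max_steps, final_residual


# Control cycle 1: cold start from a zero action.
target_1 = np.array([0.5, -0.2, 0.1])
cold_action, cold_steps, cold_residual = descend_to_equilibrium(
    make_toy_grad(target_1), np.zeros(3))

# Control cycle 2: the equilibrium has shifted slightly, and the
# previous solution is reused as the starting point (warm start).
target_2 = target_1 + np.array([0.05, 0.0, 0.0])
warm_action, warm_steps, warm_residual = descend_to_equilibrium(
    make_toy_grad(target_2), cold_action)

print(f"cold start: {cold_steps} steps, residual {cold_residual:.2e}")
print(f"warm start: {warm_steps} steps, residual {warm_residual:.2e}")
```

In this toy setting the warm-started cycle needs fewer steps than the cold start, which illustrates the kind of temporal reuse a fixed sampling horizon cannot exploit. The stationarity--executability gap reported above is a caution for this stopping rule: a smaller residual does not necessarily mean higher task success, so `residual_tol` and `max_steps` would need to be tuned per task rather than simply minimized.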