$π_0$-EqM: Equilibrium Matching for Closed-Loop Vision-Language-Action Control

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action (VLA) models typically use action decoders with a fixed number of sampling steps, so they cannot adjust computation to the current state or reuse computation across control cycles. This work replaces the flow-matching decoder in π₀ with an Equilibrium Matching (EqM) decoder, which introduces energy-based action generation while leaving the upstream VLA architecture unchanged. The design allows flexible iterative inference and exposes a task-dependent, non-monotonic relationship between residual and task success, termed the “stationarity-executability gap”, which suggests that inference depth should be treated as part of policy design. Under a matched 300-step inference budget, the method raises average success on RoboTwin from 40.4% to 50.2% across 19 tasks and remains competitive on LIBERO, with its clearest gain on LIBERO-10 (87.0%).

📝 Abstract

Vision-Language-Action (VLA) models have become the most widely adopted paradigm for robotic manipulation because of their potential for task generalization. However, most generative flow-matching action decoders for VLA control are deployed with fixed sampling horizons, which limits state-dependent compute and temporal reuse across control cycles. We present $π_0$-EqM, which replaces the flow-matching expert in $π_0$ with an Equilibrium Matching (EqM) decoder while leaving the upstream VLA stack unchanged. Under a matched 300-step budget, $π_0$-EqM improves RoboTwin average success from 40.4% to 50.2% across 19 tasks and remains competitive on LIBERO, with its clearest gain on LIBERO-10 (87.0%). Two threshold scans reveal a task-dependent, non-monotonic relation between residual and success, which we term the stationarity-executability gap. The results suggest that inference depth in iterative VLA control is part of policy design, and they introduce an energy-based VLA perspective that may inform future work on composable action generation across tasks and embodiments.
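
The abstract does not spell out the sampler, so the following is only a toy sketch of what residual-thresholded iterative decoding with reuse across control cycles could look like. It assumes plain gradient descent on a hand-written quadratic field standing in for the learned EqM decoder. The names `eqm_decode`, `toy_field`, `tol`, and `step_size` are hypothetical and are not taken from the paper or from any $π_0$ codebase.

```python
import numpy as np


def eqm_decode(grad_fn, a_init, step_size=0.1, max_steps=300, tol=1e-3):
    """Descend a gradient field until its norm (the residual) drops below tol.

    Returns (action, steps_used, final_residual). All names are hypothetical.
    """
    a = np.asarray(a_init, dtype=float).copy()
    for step in range(max_steps):
        g = grad_fn(a)
        residual = float(np.linalg.norm(g))
        if residual < tol:          # stationary enough: stop early
            return a, step, residual
        a = a - step_size * g       # plain gradient-descent update
    return a, max_steps, float(np.linalg.norm(grad_fn(a)))


def toy_field(target):
    """Stand-in for a learned field: gradient of 0.5 * ||a - target||^2."""
    target = np.asarray(target, dtype=float)
    return lambda a: a - target


# Control cycle 1: cold start from zeros.
target_1 = np.array([1.0, -0.5, 0.25])
a1, steps_cold, res_cold = eqm_decode(toy_field(target_1), np.zeros(3))

# Control cycle 2: the scene moved slightly; reuse the previous action as the start.
target_2 = target_1 + 0.05
a2, steps_warm, res_warm = eqm_decode(toy_field(target_2), a1)

# Same cycle under a tight budget: the loop stops at max_steps with a larger residual.
a3, steps_capped, res_capped = eqm_decode(toy_field(target_2), np.zeros(3), max_steps=5)

print(steps_cold, steps_warm, steps_capped)
```

In this toy field a smaller residual always means an action closer to the target, and the warm-started cycle needs fewer steps than the cold start. The stationarity-executability gap reported in the abstract indicates that on real tasks a lower residual does not necessarily mean higher success, so the threshold and step budget would be tuning choices rather than values to minimize.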

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

flow-matching

action decoding

control horizon

state-dependent computation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Equilibrium Matching

Vision-Language-Action

Iterative Inference