🤖 AI Summary
Existing chunked vision-language-action (VLA) policies struggle to keep their scene priors up to date once executed actions change the scene in multi-step robotic control, leading to decisions based on stale geometric information. This work proposes EvoScene-VLA, which introduces, for the first time within a chunked control framework, an action-driven dynamic scene belief evolution mechanism. By recursively maintaining a scene prefix, the method fuses the current observation with the scene state updated by the previous action chunk at each vision-language model (VLM) invocation, simultaneously outputting an action chunk and a compact scene update. The approach integrates a geometry-anchoring module and a scene predictor, both used only during training, while retaining only a lightweight update mechanism at deployment. Evaluated on 31 RoboTwin tasks, EvoScene-VLA achieves average success rates of 89.1% and 88.5% under fixed and randomized conditions, respectively (up from 87.2% and 86.1%), and consistently outperforms baselines on the real-world Galaxea R1-Lite robot.
📝 Abstract
Chunked vision-language-action (VLA) policies predict multi-step robot controls, conditioning each update on the current visual observation alone. Yet robot actions cause contact, occlusion, and object motion, and the geometry that later decisions depend on can change before the next visual update arrives. Spatial VLAs improve current-frame geometry. Temporal VLAs aggregate past frames. Neither maintains an action-updated scene prior across chunks. We argue for a persistent action-updated scene state across control calls, and introduce EvoScene-VLA. Its recurrent scene prefix carries a geometry-aware scene state across chunks. At each vision-language model (VLM) call, the VLM combines scene information from the current observation with the action-updated prior from the previous chunk; the action decoder outputs both the next action chunk and a compact scene update. This update becomes the next prior, which the VLM corrects against the new observation when the next call arrives. Each control call therefore starts from a scene prior that reflects both recent actions and fresh visual evidence. During training, Scene Predictor supplies future scene-token targets, and Geometric Anchor aligns scene slots with frozen depth and 3D teachers. We discard both modules at deployment. On 31 RoboTwin tasks, EvoScene-VLA raises average success from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation. On the Galaxea R1-Lite real robot, EvoScene-VLA outperforms all baselines.
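The recurrence described in the abstract can be illustrated with a minimal sketch of the deployment-time control loop. This is not the authors' implementation: `ChunkOutput`, `run_chunked_control`, and `toy_policy` are hypothetical names, the scene state is reduced to a plain list of floats, and the toy policy's averaging and constant offset are arbitrary stand-ins for the VLM fusion and the action decoder's scene update.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class ChunkOutput:
    """Result of one control call (hypothetical container)."""
    actions: List[Vector]  # the next action chunk
    scene_update: Vector   # compact scene update, used as the next prior


def run_chunked_control(
    policy: Callable[[Vector, Vector], ChunkOutput],
    observe: Callable[[], Vector],
    execute: Callable[[Vector], None],
    initial_prior: Vector,
    num_calls: int,
) -> List[Vector]:
    """Run `num_calls` control calls and return the prior seen by each call."""
    prior = initial_prior
    priors_seen = []
    for _ in range(num_calls):
        observation = observe()            # fresh visual evidence
        priors_seen.append(prior)
        out = policy(observation, prior)   # fuse observation with the prior
        for action in out.actions:         # execute the whole chunk
            execute(action)
        prior = out.scene_update           # the update becomes the next prior
    return priors_seen


def toy_policy(observation: Vector, prior: Vector) -> ChunkOutput:
    """Toy stand-in (assumption): average observation and prior, then shift."""
    fused = [0.5 * (o + p) for o, p in zip(observation, prior)]
    actions = [list(fused), list(fused)]   # a chunk of two identical actions
    scene_update = [f + 1.0 for f in fused]  # pretend effect of the actions
    return ChunkOutput(actions=actions, scene_update=scene_update)
```

In this sketch the only state carried between calls is `prior`, mirroring the claim that each control call starts from a scene state shaped by both the previous chunk and the new observation. The training-only modules (Scene Predictor, Geometric Anchor) have no counterpart here because they are discarded at deployment.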