🤖 AI Summary
This work addresses the limitations of existing dense visual prediction approaches, which often suffer from error accumulation and planning drift, as well as of sparse methods that lack kinematic alignment and thus compromise execution consistency. To overcome these issues, the authors propose StructVLA, a novel framework that introduces structured frames—sparse, physically meaningful keyframes derived from gripper state transitions and kinematic inflection points—as intermediate planning representations within a generative world model. Through a two-stage training scheme and a unified discrete token vocabulary, StructVLA preserves semantic conciseness while ensuring action executability. The method achieves average success rates of 75.0% on SimplerEnv-WidowX and 94.8% on LIBERO, demonstrating high reliability and strong generalization on both basic manipulation and complex long-horizon tasks, with real-world validation confirming its practical efficacy.
📝 Abstract
Recent world-model-based Vision-Language-Action (VLA) architectures have improved robotic manipulation through predictive visual foresight. However, dense future prediction introduces visual redundancy and accumulates errors, causing long-horizon plan drift. Meanwhile, recent sparse methods typically represent visual foresight using high-level semantic subtasks or implicit latent states. These representations often lack explicit kinematic grounding, weakening the alignment between planning and low-level execution. To address this, we propose StructVLA, which reformulates a generative world model into an explicit structured planner for reliable control. Instead of dense rollouts or semantic goals, StructVLA predicts sparse, physically meaningful structured frames. Derived from intrinsic kinematic cues (e.g., gripper transitions and kinematic turning points), these frames capture spatiotemporal milestones closely aligned with task progress. We implement this approach through a two-stage training paradigm with a unified discrete token vocabulary: the world model is first trained to predict structured frames and subsequently optimized to map the structured foresight into low-level actions. This approach provides clear physical guidance and bridges visual planning and motion control. In our experiments, StructVLA achieves strong average success rates of 75.0% on SimplerEnv-WidowX and 94.8% on LIBERO. Real-world deployments further demonstrate reliable task completion and robust generalization across both basic pick-and-place and complex long-horizon tasks.
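To make the "structured frame" idea concrete, here is a rough illustrative sketch of how sparse keyframes could be selected from a demonstration trajectory using the two kinematic cues the abstract names: gripper state transitions and kinematic turning points (large changes in motion direction). This is an assumption-laden sketch, not the paper's actual implementation; the function name, the binary gripper encoding, and the angle threshold are all hypothetical choices for illustration.

```python
import numpy as np

def extract_structured_frames(positions, gripper_states, angle_thresh_deg=30.0):
    """Illustrative keyframe selection from a trajectory (not the paper's code).

    positions:      (T, 3) array of end-effector positions
    gripper_states: (T,) binary array (e.g., 0 = open, 1 = closed)
    Returns a sorted list of keyframe indices.
    """
    T = len(positions)
    keys = {0, T - 1}  # always keep the start and end of the trajectory

    # Cue 1: gripper state transitions (open -> closed or closed -> open)
    for t in range(1, T):
        if gripper_states[t] != gripper_states[t - 1]:
            keys.add(t)

    # Cue 2: kinematic turning points, detected as a large angle between
    # consecutive displacement vectors of the end effector
    v = np.diff(positions, axis=0)          # (T-1, 3) displacement vectors
    norms = np.linalg.norm(v, axis=1)
    for t in range(1, T - 1):
        if norms[t - 1] < 1e-6 or norms[t] < 1e-6:
            continue                        # skip near-stationary segments
        cos = np.dot(v[t - 1], v[t]) / (norms[t - 1] * norms[t])
        angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
        if angle > angle_thresh_deg:
            keys.add(t)

    return sorted(keys)

# Toy trajectory: move along +x, turn 90 degrees onto +y, close gripper mid-way
pos = np.array([[x, 0.0, 0.0] for x in range(6)] +
               [[5.0, y, 0.0] for y in range(1, 6)])
grip = np.array([0] * 7 + [1] * 4)
print(extract_structured_frames(pos, grip))  # → [0, 5, 7, 10]
```

In this toy example the selected indices correspond to the trajectory endpoints, the 90-degree turning point, and the gripper-closing step — exactly the kind of sparse, physically grounded milestones the abstract describes, as opposed to a dense per-step rollout.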