🤖 AI Summary
This work addresses the unreliability of industrial robots in long-horizon, multi-task scenarios with changing object distributions, where reactive control strategies are prone to error accumulation. The authors propose a plan-and-act framework built on a vision-based latent world model: it generates multiple candidate future trajectories, scores them, and executes the highest-scoring action sequence. The work is presented as the first successful deployment of a vision-based latent world model in complex, real-world industrial settings, addressing the inability of conventional reactive policies to evaluate possible futures before acting. Combined with a vision-language-action model and a trajectory generation and scoring mechanism, the method outperforms state-of-the-art baselines on four progressively complex tasks across single- and dual-arm platforms, and remains reliable in heavily cluttered, frequently occluded, and contact-rich environments.
📝 Abstract
Industrial robotic manipulation demands reliable long-horizon execution across embodiments, tasks, and changing object distributions. While Vision-Language-Action models have demonstrated strong generalization, they remain fundamentally reactive: they optimize the next action given the current observation without evaluating potential futures, which leaves them brittle to the compounding failure modes of long-horizon tasks. Cortex 2.0 shifts from reactive control to plan-and-act by generating candidate future trajectories in visual latent space, scoring them for expected success and efficiency, and then committing only to the highest-scoring candidate. We evaluate Cortex 2.0 on single-arm and dual-arm manipulation platforms across four tasks of increasing complexity: pick and place, item and trash sorting, screw sorting, and shoebox unpacking. Cortex 2.0 outperforms state-of-the-art Vision-Language-Action baselines on every task. The system remains reliable in unstructured environments characterized by heavy clutter, frequent occlusions, and contact-rich manipulation, where reactive policies fail. These results demonstrate that world-model-based planning can operate reliably in complex industrial environments.
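The plan-and-act loop described above (generate candidates, score them, commit to the best) can be illustrated with a minimal Python sketch. This is not the Cortex 2.0 implementation: the names `plan`, `rollout`, `Candidate`, `toy_dynamics`, and `toy_sample_actions` are hypothetical, the latent state is a plain list of floats rather than a learned visual latent, and the score (a success estimate minus a weighted action-magnitude cost standing in for "efficiency") is an assumption made for illustration only.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

Latent = List[float]   # stand-in for a learned visual latent state
Action = List[float]   # stand-in for a robot action


@dataclass
class Candidate:
    actions: List[Action]   # proposed action sequence
    latents: List[Latent]   # imagined latent trajectory, including the start state
    score: float            # higher is better


def rollout(z0: Latent, actions: List[Action],
            dynamics: Callable[[Latent, Action], Latent]) -> List[Latent]:
    """Imagine the future by stepping a latent dynamics model, without acting."""
    latents = [z0]
    for a in actions:
        latents.append(dynamics(latents[-1], a))
    return latents


def plan(z0: Latent,
         sample_actions: Callable[[int, random.Random], List[Action]],
         dynamics: Callable[[Latent, Action], Latent],
         success_fn: Callable[[Latent], float],
         num_candidates: int = 8,
         horizon: int = 5,
         efficiency_weight: float = 0.1,
         rng: random.Random = None) -> Candidate:
    """Generate candidate trajectories, score each, and return the best one."""
    rng = rng or random.Random(0)
    best = None
    for _ in range(num_candidates):
        actions = sample_actions(horizon, rng)
        latents = rollout(z0, actions, dynamics)
        # Assumed efficiency proxy: total action magnitude over the horizon.
        effort = sum(abs(x) for a in actions for x in a)
        score = success_fn(latents[-1]) - efficiency_weight * effort
        if best is None or score > best.score:
            best = Candidate(actions, latents, score)
    return best


def toy_dynamics(z: Latent, a: Action) -> Latent:
    """Toy latent dynamics: the action is added to the state."""
    return [zi + ai for zi, ai in zip(z, a)]


def toy_sample_actions(horizon: int, rng: random.Random, dim: int = 2) -> List[Action]:
    """Toy proposal distribution: uniform random actions in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(horizon)]


if __name__ == "__main__":
    goal = [1.0, -0.5]
    # Toy success estimate: negative L1 distance from the final latent to the goal.
    success = lambda z: -sum(abs(zi - gi) for zi, gi in zip(z, goal))
    best = plan([0.0, 0.0], toy_sample_actions, toy_dynamics, success,
                num_candidates=32, horizon=4)
    print("best score:", round(best.score, 3))
    print("first action to execute:", best.actions[0])
```

A reactive policy corresponds to the degenerate case of a single candidate with no scoring. In this sketch, the difference comes from comparing several imagined futures before any action is executed; in a real system the sampler, the dynamics model, and the scorer would be learned components.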