Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

📅 2026-04-22

🤖 AI Summary
This work addresses the brittleness of industrial robots in long-horizon, multi-task scenarios with changing object distributions, where reactive control strategies are prone to error accumulation. The authors propose a plan-and-act framework grounded in a vision-based latent world model: it generates multiple candidate future trajectories, scores them, and executes the highest-scoring action sequence. This approach marks the first successful deployment of a vision-based latent world model in real-world, complex industrial settings, overcoming fundamental limitations of conventional reactive policies. Integrated with a vision-language-action model and a trajectory generation and scoring mechanism, the method consistently outperforms state-of-the-art approaches across four progressively complex tasks on both single- and dual-arm platforms, and it remains reliable in highly cluttered, frequently occluded, and contact-intensive environments.

📝 Abstract
Industrial robotic manipulation demands reliable long-horizon execution across embodiments, tasks, and changing object distributions. While Vision-Language-Action models have demonstrated strong generalization, they remain fundamentally reactive. By optimizing the next action given the current observation without evaluating potential futures, they are brittle to the compounding failure modes of long-horizon tasks. Cortex 2.0 shifts from reactive control to plan-and-act by generating candidate future trajectories in visual latent space, scoring them for expected success and efficiency, then committing only to the highest-scoring candidate. We evaluate Cortex 2.0 on single-arm and dual-arm manipulation platforms across four tasks of increasing complexity: pick and place, item and trash sorting, screw sorting, and shoebox unpacking. Cortex 2.0 consistently outperforms state-of-the-art Vision-Language-Action baselines, achieving the best results across all tasks. The system remains reliable in unstructured environments characterized by heavy clutter, frequent occlusions, and contact-rich manipulation, where reactive policies fail. These results demonstrate that world-model-based planning can operate reliably in complex industrial environments.
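
The plan-and-act loop described in the abstract (generate candidate futures, score them, commit to the best one) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the 2-D "latent" state, the `toy_dynamics` and `toy_score` functions, the goal point, and the use of random shooting to sample candidates are hypothetical stand-ins for a learned world model and scorer.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Latent = Tuple[float, float]   # hypothetical 2-D latent state
Action = Tuple[float, float]   # hypothetical 2-D action


def rollout(z0: Latent, actions: Sequence[Action],
            dynamics: Callable[[Latent, Action], Latent]) -> List[Latent]:
    """Imagine the latent states reached by applying `actions` from `z0`."""
    states = [z0]
    for a in actions:
        states.append(dynamics(states[-1], a))
    return states


def plan(z0: Latent,
         dynamics: Callable[[Latent, Action], Latent],
         score: Callable[[Sequence[Latent]], float],
         num_candidates: int = 64,
         horizon: int = 5,
         rng: Optional[random.Random] = None) -> Tuple[List[Action], float]:
    """Sample candidate action sequences, score each imagined trajectory,
    and return the highest-scoring sequence with its score."""
    if rng is None:
        rng = random.Random(0)
    best_actions: List[Action] = []
    best_score = float("-inf")
    for _ in range(num_candidates):
        actions = [(rng.uniform(-1, 1), rng.uniform(-1, 1))
                   for _ in range(horizon)]
        s = score(rollout(z0, actions, dynamics))
        if s > best_score:
            best_actions, best_score = actions, s
    return best_actions, best_score


# --- Toy stand-ins (assumptions, not the paper's models) ---
GOAL: Latent = (1.0, 1.0)


def toy_dynamics(z: Latent, a: Action) -> Latent:
    """Each action nudges the latent state by half its magnitude."""
    return (z[0] + 0.5 * a[0], z[1] + 0.5 * a[1])


def toy_score(states: Sequence[Latent]) -> float:
    """Success proxy: closeness of the final state to GOAL.
    Efficiency proxy: small penalty on total path length."""
    final = states[-1]
    dist = ((final[0] - GOAL[0]) ** 2 + (final[1] - GOAL[1]) ** 2) ** 0.5
    path = sum(((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2) ** 0.5
               for a, b in zip(states, states[1:]))
    return -dist - 0.1 * path


if __name__ == "__main__":
    actions, best = plan((0.0, 0.0), toy_dynamics, toy_score)
    print("best score:", best)
    print("first action to execute:", actions[0])
```

A reactive policy would correspond to choosing an action from the current state alone; the sketch differs only in that each candidate is rolled forward and compared before any action is committed.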
Problem

Research questions and friction points this paper is trying to address.

industrial robotic manipulation
long-horizon tasks
reactive control
compounding failure modes
world models
Innovation

Methods, ideas, or system contributions that make the work stand out.

world model
plan-and-act
visual latent space
long-horizon manipulation
industrial robotics
Authors

Adriana Aida
Sereact GmbH
Walida Amer
Sereact GmbH
Katarina Bankovic
Sereact GmbH
Dhruv Behl
Sereact GmbH
Fabian Busch
Sereact GmbH
Annie Bhalla
Sereact GmbH
Minh Duong
Sereact GmbH
Florian Gienger
Sereact GmbH
Rohan Godse
Sereact GmbH
Denis Grachev
Sereact GmbH
Ralf Gulde
University of Stuttgart
AI, Robotics, LLMs
Elisa Hagensieker
Sereact GmbH
Junpeng Hu
Sereact GmbH
Shivam Joshi
Sereact GmbH
Tobias Knoblauch
Sereact GmbH
Likith Kumar
Sereact GmbH
Damien LaRocque
Université Laval
Field Robotics
Keerthana Lokesh
Sereact GmbH
Omar Moured
Karlsruhe Institute of Technology
Computer Vision, Vision-Language Models, Document Analysis, Assistive Tech
Khiem Nguyen
Sereact GmbH
Christian Preyss
Sereact GmbH
Ranjith Sriganesan
Sereact GmbH
Vikram Singh
Sereact GmbH
Carsten Sponner
Sereact GmbH
Anh Tong
Korea University
Bayesian Inference, Gaussian Processes, Neural Differential Equations