AI Summary
This work addresses the lack of benchmarks for evaluating complex spatial reasoning and manipulation capabilities in embodied AI. We introduce KnotGym, the first interactive, image-only, goal-directed evaluation environment centered on knot manipulation. Its novelty lies in defining a quantifiable and scalable complexity axis based on knot crossing number; employing minimal visual input (single-frame RGB images) to enforce tight coupling among perception, reasoning, and control; and establishing a standardized generalization benchmark. Methodologically, we integrate physics-based rope dynamics simulation, model-based reinforcement learning, model predictive control, and chain-of-thought visual reasoning for end-to-end training. Extensive experiments reveal significant generalization bottlenecks across complexity levels in current approaches. The codebase and benchmark are publicly released, providing a reproducible, extensible platform for evaluating spatial intelligence.
Abstract
We propose KnotGym, an interactive environment for complex spatial reasoning and manipulation. KnotGym includes goal-oriented rope manipulation tasks of varying complexity, all requiring action from pure image observations. Tasks are defined along a clear, quantifiable axis of complexity based on the number of knot crossings, creating a natural generalization test. KnotGym has a simple observation space, allowing for scalable development, yet it highlights core challenges in integrating acute perception, spatial reasoning, and grounded manipulation. We evaluate methods from different classes, including model-based RL, model-predictive control, and chain-of-thought reasoning, and illustrate the challenges KnotGym presents. KnotGym is available at https://github.com/lil-lab/knotgym.
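Both sections ground task complexity in the number of knot crossings. As a rough intuition for that metric, a minimal sketch (not KnotGym's actual implementation) is below: given a 2D projection of a rope as a polyline, the crossing count is the number of proper intersections between non-adjacent segments. The function names `crossing_count` and `segments_cross` are illustrative, not part of the KnotGym API.

```python
def segments_cross(p, q, r, s):
    """Return True if segments p-q and r-s properly intersect."""
    def orient(a, b, c):
        # sign of the cross product (b - a) x (c - a)
        v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        return (v > 0) - (v < 0)
    return (orient(p, q, r) != orient(p, q, s)
            and orient(r, s, p) != orient(r, s, q))

def crossing_count(points):
    """Count self-crossings of an open 2D polyline.

    Adjacent segments share an endpoint and are skipped, so only
    genuine over/under crossings of the projected rope are counted.
    """
    segs = list(zip(points, points[1:]))
    count = 0
    for i in range(len(segs)):
        for j in range(i + 2, len(segs)):  # skip neighboring segments
            if segments_cross(*segs[i], *segs[j]):
                count += 1
    return count
```

For example, a zig-zag that doubles back over itself, `[(0, 0), (2, 2), (2, 0), (0, 2)]`, yields one crossing, while a straight rope yields zero; an actual knot-theoretic crossing number would additionally minimize over all projections.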