🤖 AI Summary
This work addresses the poor scalability of high-level planning in robotic visual rearrangement tasks, which traditionally relies on manually designed abstract representations. We propose an end-to-end method for automatically learning discrete, graph-structured abstractions from raw visual input. Our core contribution is the first integration of vision-guided graph coloring with the inherent bipartite structure of rearrangement problems: visual distances are modeled via attention mechanisms, and structural constraints are imposed to regularize the coloring process—enabling automatic discovery of semantically meaningful, planning-friendly discrete states. The approach employs a vision encoder to extract embeddings, requiring no human priors or symbolic annotations. Evaluated on two simulated rearrangement benchmarks, the learned abstractions are robust and stable, significantly improving high-level planning efficiency and outperforming existing automated abstraction-learning methods.
📝 Abstract
Learning abstractions directly from data is a core challenge in robotics. Humans naturally operate at an abstract level, reasoning over high-level subgoals while delegating execution to low-level motor skills -- an ability that enables efficient problem solving in complex environments. In robotics, abstractions and hierarchical reasoning have long been central to planning, yet they are typically hand-engineered, demanding significant human effort and limiting scalability. Automating the discovery of useful abstractions directly from visual data would make planning frameworks more scalable and more applicable to real-world robotic domains. In this work, we focus on rearrangement tasks where the state is represented with raw images, and propose a method to induce discrete, graph-structured abstractions by combining structural constraints with an attention-guided visual distance. Our approach leverages the inherent bipartite structure of rearrangement problems, integrating structural constraints and visual embeddings into a unified framework. This enables the autonomous discovery of abstractions from vision alone, which can subsequently support high-level planning. We evaluate our method on two rearrangement tasks in simulation and show that it consistently identifies meaningful abstractions that facilitate effective planning and outperform existing approaches.