🤖 AI Summary
Autonomous tidying of cluttered household objects remains challenging due to the subjectivity and context-dependence of “tidiness,” which defies rigid geometric or pose-based definitions.
Method: We propose a robot imitation learning framework that models tidiness as a generalizable visual-semantic concept—rather than a fixed target configuration—using NLP-inspired self-supervised contrastive learning. Our approach employs a Transformer-based sequential placement predictor, integrated with RGB-based visual perception and closed-loop robotic arm control, enabling knolling-style organization of arbitrary numbers and shapes of objects.
Contribution/Results: The method achieves zero-shot generalization to unseen objects and layouts, supports multi-solution generation, and incorporates human preferences via learnable reward modulation. Experiments on a real robotic platform demonstrate a 42% improvement in user-preference alignment success rate, effectively overcoming dual bottlenecks in subjective aesthetic modeling and dynamic task adaptation.
📝 Abstract
Organizing scattered items in domestic spaces is challenging because tidiness is diverse and subjective. Just as human language allows multiple expressions of the same idea, household tidiness preferences and organizational patterns vary widely, so presetting object locations would limit adaptability to new objects and environments. Inspired by advances in natural language processing (NLP), this paper introduces a self-supervised learning framework that allows robots to understand and replicate the concept of tidiness from demonstrations of well-organized layouts, akin to using conversational datasets to train large language models (LLMs). We use a transformer neural network to predict the placement of subsequent objects. We demonstrate a “knolling” system with a robotic arm and an RGB camera that organizes items of varying sizes and quantities on a table. Our method not only learns a generalizable concept of tidiness, enabling the model to provide diverse solutions and adapt to different numbers of objects, but can also incorporate human preferences to generate customized tidy tables without explicit target positions for each object.
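The sequential placement idea in the abstract can be illustrated with a minimal sketch. It shows only the autoregressive interface: each object's position is predicted from the objects already placed and then fed back as context for the next one. The paper uses a trained transformer for the prediction step; here a hand-written row-packing rule stands in for it. All names (`Predictor`, `row_packing_predictor`, `place_sequentially`), the normalized table width, and the gap value are assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, List, Tuple

Size = Tuple[float, float]                     # (width, height) of an object footprint
Placement = Tuple[float, float, float, float]  # (x, y, width, height); x, y = lower-left corner

# Hypothetical interface: a predictor maps the already placed objects plus
# the next object's size to that object's position.
Predictor = Callable[[List[Placement], Size], Tuple[float, float]]


def row_packing_predictor(placed: List[Placement], size: Size,
                          table_width: float = 1.0,
                          gap: float = 0.02) -> Tuple[float, float]:
    """Stand-in for a learned model (assumption): fill a row left to right,
    then wrap to a new row above the tallest object of the current row."""
    if not placed:
        return (0.0, 0.0)
    last_x, last_y, last_w, _ = placed[-1]
    x = last_x + last_w + gap
    if x + size[0] <= table_width:
        return (x, last_y)
    # Objects in the current row share the same y as the last placed object.
    row_top = max(py + ph for _, py, _, ph in placed if py == last_y)
    return (0.0, row_top + gap)


def place_sequentially(sizes: List[Size],
                       predictor: Predictor = row_packing_predictor) -> List[Placement]:
    """Autoregressive loop: every prediction is appended to the context
    that the predictor sees when placing the next object."""
    placed: List[Placement] = []
    for size in sizes:
        x, y = predictor(placed, size)
        placed.append((x, y, size[0], size[1]))
    return placed


if __name__ == "__main__":
    for placement in place_sequentially([(0.4, 0.1), (0.4, 0.2), (0.4, 0.1)]):
        print(placement)
```

Because the loop only depends on the `Predictor` signature, a learned model could in principle be swapped in for the rule without changing `place_sequentially`, and the same loop handles any number of objects.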