GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning

πŸ“… 2026-03-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of robotic cloth manipulation in realistic scenarios where garments are often stacked, a condition poorly handled by existing methods that typically assume isolated single garments. The paper proposes a novel approach that integrates vision-language reasoning with visual affordance perception to enable safe and accurate extraction of target garments from cluttered stacks under natural language instructions. Key innovations include the first deep fusion of vision-language models (VLMs) with affordance-aware perception, the incorporation of SAM2 for high-fidelity instance segmentation enhanced by a mask fine-tuning mechanism for improved state estimation, and a dual-arm cooperative grasping strategy tailored for large or highly deformable garments. Extensive experiments in both simulation and real-world settings demonstrate the method's robustness and high success rate in complex stacking configurations, establishing a solid foundation for downstream tasks such as folding and hanging.
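Read together, the summary describes a three-stage control flow: instance segmentation of the pile, VLM-based selection of the instructed garment, and affordance-scored grasp selection with a dual-arm fallback. The sketch below is a minimal, hypothetical rendering of that flow, not the paper's actual interfaces: `segment`, `select_target`, `affordance_map`, the `sag_threshold`, and the suppression radius are all placeholder assumptions standing in for the learned components.

```python
from typing import Callable, List
import numpy as np

def retrieve_garment(
    rgb: np.ndarray,
    instruction: str,
    segment: Callable[[np.ndarray], List[np.ndarray]],          # e.g. a SAM2 wrapper
    select_target: Callable[[np.ndarray, List[np.ndarray], str], int],  # VLM reasoning
    affordance_map: Callable[[np.ndarray, np.ndarray], np.ndarray],     # per-pixel grasp scores
    sag_threshold: float = 0.5,
) -> dict:
    """One retrieval attempt over a garment pile (hypothetical sketch)."""
    masks = segment(rgb)                          # instance masks for each garment in the pile
    idx = select_target(rgb, masks, instruction)  # VLM picks which garment the instruction names
    target = masks[idx].astype(bool)

    # Score grasp points, restricted to the target garment's mask.
    scores = np.where(target, affordance_map(rgb, target), -np.inf)
    primary = np.unravel_index(np.argmax(scores), scores.shape)

    grasps = [primary]
    # Dual-arm fallback: if no single grasp scores well enough (large garment,
    # likely to sag), suppress the region around the first grasp and pick a second.
    if scores[primary] < sag_threshold:
        r, c = primary
        scores[max(r - 40, 0): r + 40, max(c - 40, 0): c + 40] = -np.inf
        if np.isfinite(scores).any():
            grasps.append(np.unravel_index(np.argmax(scores), scores.shape))
    return {"target_index": idx, "grasps": grasps}
```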

πŸ“ Abstract
Garment manipulation has attracted increasing attention due to its critical role in home-assistant robotics. However, the majority of existing garment manipulation works assume an initial state consisting of only one garment, while piled garments are far more common in real-world settings. To bridge this gap, we propose a novel garment retrieval pipeline that can not only follow language instructions to execute safe and clean retrieval but also guarantee that exactly one garment is retrieved per attempt, establishing a robust foundation for the execution of downstream tasks (e.g., folding, hanging, wearing). Our pipeline seamlessly integrates vision-language reasoning with visual affordance perception, fully leveraging the high-level reasoning and planning capabilities of VLMs alongside the generalization power of visual affordance for low-level actions. To enhance the VLM's comprehensive awareness of each garment's state within a garment pile, we employ a visual segmentation model (SAM2) to perform object segmentation on the pile, supplying the VLM with sufficient visual cues for reasoning. A mask fine-tuning mechanism is further integrated to address scenarios where the initial segmentation results are suboptimal. In addition, a dual-arm cooperation framework is deployed to address cases involving large or long garments, as well as excessive garment sagging caused by incorrect grasping point determination, both of which are challenging for a single arm to handle. The effectiveness of our pipeline is consistently demonstrated across diverse tasks and varying scenarios in both real-world and simulation environments. Project page: https://garmentpile2.github.io/.
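The abstract does not detail the mask fine-tuning mechanism; one plausible reading is an iterative re-prompting loop that feeds interior points of a suboptimal mask back to the segmenter as positive hints. The following is a minimal sketch under that assumption, where `resegment`, `min_area`, `max_rounds`, and the five-point sampling are all hypothetical choices, not the paper's method.

```python
from typing import Callable, List, Tuple
import numpy as np
from scipy import ndimage

def refine_mask(
    mask: np.ndarray,
    resegment: Callable[[List[Tuple[int, int]]], np.ndarray],  # re-prompts the segmenter with points
    min_area: int = 500,
    max_rounds: int = 3,
) -> np.ndarray:
    """Re-prompt the segmenter while the mask looks suboptimal (tiny or fragmented)."""
    rng = np.random.default_rng(0)
    for _ in range(max_rounds):
        n_parts = ndimage.label(mask)[1]          # count connected components
        if mask.sum() >= min_area and n_parts == 1:
            return mask                           # mask already looks clean
        ys, xs = np.nonzero(mask)
        if len(ys) == 0:
            break                                 # nothing left to anchor a re-prompt on
        # Sample a few interior points as positive prompts for the next round.
        pick = rng.choice(len(ys), size=min(5, len(ys)), replace=False)
        mask = resegment([(int(xs[i]), int(ys[i])) for i in pick])
    return mask
```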
Problem

Research questions and friction points this paper is trying to address.

garment manipulation
cluttered garments
garment retrieval
vision-language reasoning
visual affordance
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language reasoning
visual affordance
garment segmentation
dual-arm cooperation
mask fine-tuning
πŸ”Ž Similar Papers
No similar papers found.