GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning

πŸ“… 2026-03-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of robotic cloth manipulation in realistic scenarios where garments are often stacked, a condition poorly handled by existing methods that typically assume isolated single garments. The paper proposes a novel approach that integrates vision-language reasoning with visual affordance perception to enable safe and accurate extraction of target garments from cluttered stacks under natural language instructions. Key innovations include the first deep fusion of vision-language models (VLMs) with affordance-aware perception, the incorporation of SAM2 for high-fidelity instance segmentation enhanced by a mask fine-tuning mechanism for improved state estimation, and a dual-arm cooperative grasping strategy tailored for large or highly deformable garments. Extensive experiments in both simulation and real-world settings demonstrate the method's robustness and high success rate in complex stacking configurations, establishing a solid foundation for downstream tasks such as folding and hanging.
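Read together, the summary describes a three-stage control flow: instance segmentation of the pile, VLM-based selection of the instructed garment, and affordance-scored grasp selection with a dual-arm fallback. The sketch below is a minimal, hypothetical rendering of that flow, not the paper's actual interfaces: `segment`, `select_target`, `affordance_map`, the `sag_threshold`, and the suppression radius are all placeholder assumptions standing in for the learned components.

```python
from typing import Callable, List
import numpy as np

def retrieve_garment(
    rgb: np.ndarray,
    instruction: str,
    segment: Callable[[np.ndarray], List[np.ndarray]],          # e.g. a SAM2 wrapper
    select_target: Callable[[np.ndarray, List[np.ndarray], str], int],  # VLM reasoning
    affordance_map: Callable[[np.ndarray, np.ndarray], np.ndarray],     # per-pixel grasp scores
    sag_threshold: float = 0.5,
) -> dict:
    """One retrieval attempt over a garment pile (hypothetical sketch)."""
    masks = segment(rgb)                          # instance masks for each garment in the pile
    idx = select_target(rgb, masks, instruction)  # VLM picks which garment the instruction names
    target = masks[idx].astype(bool)

    # Score grasp points, restricted to the target garment's mask.
    scores = np.where(target, affordance_map(rgb, target), -np.inf)
    primary = np.unravel_index(np.argmax(scores), scores.shape)

    grasps = [primary]
    # Dual-arm fallback: if no single grasp scores well enough (large garment,
    # likely to sag), suppress the region around the first grasp and pick a second.
    if scores[primary] < sag_threshold:
        r, c = primary
        scores[max(r - 40, 0): r + 40, max(c - 40, 0): c + 40] = -np.inf
        if np.isfinite(scores).any():
            grasps.append(np.unravel_index(np.argmax(scores), scores.shape))
    return {"target_index": idx, "grasps": grasps}
```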

πŸ“ Abstract
Garment manipulation has attracted increasing attention due to its critical role in home-assistant robotics. However, the majority of existing garment manipulation works assume an initial state consisting of only one garment, while piled garments are far more common in real-world settings. To bridge this gap, we propose a novel garment retrieval pipeline that can not only follow language instructions to execute safe and clean retrieval but also guarantee that exactly one garment is retrieved per attempt, establishing a robust foundation for the execution of downstream tasks (e.g., folding, hanging, wearing). Our pipeline seamlessly integrates vision-language reasoning with visual affordance perception, fully leveraging the high-level reasoning and planning capabilities of VLMs alongside the generalization power of visual affordance for low-level actions. To enhance the VLM's comprehensive awareness of each garment's state within a garment pile, we employ a visual segmentation model (SAM2) to perform object segmentation on the pile, supplying the VLM with sufficient visual cues for reasoning. A mask fine-tuning mechanism is further integrated to address scenarios where the initial segmentation results are suboptimal. In addition, a dual-arm cooperation framework is deployed to address cases involving large or long garments, as well as excessive garment sagging caused by incorrect grasping point determination, both of which are challenging for a single arm to handle. The effectiveness of our pipeline is consistently demonstrated across diverse tasks and varying scenarios in both real-world and simulation environments. Project page: https://garmentpile2.github.io/.
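The abstract does not detail the mask fine-tuning mechanism; one plausible reading is an iterative re-prompting loop that feeds interior points of a suboptimal mask back to the segmenter as positive hints. The following is a minimal sketch under that assumption, where `resegment`, `min_area`, `max_rounds`, and the five-point sampling are all hypothetical choices, not the paper's method.

```python
from typing import Callable, List, Tuple
import numpy as np
from scipy import ndimage

def refine_mask(
    mask: np.ndarray,
    resegment: Callable[[List[Tuple[int, int]]], np.ndarray],  # re-prompts the segmenter with points
    min_area: int = 500,
    max_rounds: int = 3,
) -> np.ndarray:
    """Re-prompt the segmenter while the mask looks suboptimal (tiny or fragmented)."""
    rng = np.random.default_rng(0)
    for _ in range(max_rounds):
        n_parts = ndimage.label(mask)[1]          # count connected components
        if mask.sum() >= min_area and n_parts == 1:
            return mask                           # mask already looks clean
        ys, xs = np.nonzero(mask)
        if len(ys) == 0:
            break                                 # nothing left to anchor a re-prompt on
        # Sample a few interior points as positive prompts for the next round.
        pick = rng.choice(len(ys), size=min(5, len(ys)), replace=False)
        mask = resegment([(int(xs[i]), int(ys[i])) for i in pick])
    return mask
```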
Problem

Research questions and friction points this paper is trying to address.

garment manipulation
cluttered garments
garment retrieval
vision-language reasoning
visual affordance
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language reasoning
visual affordance
garment segmentation
dual-arm cooperation
mask fine-tuning
πŸ”Ž Similar Papers
No similar papers found.