AI Summary
To address the real-time challenge of manipulating highly self-occluded, crumpled, and suspended garments, this paper proposes a bimanual, confidence-aware manipulation framework. Methodologically: (1) a dense visual correspondence model trained with a distributional loss enables robust inter-frame feature matching; (2) a self-supervised visuotactile grasp affordance network jointly predicts graspable regions and their associated uncertainty; (3) a perception-confidence-driven reactive state machine supports task-agnostic grasp selection and cross-modal policy transfer. Evaluated in both simulation and real-world settings, the system achieves, for the first time, stable folding and hanging of severely occluded, suspended clothing. Moreover, it generalizes by extracting grasp targets directly from human demonstration videos. Experiments demonstrate significant improvements in dynamic adaptability, robustness to occlusion and deformation, and cross-task generalization.
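To make the confidence mechanism concrete, one common way to obtain a per-match confidence from a dense correspondence model is to softmax the descriptor similarities into a match distribution and score its peakedness via normalized entropy. The sketch below is illustrative only; the paper's exact distributional loss and confidence estimator are not specified here, and the function name and temperature parameter are assumptions.

```python
import numpy as np

def correspondence_with_confidence(query_desc, target_descs, temperature=0.1):
    """Match one query descriptor against candidate target descriptors.

    Illustrative sketch (not the paper's exact formulation): returns the
    best-match index and a confidence in [0, 1] derived from the entropy
    of the softmax match distribution (low entropy = peaked = confident).
    """
    # Cosine similarity between the query and every target descriptor.
    q = query_desc / np.linalg.norm(query_desc)
    t = target_descs / np.linalg.norm(target_descs, axis=1, keepdims=True)
    sims = t @ q

    # Temperature-scaled softmax over candidate locations -> distribution.
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Entropy normalized by log(N) lies in [0, 1]; confidence = 1 - entropy.
    entropy = -(probs * np.log(probs + 1e-12)).sum() / np.log(len(probs))
    return int(probs.argmax()), float(1.0 - entropy)
```

A symmetric garment (e.g. two indistinguishable sleeve cuffs) would yield a bimodal, higher-entropy distribution and hence lower confidence, which is the signal the reactive state machine can act on.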
Abstract
Manipulating clothing is challenging due to complex configurations, variable material dynamics, and frequent self-occlusion. Prior systems often flatten garments or assume visibility of key features. We present a dual-arm visuotactile framework that combines confidence-aware dense visual correspondence and tactile-supervised grasp affordance to operate directly on crumpled and suspended garments. The correspondence model is trained on a custom, high-fidelity simulated dataset using a distributional loss that captures cloth symmetries and generates correspondence confidence estimates. These estimates guide a reactive state machine that adapts folding strategies based on perceptual uncertainty. In parallel, a visuotactile grasp affordance network, self-supervised using high-resolution tactile feedback, determines which regions are physically graspable. The same tactile classifier is used during execution for real-time grasp validation. By deferring action in low-confidence states, the system handles highly occluded table-top and in-air configurations. We demonstrate our task-agnostic grasp selection module in folding and hanging tasks. Moreover, our dense descriptors provide a reusable intermediate representation for other planning modalities, such as extracting grasp targets from human video demonstrations, paving the way for more generalizable and scalable garment manipulation.
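The abstract's key control idea, deferring action in low-confidence states and validating grasps with the tactile classifier at execution time, can be sketched as a small confidence-gated state machine. This is a minimal illustration under assumed state names and threshold; the paper's actual state machine is richer.

```python
from enum import Enum, auto

class State(Enum):
    PERCEIVE = auto()     # run dense correspondence, estimate confidence
    DEFER = auto()        # confidence too low: re-observe instead of acting
    GRASP = auto()        # attempt grasp at the selected region
    MANIPULATE = auto()   # fold / hang once the grasp is tactilely validated

def next_state(state, correspondence_conf, grasp_valid, conf_threshold=0.7):
    """One transition of an illustrative confidence-gated state machine.

    Acting is only permitted when perceptual confidence clears the
    threshold; otherwise the system defers and re-perceives, per the
    abstract. The threshold value 0.7 is an assumption.
    """
    if state is State.PERCEIVE:
        return State.GRASP if correspondence_conf >= conf_threshold else State.DEFER
    if state is State.DEFER:
        return State.PERCEIVE  # re-observe the (possibly perturbed) garment
    if state is State.GRASP:
        # The tactile classifier validates the grasp in real time.
        return State.MANIPULATE if grasp_valid else State.PERCEIVE
    return State.PERCEIVE
```

The design choice worth noting is that failure at either gate (visual confidence or tactile validation) routes back to perception rather than forcing an action, which is how the system tolerates heavy occlusion.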