Demonstrating Multi-Suction Item Picking at Scale via Multi-Modal Learning of Pick Success

📅 2025-06-12
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses high-speed multi-suction-cup grasping in warehouse settings involving disordered stacking and open-set objects. Method: We propose a sparse-annotation–guided, multimodal (RGB/depth/semantic segmentation) visual encoder trained on real industrial data to predict grasp success probability. To our knowledge, this is the first approach to jointly optimize multimodal pretraining and domain-specific fine-tuning in large-scale deployment; we find that cross-modal relationships are effectively captured during pretraining, enabling high inference accuracy even with partial modality inputs. Contributions/Results: (1) A lightweight, low-latency, cross-domain multimodal grasping model; (2) Significant success-rate improvements on challenging datasets featuring large objects, severe occlusion, and deformable container-like items; (3) Real-world validation confirming industrial-grade throughput, real-time performance (<50 ms latency), and strong generalization to open-set objects and complex stacking configurations.
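
The summary above describes a lightweight multimodal visual encoder that maps RGB, depth, and semantic-segmentation inputs to a grasp-success probability. The sketch below is a hypothetical, minimal PyTorch version of that idea; the module names, layer sizes, and simple concatenation-based fusion are illustrative assumptions, not the architecture reported in the paper.

```python
# Hypothetical sketch of a lightweight multimodal pick-success predictor.
# Names, layer sizes, and the fusion scheme are assumptions for illustration.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Small CNN mapping one modality crop (RGB, depth, or segmentation)
    to a fixed-size embedding."""

    def __init__(self, in_channels: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class PickSuccessModel(nn.Module):
    """Fuses per-modality embeddings and predicts the probability that a
    candidate multi-suction pick succeeds."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.encoders = nn.ModuleDict({
            "rgb": ModalityEncoder(3, embed_dim),
            "depth": ModalityEncoder(1, embed_dim),
            "seg": ModalityEncoder(1, embed_dim),
        })
        self.head = nn.Sequential(
            nn.Linear(3 * embed_dim, 128), nn.ReLU(), nn.Linear(128, 1),
        )

    def forward(self, inputs: dict) -> torch.Tensor:
        feats = [self.encoders[m](inputs[m]) for m in ("rgb", "depth", "seg")]
        logit = self.head(torch.cat(feats, dim=-1))
        return torch.sigmoid(logit).squeeze(-1)  # pick-success probability


# Example: score a batch of 4 candidate picks from per-pick image crops.
model = PickSuccessModel()
batch = {
    "rgb": torch.rand(4, 3, 96, 96),
    "depth": torch.rand(4, 1, 96, 96),
    "seg": torch.rand(4, 1, 96, 96),
}
print(model(batch))  # tensor of 4 success probabilities in [0, 1]
```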

📝 Abstract
This work demonstrates how autonomously learning aspects of robotic operation from sparsely-labeled, real-world data of deployed, engineered solutions at industrial scale can provide solutions that achieve improved performance. Specifically, it focuses on multi-suction robot picking and performs a comprehensive study on the application of multi-modal visual encoders for predicting the success of candidate robotic picks. Picking diverse items from unstructured piles is an important and challenging task for robot manipulation in real-world settings, such as warehouses. Methods for picking from clutter must work for an open set of items while simultaneously meeting latency constraints to achieve high throughput. The demonstrated approach utilizes multiple input modalities, such as RGB, depth and semantic segmentation, to estimate the quality of candidate multi-suction picks. The strategy is trained from real-world item-picking data with a combination of multimodal pretraining and finetuning. The manuscript provides a comprehensive experimental evaluation performed over a large item-picking dataset, an item-picking dataset targeted to include partial occlusions, and a package-picking dataset, which focuses on containers, such as boxes and envelopes, instead of unpackaged items. The evaluation measures performance for different item configurations, pick scenes, and object types. Ablations help to understand the effects of in-domain pretraining, the impact of different modalities, and the importance of finetuning. These ablations reveal both the importance of training over multiple modalities and the ability of models to learn the relationships between modalities during pretraining, so that only a subset of them is needed as input during finetuning and inference.
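
One way to obtain the behavior described in the abstract, where cross-modal relationships learned during pretraining let the model run on only a subset of modalities at finetuning and inference time, is to randomly drop modalities during pretraining. The sketch below illustrates that idea under this assumption, reusing the hypothetical PickSuccessModel from the earlier sketch; the dropout probability and zero-filling of missing modalities are illustrative choices, not the paper's exact training recipe.

```python
# Minimal sketch of modality dropout during (pre)training so that inference
# with a subset of modalities (e.g. RGB only) still works. All specifics
# below are assumptions for illustration.
import random
import torch

MODALITIES = ("rgb", "depth", "seg")


def drop_modalities(inputs: dict, keep_prob: float = 0.7) -> dict:
    """Zero out each modality independently with probability 1 - keep_prob,
    always keeping at least one modality."""
    kept = [m for m in MODALITIES if random.random() < keep_prob]
    if not kept:
        kept = [random.choice(MODALITIES)]
    return {m: (x if m in kept else torch.zeros_like(x))
            for m, x in inputs.items()}


def pretrain_step(model, batch, labels, optimizer):
    """One supervised pretraining step on pick outcomes with modality dropout."""
    optimizer.zero_grad()
    probs = model(drop_modalities(batch))
    loss = torch.nn.functional.binary_cross_entropy(probs, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```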
Problem

Research questions and friction points this paper is trying to address.

Predict multi-suction robot pick success using multi-modal learning
Handle diverse items in unstructured piles for warehouse automation
Optimize picking performance with RGB, depth, and segmentation inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal learning for pick success prediction
Real-world data training with multimodal pretraining and finetuning
Multi-suction picking using RGB, depth, segmentation
🔎 Similar Papers
No similar papers found.
Che Wang
PhD student, NYU Tandon CSE Shanghai Track
Deep learning, representation learning, reinforcement learning
J. Baar
Amazon Robotics, North Reading, MA
Chaitanya Mitash
Amazon Robotics, North Reading, MA
Shuai Li
Amazon Robotics, Seattle, WA
Dylan Randle
Amazon
Artificial Intelligence, Machine Learning, Robotics
Weiyao Wang
Amazon Robotics, North Reading, MA
S. Sontakke
Amazon Robotics, Seattle, WA
Kostas E. Bekris
Amazon Robotics, North Reading, MA
Kapil D. Katyal
Amazon Robotics, Arlington, VA