Shaken, Not Stirred: A Novel Dataset for Visual Understanding of Glasses in Human-Robot Bartending Tasks

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Detecting transparent or reflective glassware remains challenging in human-robot collaborative bartending, particularly due to poor subclass discrimination in open-vocabulary detection models. Method: We introduce GlassBartend—the first robot-operational RGB-D dataset of glassware with pixel-accurate ground-truth annotations (7,850 frames, five viewpoints)—and propose a depth-sensor-driven automated annotation pipeline to drastically reduce manual labeling effort. Our approach integrates depth-guided annotation, open-vocabulary object detection training, and embodied AI system deployment. Contribution/Results: We achieve the first end-to-end closed-loop bartending demonstration on the humanoid robot NICOL. Experiments show our baseline model outperforms existing open-vocabulary detectors, achieving an 81% task success rate on NICOL—establishing a new benchmark and practical paradigm for transparent-object perception and embodied manipulation.

📝 Abstract
Datasets for object detection often lack sufficient variety of glasses, owing to their transparent and reflective properties. In particular, open-vocabulary object detectors, widely used in embodied robotic agents, fail to distinguish subclasses of glasses. This gap poses an issue for robotic applications, which suffer from accumulating errors across detection, planning, and action execution. The paper introduces a novel method for acquiring real-world data from RGB-D sensors that minimizes human effort. We propose an auto-labeling pipeline that generates labels for all acquired frames based on depth measurements. We provide a novel real-world glass object dataset collected on the Neuro-Inspired COLlaborator (NICOL), a humanoid robot platform. The dataset consists of 7,850 images recorded from five different cameras. We show that our trained baseline model outperforms state-of-the-art open-vocabulary approaches. In addition, we deploy our baseline model in an embodied agent on the NICOL platform, where it achieves a success rate of 81% in a human-robot bartending scenario.
Problem

Research questions and friction points this paper is trying to address.

Lack of diverse glass datasets for object detection
Difficulty in distinguishing glass subclasses in robotic tasks
Errors in detection, planning, and action execution in robotics
Innovation

Methods, ideas, or system contributions that make the work stand out.

RGB-D sensors for real-world data acquisition
Auto-labeling pipeline using depth measurements
Novel glass object dataset for robotic applications
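To make the auto-labeling idea concrete, here is a minimal sketch of depth-based label generation: compare each frame's depth map against an object-free background capture and derive a foreground mask and bounding box from the difference. The function name, thresholds, and background-subtraction scheme are illustrative assumptions, not the paper's exact pipeline (which must also cope with the noisy or missing depth that transparent glass typically produces).

```python
import numpy as np

def depth_autolabel(depth, background_depth, min_diff_m=0.02, min_area_px=50):
    """Auto-label an RGB-D frame via background depth subtraction (illustrative sketch).

    depth, background_depth: HxW arrays of metric depth in meters (0 = invalid reading).
    Returns a boolean foreground mask and a bounding box (x0, y0, x1, y1),
    or (mask, None) if no sufficiently large foreground region is found.
    """
    # Only compare pixels where both captures returned a valid depth reading.
    valid = (depth > 0) & (background_depth > 0)
    # Foreground: pixels measurably closer to the camera than the empty scene.
    mask = valid & ((background_depth - depth) > min_diff_m)
    if mask.sum() < min_area_px:
        return mask, None
    ys, xs = np.nonzero(mask)
    bbox = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)
    return mask, bbox
```

For example, a 10x10 patch placed 0.3 m in front of a flat 1.0 m background yields a mask of 100 pixels and a tight bounding box around the patch.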