🤖 AI Summary
This work addresses autonomous cocktail preparation driven by natural-language voice commands. It proposes Shake-VLA, a vision-language-action (VLA) end-to-end robotic system designed for bimanual liquid manipulation. The method integrates automatic speech recognition (ASR), optical character recognition (OCR) for label identification, bimanual motion planning, force/torque (FT) sensing to measure the quantity of liquid poured, and a retrieval-augmented generation (RAG) module for retrieving and adapting recipes. An anomaly detection mechanism handles ingredient availability issues such as missing ingredients. Key contributions include: (1) a VLA architecture aimed at precise liquid pouring; (2) RAG-driven recipe retrieval and adaptation; and (3) precise measurement of poured quantities via FT feedback. Experiments report a 93% speech-to-text success rate in noisy environments, a 91% object and label detection success rate in cluttered environments, a 95% identification rate for discrepancies between detected ingredients and recipe requirements, and a 100% overall success rate from recipe formulation to action generation.
📝 Abstract
This paper introduces Shake-VLA, a Vision-Language-Action (VLA) model-based system designed to enable bimanual robotic manipulation for automated cocktail preparation. The system integrates a vision module for detecting ingredient bottles and reading labels, a speech-to-text module for interpreting user commands, and a language model to generate task-specific robotic instructions. Force/torque (FT) sensors are employed to precisely measure the quantity of liquid poured, ensuring accurate ingredient proportions during the mixing process. The system architecture includes a Retrieval-Augmented Generation (RAG) module for accessing and adapting recipes, an anomaly detection mechanism to address ingredient availability issues, and bimanual robotic arms for dexterous manipulation. Experimental evaluations demonstrated a high success rate across system components. The speech-to-text module achieved a 93% success rate in noisy environments, and the vision module attained a 91% success rate in object and label detection in cluttered environments. The anomaly module successfully identified 95% of discrepancies between detected ingredients and recipe requirements, and the system achieved an overall success rate of 100% in preparing cocktails, from recipe formulation to action generation.
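Two of the mechanisms described above, FT-based measurement of poured liquid and detection of discrepancies between detected ingredients and a recipe, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how FT readings are converted to volume or how labels are matched, so the function names, the weight-difference model, the default water-like density, and the case-insensitive label matching are all assumptions made for illustration.

```python
G = 9.80665  # standard gravity in m/s^2


def poured_volume_ml(force_before_n, force_after_n, density_kg_per_m3=1000.0):
    """Estimate poured volume from the drop in vertical force on the arm holding the bottle.

    Assumption: the FT sensor's vertical reading changes only because liquid
    left the bottle, so the force difference equals the poured weight.
    """
    delta_force_n = force_before_n - force_after_n
    mass_kg = delta_force_n / G
    volume_m3 = mass_kg / density_kg_per_m3
    return volume_m3 * 1e6  # 1 m^3 = 1e6 ml


def missing_ingredients(recipe, detected_labels):
    """Return recipe ingredients with no matching detected label (hypothetical matching rule).

    Assumption: labels match by case-insensitive, whitespace-trimmed equality.
    """
    detected = {label.strip().lower() for label in detected_labels}
    return sorted(name for name in recipe if name.strip().lower() not in detected)


if __name__ == "__main__":
    # Illustrative values only, not data from the paper.
    recipe = {"rum": 50, "lime juice": 25, "mint": 5}
    print(missing_ingredients(recipe, ["Rum", "Lime Juice "]))
    print(round(poured_volume_ml(10.0, 10.0 - 0.05 * G), 3))
```

In a system like the one described, a non-empty result from the discrepancy check would presumably trigger recipe adaptation through the RAG module, and the volume estimate would be polled during pouring to decide when to stop. Both of those connections are inferred from the abstract rather than stated in it.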