Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing

📅 2025-01-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses autonomous cocktail preparation driven by natural-language voice commands. It proposes Shake-VLA, a vision-language-action (VLA) model-based robotic system designed for bimanual manipulation and liquid mixing. The system integrates automatic speech recognition (ASR), optical character recognition (OCR) for label identification, bimanual manipulation, force-torque (FT) sensing to measure the quantity of liquid poured, and a retrieval-augmented generation (RAG) module for accessing and adapting recipes. An anomaly detection mechanism handles ingredient availability issues such as missing ingredients. Key contributions include: (1) a VLA-based architecture for cocktail preparation with precise liquid pouring; (2) RAG-driven recipe retrieval and adaptation; and (3) FT-sensor feedback that keeps ingredient proportions accurate. Experiments report a 93% speech-to-text success rate in noisy environments, a 91% success rate for object and label detection in cluttered environments, a 95% identification rate for discrepancies between detected ingredients and recipe requirements, and a 100% overall success rate in preparing cocktails, from recipe formulation to action generation.
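As a rough illustration of how force-torque feedback could be turned into a poured-volume estimate, the sketch below converts a change in measured vertical force into milliliters. It is not the authors' implementation: the function names, the assumption that the sensor reports the weight of the held container in newtons, and the default density are all hypothetical choices made for this example.

```python
STANDARD_GRAVITY = 9.80665  # m/s^2, standard gravitational acceleration


def poured_volume_ml(baseline_force_n, current_force_n, density_g_per_ml=1.0):
    """Estimate poured volume (mL) from a change in vertical force (N).

    Assumption: the force change equals the weight of the liquid that was
    transferred, so its magnitude is used regardless of whether the sensor
    sits on the pouring arm or the receiving arm.
    """
    if density_g_per_ml <= 0:
        raise ValueError("density must be positive")
    delta_force_n = abs(current_force_n - baseline_force_n)
    mass_g = delta_force_n / STANDARD_GRAVITY * 1000.0  # N -> kg -> g
    return mass_g / density_g_per_ml


def should_stop_pouring(baseline_force_n, current_force_n, target_ml,
                        density_g_per_ml=1.0):
    """Return True once the estimated poured volume reaches the target."""
    poured = poured_volume_ml(baseline_force_n, current_force_n, density_g_per_ml)
    return poured >= target_ml
```

In a control loop, `should_stop_pouring` would be polled with fresh sensor readings until it returns `True`. A real system would also need filtering of sensor noise and compensation for arm motion, which this sketch omits.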

📝 Abstract

This paper introduces Shake-VLA, a Vision-Language-Action (VLA) model-based system designed to enable bimanual robotic manipulation for automated cocktail preparation. The system integrates a vision module for detecting ingredient bottles and reading labels, a speech-to-text module for interpreting user commands, and a language model to generate task-specific robotic instructions. Force-torque (FT) sensors are employed to precisely measure the quantity of liquid poured, ensuring accuracy in ingredient proportions during the mixing process. The system architecture includes a Retrieval-Augmented Generation (RAG) module for accessing and adapting recipes, an anomaly detection mechanism to address ingredient availability issues, and bimanual robotic arms for dexterous manipulation. Experimental evaluations demonstrated a high success rate across system components. The speech-to-text module achieved a 93% success rate in noisy environments, and the vision module attained a 91% success rate in object and label detection in cluttered environments. The anomaly module successfully identified 95% of discrepancies between detected ingredients and recipe requirements, and the system achieved an overall success rate of 100% in preparing cocktails, from recipe formulation to action generation.
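The anomaly check described in the abstract, which compares detected ingredients against recipe requirements, can be pictured with a minimal sketch. The function name, the recipe representation (a mapping from ingredient name to milliliters), and the case-insensitive exact matching are assumptions made for illustration. The paper's actual mechanism is not specified here.

```python
def find_missing_ingredients(recipe, detected_labels):
    """Return the recipe ingredients that are absent from the detected labels.

    recipe: dict mapping ingredient name -> required volume in mL (hypothetical format)
    detected_labels: iterable of label strings read by the vision module
    """
    # Normalize labels so that "SYRUP" and " syrup " match "Syrup".
    detected = {label.strip().lower() for label in detected_labels}
    return [name for name in recipe if name.strip().lower() not in detected]


recipe = {"Rum": 50, "Lime Juice": 25, "Syrup": 15}
missing = find_missing_ingredients(recipe, ["rum ", "SYRUP"])
print(missing)
```

A non-empty result would be the trigger for adapting the recipe, for example by asking the RAG module for an alternative that uses only the available bottles.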

Problem

Research questions and friction points this paper is trying to address.

Automated Cocktail Mixing

Multimodal Understanding

Robotic Execution

Innovation

Methods, ideas, or system contributions that make the work stand out.

Shake-VLA

Multi-sensory Integration

Autonomous Cocktail Mixing

🔎 Similar Papers

Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks

📅 2024-04-02

IEEE/RSJ International Conference on Intelligent Robots and Systems

📈 Citations: 0