🤖 AI Summary
To address the insufficient real-time decision-making capability of domestic service robots on resource-constrained edge devices, this paper proposes RDMM, a lightweight, domain-specific, on-device decision-making model. Methodologically, the authors introduce the first on-device large language model (LLM) framework supporting capability-aware reasoning and context-aware adaptation; integrate ViT, Whisper, and CLIP for multimodal visual–auditory understanding; and design a real-time vision–language–action joint planning architecture. Through model fine-tuning, inference optimization, and on-device deployment under an 8 GB memory constraint, the entire pipeline runs fully locally. Key contributions include: (1) releasing the first household task-oriented dataset, comprising 27k planning instances and 1.3k image–text annotations; (2) achieving 93% planning accuracy; and (3) open-sourcing all models, code, benchmarks, and datasets to enable real-time deployment on edge hardware.
📝 Abstract
Large language models (LLMs) represent a significant advancement in integrating physical robots with AI-driven systems. This research introduces a framework built on RDMM (Robotics Decision-Making Models), which can make decisions within domain-specific contexts and are aware of their own knowledge and capabilities. The framework leverages this information to enhance the system's autonomous decision-making. We showcase its capabilities in the context of a real-world household competition. In contrast to other approaches, we focus on real-time, on-device solutions that run successfully on hardware with as little as 8 GB of memory. The framework incorporates visual perception models that equip robots with an understanding of their environment, and it integrates real-time speech recognition to improve human-robot interaction. Experimental results demonstrate that the RDMM framework plans with 93% accuracy. Furthermore, we introduce a new dataset consisting of 27k planning instances and 1.3k text-image annotated samples derived from the competition. The framework, benchmarks, datasets, and models developed in this work are publicly available on our GitHub repository at https://github.com/shadynasrat/RDMM.