AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots

📅 2024-09-18
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Household robots struggle to align VLM-based task planning with user reminders because the reminders are few, diverse, and multimodal. Method: AlignBot fine-tunes LLaVA-7B to serve as an adapter for GPT-4o. The adapter converts user reminders (personalized preferences, corrective guidance, and contextual assistance) into structured, instruction-formatted cues that prompt GPT-4o, and a dynamic retrieval mechanism adds task-relevant historical successes to the prompt to improve planning accuracy. Contribution/Results: In household environments constructed in the laboratory, using a multimodal dataset of over 1,500 entries derived from volunteer reminders, AlignBot achieves an 86.8% task success rate versus 21.6% for the vanilla GPT-4o baseline, a gain of 65.2 percentage points and about four times the baseline rate.

📝 Abstract
This paper presents AlignBot, a novel framework designed to optimize VLM-powered customized task planning for household robots by effectively aligning with user reminders. In domestic settings, aligning task planning with user reminders poses significant challenges due to the limited quantity, diversity, and multimodal nature of the reminders. To address these challenges, AlignBot employs a fine-tuned LLaVA-7B model that functions as an adapter for GPT-4o. This adapter model internalizes diverse forms of user reminders (such as personalized preferences, corrective guidance, and contextual assistance) into structured, instruction-formatted cues that prompt GPT-4o to generate customized task plans. Additionally, AlignBot integrates a dynamic retrieval mechanism that selects task-relevant historical successes as prompts for GPT-4o, further enhancing task planning accuracy. To validate the effectiveness of AlignBot, experiments are conducted in household environments constructed within the laboratory to replicate typical household settings. A multimodal dataset with over 1,500 entries derived from volunteer reminders is used for training and evaluation. The results demonstrate that AlignBot significantly improves customized task planning, outperforming existing LLM- and VLM-powered planners by interpreting and aligning with user reminders. It achieves an 86.8% success rate compared to 21.6% for the vanilla GPT-4o baseline, a gain of 65.2 percentage points and about four times the baseline success rate. Supplementary materials are available at: https://yding25.com/AlignBot/
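The headline comparison in the abstract can be restated three ways (absolute gain, ratio, and relative gain), which are easy to mix up. The short check below uses only the two success rates reported above; the variable names are illustrative and not from the paper.

```python
# Success rates as reported in the abstract (percent).
alignbot_success = 86.8
gpt4o_success = 21.6  # vanilla GPT-4o baseline

# Absolute gain, measured in percentage points.
absolute_gain = alignbot_success - gpt4o_success

# Ratio of the two success rates ("times the baseline").
ratio = alignbot_success / gpt4o_success

# Relative gain over the baseline, in percent.
relative_gain = (alignbot_success - gpt4o_success) / gpt4o_success * 100

print(f"absolute gain: {absolute_gain:.1f} percentage points")
print(f"ratio: {ratio:.2f}x the baseline")
print(f"relative gain: {relative_gain:.0f}%")
```

The absolute gain is 65.2 percentage points and the ratio is about 4.02, so "about four times the baseline" and "65.2 points higher" describe the same result, while the relative gain is roughly 302%.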
Problem

Research questions and friction points this paper is trying to address.

Aligning VLM-powered task planning with user reminders for household robots
Addressing limited quantity and diversity of multimodal user reminders
Improving task planning accuracy via fine-tuned LLaVA-7B and dynamic retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned LLaVA-7B model as GPT-4o adapter
Dynamic retrieval mechanism for historical successes
Multimodal dataset with over 1,500 entries derived from volunteer reminders
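The retrieval idea listed above (pick task-relevant historical successes and place them in the GPT-4o prompt alongside the adapter's cues) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `SuccessRecord` schema, the bag-of-words cosine similarity used as a stand-in for whatever relevance measure AlignBot actually uses, the prompt layout, and the sample data. No model is called; the sketch only assembles the prompt text.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class SuccessRecord:
    """One previously successful task and its plan (hypothetical schema)."""
    task: str
    plan: List[str]


def _bag_of_words(text: str) -> Counter:
    # Lowercased token counts; a simple stand-in for a learned embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_successes(task: str, history: List[SuccessRecord], k: int = 2) -> List[SuccessRecord]:
    """Return the k past successes whose task text is most similar to `task`."""
    query = _bag_of_words(task)
    ranked = sorted(
        history,
        key=lambda record: _cosine(query, _bag_of_words(record.task)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(task: str, cues: List[str], examples: List[SuccessRecord]) -> str:
    """Combine adapter cues and retrieved successes into one planner prompt."""
    lines = ["You are a household task planner.", "", "User cues:"]
    lines += [f"- {cue}" for cue in cues]
    lines += ["", "Past successful plans:"]
    for example in examples:
        lines.append(f"Task: {example.task}")
        lines += [f"  {i}. {step}" for i, step in enumerate(example.plan, 1)]
    lines += ["", f"New task: {task}", "Plan:"]
    return "\n".join(lines)


# Made-up history, for illustration only.
HISTORY = [
    SuccessRecord("put the mug in the dishwasher",
                  ["pick up mug", "open dishwasher", "place mug on top rack"]),
    SuccessRecord("water the plant on the balcony",
                  ["fill watering can", "walk to balcony", "pour water"]),
    SuccessRecord("put the plate in the dishwasher",
                  ["pick up plate", "open dishwasher", "place plate on bottom rack"]),
]

if __name__ == "__main__":
    new_task = "put the bowl in the dishwasher"
    cues = ["User prefers bowls on the top rack"]  # hypothetical adapter output
    print(build_prompt(new_task, cues, retrieve_successes(new_task, HISTORY)))
```

In this toy setup the two dishwasher records are retrieved for the bowl task and the plant-watering record is left out, so the prompt stays short and on-topic. A real system would likely replace the token-overlap similarity with an embedding-based one.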
Zhaxizhuoma Zhaxizhuoma
Shanghai Artificial Intelligence Laboratory
Pengan Chen
Shanghai Artificial Intelligence Laboratory, The University of Hong Kong
Ziniu Wu
Shanghai Artificial Intelligence Laboratory, University of Bristol
Jiawei Sun
Shanghai Artificial Intelligence Laboratory
Dong Wang
Shanghai Artificial Intelligence Laboratory
Peng Zhou
The University of Hong Kong
Nieqing Cao
Assistant Professor at Xi'an Jiaotong-Liverpool University
AI/ML in Smart Manufacturing, Robotics
Yan Ding
Shanghai Artificial Intelligence Laboratory
Bin Zhao
Shanghai Artificial Intelligence Laboratory, Northwestern Polytechnical University
Xuelong Li
Shanghai Artificial Intelligence Laboratory, Institute of Artificial Intelligence, China Telecom Corp Ltd