🤖 AI Summary
Current large language models (LLMs) struggle to reason about latent physical attributes, such as weight, stability, and base area, in long-horizon robotic stacking tasks. To address this, we propose a preference-aware multimodal large language model (MLLM) planning framework. Our method constructs a custom dataset capturing multidimensional stacking preferences and instruction-tunes the MLLM on it, so that physical constraints and multi-objective trade-offs are encoded implicitly and robust plans can be generated without explicit prompting. By jointly leveraging visual observations and semantic cues, our approach substantially improves stacking completion rates in large-scale simulation and enables online, long-horizon stacking on a physical humanoid robot platform. Key contributions include: (1) formalizing and modeling implicit physical stacking preferences; (2) introducing a lightweight preference-aware fine-tuning paradigm; and (3) empirically validating the fine-tuned MLLM as a high-level planner in an embodied robotic system.
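To make the dataset construction concrete, here is a minimal sketch of what a single preference-aware instruction-tuning sample might look like. All field names, attribute values, and the response format are illustrative assumptions; the actual schema is not specified here.

```python
# Hypothetical example of one preference-aware instruction-tuning sample.
# Field names, attribute ranges, and the response format are assumptions,
# not the paper's actual dataset schema.

objects = [
    {"id": "box_a", "weight_kg": 2.1, "stability": "high",   "footprint_cm2": 400},
    {"id": "box_b", "weight_kg": 0.4, "stability": "low",    "footprint_cm2": 150},
    {"id": "box_c", "weight_kg": 1.2, "stability": "medium", "footprint_cm2": 280},
]

# The instruction deliberately omits any explicit preference ordering:
# the fine-tuned model must trade off weight, stability, size, and
# footprint implicitly.
sample = {
    "images": ["box_a.png", "box_b.png", "box_c.png"],  # one view per object
    "instruction": "Stack these containers into a stable tower.",
    # Target plan: heavier, larger-footprint objects go toward the bottom.
    "response": "Place box_a first, then box_c on box_a, then box_b on top.",
}
```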
📝 Abstract
Pretrained large language models (LLMs) can serve as high-level robotic planners by reasoning over abstract task descriptions and natural language instructions. However, they lack the knowledge needed to plan long-horizon robotic manipulation tasks in which the physical properties of the objects are essential. An example is stacking containers with hidden objects inside, which requires reasoning over latent physical properties such as weight and stability. To this end, this paper proposes to use multimodal LLMs as high-level planners for such long-horizon robotic stacking tasks. The LLM takes multimodal inputs for each object to stack and infers the current best stacking sequence by reasoning over stacking preferences. Furthermore, to enable the LLM to reason over multiple preferences simultaneously without explicit instructions, we create a custom dataset covering stacking preferences, including weight, stability, size, and footprint, and use it to fine-tune the LLM. In large-scale simulation evaluations, the LLM fine-tuned on our custom dataset achieves higher stacking completion than the pretrained LLM with prompt tuning. Finally, we demonstrate the effectiveness of the proposed framework on a long-horizon stacking task with a real humanoid robot operating online.
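A minimal sketch of the online planning loop implied by the abstract, assuming the robot re-queries the fine-tuned MLLM after each placement. The helper callables (`capture_views`, `query_mllm`, `execute_stack`) are hypothetical stand-ins for the perception, planning, and control components, whose real interfaces are not given.

```python
from typing import Callable


def stack_all(
    remaining: list[str],
    capture_views: Callable[[list[str]], list[bytes]],
    query_mllm: Callable[[list[bytes], str], str],
    execute_stack: Callable[[str], None],
) -> None:
    """Online loop: query the MLLM for the next object, place it, repeat."""
    while remaining:
        views = capture_views(remaining)  # one image per candidate container
        prompt = (
            "Given these containers, which one should be stacked next to "
            "keep the tower stable? Candidates: " + ", ".join(remaining)
        )
        choice = query_mllm(views, prompt)  # expected to return an object id
        if choice not in remaining:  # guard against an invalid model answer
            choice = remaining[0]
        execute_stack(choice)        # hand off to a pick-and-place primitive
        remaining.remove(choice)     # replan over the remaining objects
```

Re-querying after every placement is what makes the planner online: the stacking sequence is re-inferred from the current scene rather than fixed up front.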