Multimodal Behavior Tree Generation: A Small Vision-Language Model for Robot Task Planning

📅 2026-03-06
🤖 AI Summary
This work addresses a key limitation in applying vision-language models to robotic task planning: the absence of multimodal datasets linking visual observations, natural-language instructions, and executable behavior trees. To bridge this gap, the authors propose a multi-stage data synthesis pipeline in which a large model acts as a teacher, automatically generating vision–language–behavior tree triplets from existing robot episodes (Open X-Embodiment). Using this newly constructed dataset, the first of its kind, they perform parameter-efficient fine-tuning (PEFT) on small vision-language models ranging from 500M to 4B parameters, enabling them to output behavior trees compatible with BehaviorTree.CPP. The resulting 4B model achieves an 87% task success rate on household tasks in an embodied simulator, approaching the performance of state-of-the-art closed-source models while requiring only a fraction of the computational resources.
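The paper does not publish its dataset schema, but a minimal sketch of what one vision–language–behavior tree triplet could look like is shown below. The field names and the action nodes (MoveTo, Pick, Place) are illustrative assumptions; only the BehaviorTree.CPP XML wrapper follows the library's documented format.

```python
# Sketch of one synthesized triplet: visual observation, instruction,
# and target behavior tree. Field names and node types are assumptions,
# not the paper's actual schema.
triplet = {
    # Visual observation drawn from an existing robot episode
    # (e.g., an Open X-Embodiment frame); path is hypothetical.
    "image": "episodes/kitchen_0042/frame_000.jpg",
    # Natural-language instruction paired with the observation.
    "instruction": "Put the red mug on the top shelf.",
    # Target output: a tree in BehaviorTree.CPP's XML format.
    "behavior_tree": """
<root BTCPP_format="4">
  <BehaviorTree ID="MainTree">
    <Sequence>
      <MoveTo target="red_mug"/>
      <Pick object="red_mug"/>
      <MoveTo target="top_shelf"/>
      <Place object="red_mug" location="top_shelf"/>
    </Sequence>
  </BehaviorTree>
</root>
""",
}
```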

📝 Abstract
Large and small language models have been widely used for robotic task planning. At the same time, vision-language models (VLMs) have successfully tackled problems such as image captioning, scene understanding, and visual question answering. In this work, we combine these two approaches by deploying a compact, open-source multimodal model to generate behavior trees for robotic task planning. The main obstacle to achieving this goal is the lack of an existing dataset that links visual observations and instructions to executable behavior trees. We propose a method to construct such a dataset starting from existing robotic episodes (i.e., Open X-Embodiment), in which a large model serves as a teacher in a multi-stage generation pipeline. We use this dataset to fine-tune VLMs ranging from 500M to 4B parameters via parameter-efficient fine-tuning (PEFT). The generated behavior trees, compatible with the BehaviorTree.CPP library, are evaluated both offline, using structural and lexical metrics, and online through the execution of household tasks in a state-of-the-art embodied simulator. Our results demonstrate that our fine-tuned 4B-parameter VLM approaches the performance of state-of-the-art closed-source models, achieving an 87% success rate while requiring only a fraction of the computational resources.
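The abstract names PEFT but not a specific method or base model. A minimal LoRA-style sketch using the Hugging Face transformers and peft libraries is given below; the checkpoint, target modules, and hyperparameters are assumptions for illustration, not the configuration used in the paper.

```python
# Minimal LoRA fine-tuning sketch (Hugging Face transformers + peft).
# Checkpoint, target modules, and hyperparameters are illustrative
# assumptions, not the paper's actual configuration.
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

model_id = "HuggingFaceTB/SmolVLM-500M-Instruct"  # hypothetical ~500M VLM choice

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

# Low-rank adapters on the attention projections; base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# Training would then proceed on (image, instruction) -> behavior-tree-XML pairs
# with a standard causal language-modeling loss over the XML tokens.
```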
Problem

Research questions and friction points this paper is trying to address.

behavior tree
vision-language model
robot task planning
multimodal dataset
embodied AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Behavior Tree
Vision-Language Model
Parameter-Efficient Fine-Tuning
Robot Task Planning
Embodied Simulation
Cristiano Battistini
Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Milan, Italy
Riccardo Andrea Izzo
Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Milan, Italy
Gianluca Bardaro
Research Fellow, AIRLab, Politecnico di Milano
Robotics
Matteo Matteucci
Full Professor, Department of Electronics, Information, and Bioengineering, Politecnico di Milano
Robotics
Machine Learning
Computer Vision
Pattern Recognition