VLM-driven Behavior Tree for Context-aware Task Planning

📅 2025-01-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address poor adaptability and rigid condition specification in vision-based robot task planning for unstructured environments, this paper proposes a Vision-Language Model (VLM)-driven behavior tree framework. Its core contribution is a self-prompting visual condition node: free-text visual conditions are embedded directly into the behavior tree, and multimodal VLMs (e.g., LLaVA, Qwen-VL) perform vision-language alignment at run time to evaluate each condition's truth value, enabling context-aware online planning and on-the-fly task editing. By integrating dynamic prompt engineering with real-time visual reasoning, the framework is deployed end-to-end in a real-world cafe setting, validating a closed-loop workflow from vision-and-semantics-driven task generation and editing through execution. Experiments demonstrate substantial improvements in task generalization and operational robustness for robots operating in complex, dynamic environments.


📝 Abstract
The use of Large Language Models (LLMs) for generating Behavior Trees (BTs) has recently gained attention in the robotics community, yet remains in its early stages of development. In this paper, we propose a novel framework that leverages Vision-Language Models (VLMs) to interactively generate and edit BTs that address visual conditions, enabling context-aware robot operations in visually complex environments. A key feature of our approach lies in the conditional control through self-prompted visual conditions. Specifically, the VLM generates BTs with visual condition nodes, where conditions are expressed as free-form text. Another VLM process integrates the text into its prompt and evaluates the conditions against real-world images during robot execution. We validated our framework in a real-world cafe scenario, demonstrating both its feasibility and limitations.
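The abstract's key mechanism, condition nodes whose predicate is free-form text evaluated by a VLM against the current camera image, can be sketched as below. This is an illustrative minimal sketch, not the paper's implementation: the `VisualCondition` and `Sequence` classes, the prompt template, and the `ask_vlm` callable (here replaced by a stub standing in for a real model such as LLaVA or Qwen-VL) are all assumptions for demonstration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class VisualCondition:
    """BT leaf whose condition is free-form text evaluated by a VLM."""
    condition_text: str
    ask_vlm: Callable[[str, bytes], bool]  # (prompt, image) -> truth value

    def tick(self, image: bytes) -> Status:
        # Self-prompting: the free-text condition generated with the tree
        # is embedded into the prompt of a second VLM query at execution time.
        prompt = (
            "Look at the image and answer strictly 'yes' or 'no'. "
            f"Condition: {self.condition_text}"
        )
        return Status.SUCCESS if self.ask_vlm(prompt, image) else Status.FAILURE


@dataclass
class Sequence:
    """Standard BT sequence node: fails on the first failing child."""
    children: List[VisualCondition]

    def tick(self, image: bytes) -> Status:
        for child in self.children:
            if child.tick(image) is Status.FAILURE:
                return Status.FAILURE
        return Status.SUCCESS


def stub_vlm(prompt: str, image: bytes) -> bool:
    # Stand-in for a real multimodal model call; a deployment would send
    # the prompt and image to LLaVA/Qwen-VL and parse the yes/no answer.
    return b"cup" in image if "cup" in prompt else True


tree = Sequence([
    VisualCondition("a cup is on the counter", stub_vlm),
    VisualCondition("the counter is clear of obstacles", stub_vlm),
])
print(tree.tick(b"scene: cup on counter"))  # Status.SUCCESS
```

Because the conditions are plain text rather than hand-coded predicates, the generating VLM can insert or edit them without touching robot-side code, which is what enables the interactive BT editing the abstract describes.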
Problem

Research questions and friction points this paper is trying to address.

Robotics
Visual Information Processing
Decision Making
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Dynamic Task Planning
Adaptive Decision Making