Video-to-BT: Generating Reactive Behavior Trees from Human Demonstration Videos for Robotic Assembly

📅 2025-09-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional robotic assembly systems suffer from limited flexibility and robustness because they rely on manual programming. Method: This paper proposes an end-to-end closed-loop framework that integrates Vision-Language Models (VLMs) with Behavior Trees (BTs). It automatically parses semantic intent, decomposes subtasks, and generates executable BTs directly from human demonstration videos, unifying VLM-driven high-level cognitive planning with low-level reactive control for the first time. It further introduces a VLM-guided online failure-detection and replanning mechanism for real-time adaptation under dynamic disturbances. Results: Evaluated on real-world assembly tasks, the system demonstrates high planning reliability, long-horizon stability, strong environmental robustness, and cross-task generalization.
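
For readers new to the BT terminology used above, here is a minimal, self-contained Python sketch of a reactive BT for a single assembly subtask. It is illustrative only: the node classes and the stubbed predicates and skills (`peg_inserted`, `move_to_hole`, and so on) are assumptions, not the paper's implementation.

```python
from enum import Enum, auto

class Status(Enum):
    SUCCESS = auto()
    FAILURE = auto()
    RUNNING = auto()

class Sequence:
    """Ticks children left to right; stops at the first non-SUCCESS child."""
    def __init__(self, *children):
        self.children = children
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status is not Status.SUCCESS:
                return status
        return Status.SUCCESS

class Selector:
    """Fallback node: ticks children left to right until one does not fail."""
    def __init__(self, *children):
        self.children = children
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status is not Status.FAILURE:
                return status
        return Status.FAILURE

class Condition:
    """Wraps a boolean check, re-evaluated on every tick (the source of reactivity)."""
    def __init__(self, predicate):
        self.predicate = predicate
    def tick(self):
        return Status.SUCCESS if self.predicate() else Status.FAILURE

class Action:
    """Wraps a skill call that reports SUCCESS, FAILURE, or RUNNING."""
    def __init__(self, fn):
        self.fn = fn
    def tick(self):
        return self.fn()

# World-state and skill stubs (assumptions standing in for perception/control):
def peg_inserted():  return False            # goal predicate
def peg_grasped():   return True             # precondition
def move_to_hole():  return Status.SUCCESS
def push_peg_in():   return Status.SUCCESS

# Hypothetical "insert peg" subtree: the goal condition is re-checked on every
# tick, so a disturbance that undoes progress causes the tree to re-act.
insert_peg = Selector(
    Condition(peg_inserted),                 # goal already holds? then succeed
    Sequence(
        Condition(peg_grasped),              # precondition guard
        Action(move_to_hole),
        Action(push_peg_in),
    ),
)

print(insert_peg.tick())  # Status.SUCCESS
```

Because the goal and precondition checks are re-evaluated on every tick, such a tree reacts to disturbances (e.g., a dropped peg) by falling back to the appropriate action instead of blindly continuing a fixed script.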

📝 Abstract
Modern manufacturing demands robotic assembly systems with enhanced flexibility and reliability. However, traditional approaches often rely on programming tailored to each product by experts for fixed settings, which are inherently inflexible to product changes and lack the robustness to handle variations. As Behavior Trees (BTs) are increasingly used in robotics for their modularity and reactivity, we propose a novel hierarchical framework, Video-to-BT, that seamlessly integrates high-level cognitive planning with low-level reactive control, with BTs serving both as the structured output of planning and as the governing structure for execution. Our approach leverages a Vision-Language Model (VLM) to decompose human demonstration videos into subtasks, from which Behavior Trees are generated. During the execution, the planned BTs combined with real-time scene interpretation enable the system to operate reactively in the dynamic environment, while VLM-driven replanning is triggered upon execution failure. This closed-loop architecture ensures stability and adaptivity. We validate our framework on real-world assembly tasks through a series of experiments, demonstrating high planning reliability, robust performance in long-horizon assembly tasks, and strong generalization across diverse and perturbed conditions. Project website: https://video2bt.github.io/video2bt_page/
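
To make the closed-loop architecture in the abstract concrete, below is a minimal sketch of the plan-execute-replan cycle. `query_vlm_for_plan` and `execute_subtask` are hypothetical stubs; the paper's actual VLM prompting and skill layer are not reproduced here.

```python
import random

def query_vlm_for_plan(video, feedback=None):
    """Hypothetical stand-in for the VLM planner: parses a demonstration video
    (plus failure feedback on replans) into an ordered list of subtasks."""
    return ["pick peg", "align peg with hole", "insert peg"]

def execute_subtask(name):
    """Stub executor; a real system would tick the subtask's BT to completion.
    Fails randomly here just to exercise the replanning path."""
    return random.random() > 0.2

def closed_loop_execute(video, max_replans=3):
    plan = query_vlm_for_plan(video)
    for _ in range(max_replans + 1):
        # Run subtasks in order, stopping at the first failure (if any).
        failed = next((s for s in plan if not execute_subtask(s)), None)
        if failed is None:
            return True                          # every subtask succeeded
        # VLM-driven replanning: report what failed and request a new plan.
        plan = query_vlm_for_plan(video, feedback=f"failed at: {failed}")
    return False

print(closed_loop_execute("assembly_demo.mp4"))
```
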
Problem

Research questions and friction points this paper is trying to address.

Generating reactive behavior trees from human demonstration videos for robotic assembly
Overcoming inflexibility of traditional programming approaches for product variations
Integrating high-level cognitive planning with low-level reactive control using BTs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates Behavior Trees from human demonstration videos
Uses a Vision-Language Model for subtask decomposition and planning (see the sketch after this list)
Enables reactive execution with real-time VLM-driven replanning
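
The following is a rough illustration of how a VLM's subtask decomposition could be turned into a generated BT, reusing the `Status`, `Action`, and `Sequence` classes from the earlier sketch. The JSON schema and skill names are assumptions, not the paper's actual output format.

```python
import json

# Hypothetical structured VLM output: an ordered subtask decomposition of the
# demonstration video (the schema is an assumption).
vlm_output = json.loads("""
[
  {"skill": "pick",   "object": "peg"},
  {"skill": "align",  "object": "peg", "target": "hole"},
  {"skill": "insert", "object": "peg", "target": "hole"}
]
""")

def subtask_to_node(subtask):
    """Map one parsed subtask to an Action node."""
    label = " ".join(str(v) for v in subtask.values())
    def run(label=label):
        print("executing:", label)   # a real node would invoke a robot skill
        return Status.SUCCESS
    return Action(run)

# The generated BT runs the demonstrated subtasks in order; adding per-subtask
# goal/precondition guards (as in the earlier sketch) is what makes it reactive.
generated_bt = Sequence(*[subtask_to_node(s) for s in vlm_output])
generated_bt.tick()
```
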
Authors

Xiwei Zhao
Munich Institute of Robotics and Machine Intelligence (MIRMI), Technical University of Munich, Germany

Yiwei Wang
Munich Institute of Robotics and Machine Intelligence (MIRMI), Technical University of Munich, Germany

Yansong Wu
TUM
Robotics, tactile manipulation, robot learning, behavior tree

Fan Wu
Shanghai University, Shanghai, China

Teng Sun
Shandong University
Multimedia computing, information retrieval, causal inference

Zhonghua Miao
Shanghai University, Shanghai, China

Sami Haddadin
MBZUAI
Robotics, AI, Control, Neurotech, Automating Science

Alois Knoll
Technische Universität München
Robotics, AI, Sensor Data Fusion, Autonomous Driving, Cyber Physical Systems