AI Summary
Long-horizon robotic assembly tasks involve long temporal dependencies and complex part interrelationships, making behavior tree (BT) planning heavily reliant on manual design and therefore inefficient and hard to scale. This paper proposes LLM-as-BT-Planner, the first systematic framework leveraging large language models (LLMs) to autonomously generate semantically valid, modular BTs. We introduce four novel in-context learning strategies and empirically validate the efficacy of supervised fine-tuning for small-scale LLMs on BT generation. To ensure executability and interpretability, our method jointly enforces BT syntactic constraints and instruction semantic parsing. Extensive experiments in simulation and on real robotic platforms demonstrate significant improvements over state-of-the-art LLM-based planners and handcrafted BT baselines: +28.6% task success rate, +34.1% BT structural accuracy, and enhanced execution robustness.
Abstract
Robotic assembly tasks remain an open challenge due to their long-horizon nature and complex part relations. Behavior trees (BTs) are increasingly used in robot task planning for their modularity and flexibility, but creating them manually is effort-intensive. Large language models (LLMs) have recently been applied to robotic task planning to generate action sequences, yet their ability to generate BTs has not been fully investigated. To this end, we propose LLM-as-BT-Planner, a novel framework that leverages LLMs for BT generation in robotic assembly task planning. Four in-context learning methods are introduced to exploit the natural language processing and inference capabilities of LLMs for producing task plans in BT format, reducing manual effort while ensuring robustness and comprehensibility. Additionally, we evaluate the performance of fine-tuned smaller LLMs on the same tasks. Experiments in both simulated and real-world settings demonstrate that our framework enhances LLMs' ability to generate BTs, improving success rate through in-context learning and supervised fine-tuning.
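For readers unfamiliar with the plan representation being generated, the following is a minimal sketch of a behavior tree with standard Sequence and Selector (fallback) composites. The node classes, action names, and the toy assembly plan are illustrative assumptions, not the paper's actual implementation or API.

```python
# Minimal behavior-tree sketch: Sequence and Selector composites over
# Action leaves. Everything here is illustrative, not the paper's code.
from enum import Enum


class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3


class Action:
    """Leaf node wrapping a primitive robot skill (stubbed as a callable)."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def tick(self):
        return self.fn()


class Sequence:
    """Composite: succeeds only if all children succeed, ticked in order."""
    def __init__(self, children):
        self.children = children

    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != Status.SUCCESS:
                return status  # propagate FAILURE or RUNNING immediately
        return Status.SUCCESS


class Selector:
    """Composite (fallback): returns the first non-FAILURE child result."""
    def __init__(self, children):
        self.children = children

    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != Status.FAILURE:
                return status
        return Status.FAILURE


# Hypothetical assembly sub-plan: try two grasp strategies via a Selector,
# then insert the part; the first grasp fails, so the fallback is used.
plan = Sequence([
    Selector([
        Action("grasp_top", lambda: Status.FAILURE),   # first attempt fails
        Action("grasp_side", lambda: Status.SUCCESS),  # fallback succeeds
    ]),
    Action("insert_peg", lambda: Status.SUCCESS),
])
print(plan.tick())  # Status.SUCCESS
```

The modularity the abstract refers to comes from this compositional structure: an LLM-generated plan can swap or nest subtrees (e.g. the grasp Selector) without touching the rest of the tree.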