🤖 AI Summary
Planning for multi-robot collaborative execution of complex manipulation tasks in unstructured environments remains challenging due to scene ambiguity, instruction variability, and stringent safety and coordination constraints.
Method: This paper proposes an intent-driven end-to-end planning framework: (1) unified scene–instruction representation via vision-language joint encoding; (2) an LLM-integrated generator producing action sequences under action-format constraints, temporal dependencies, and multi-arm collaboration requirements; and (3) deterministic consistency filtering coupled with verifiability constraints to ensure action safety and logical correctness.
Results: Evaluated on flexible battery disassembly for electric vehicles across 200 real-world scenes and 600 natural-language instructions, the method outperforms five baselines, improving full-sequence accuracy by 23.6% and next-step prediction accuracy by 18.4%, while also reducing user cognitive load and execution latency.
📝 Abstract
This paper addresses the problem of planning complex manipulation tasks, in which multiple robots with different end-effectors and capabilities, informed by computer vision, must plan and execute concatenated sequences of actions on a variety of objects that can appear in arbitrary positions and configurations in unstructured scenes. We propose an intent-driven planning pipeline that can robustly construct such action sequences with varying degrees of supervisory input from a human, given simple language instructions. The pipeline integrates: (i) perception-to-text scene encoding, (ii) an ensemble of large language models (LLMs) that generate candidate removal sequences based on the operator's intent, (iii) an LLM-based verifier that enforces formatting and precedence constraints, and (iv) a deterministic consistency filter that rejects hallucinated objects. The pipeline is evaluated on an example task in which two robot arms work collaboratively to dismantle an electric vehicle (EV) battery for recycling applications. A variety of components must be grasped and removed in specific sequences, determined by human instructions and/or by task-order feasibility decisions made by the autonomous system. On 200 real scenes with 600 operator prompts across five component classes, we used metrics of full-sequence correctness and next-task correctness to evaluate and compare five LLM-based planners (including ablation analyses of pipeline components). We also evaluated the LLM-based human interface in human-participant experiments, measuring time to execution and NASA-TLX workload. Results indicate that our ensemble-with-verification approach reliably maps operator intent to safe, executable multi-robot plans while maintaining low user effort.
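The four pipeline stages (i)–(iv) can be sketched as a single control flow. This is a minimal illustrative sketch, not the authors' implementation: all function names, data formats, and the stubbed "ensemble" are assumptions, and a real system would replace `generate_candidates` with actual LLM queries.

```python
# Illustrative sketch of the four-stage pipeline from the abstract.
# Stages: (i) scene encoding, (ii) ensemble generation (stubbed),
# (iii) constraint verification, (iv) deterministic consistency filtering.
# All names and formats here are hypothetical.

def encode_scene(detections):
    """(i) Perception-to-text: render detected objects as a scene description."""
    return "Scene contains: " + ", ".join(detections)

def generate_candidates(scene_text, instruction, n_models=3):
    """(ii) Ensemble of LLMs proposing removal sequences.
    Stubbed with fixed candidates; a real system queries n_models LLMs."""
    return [["unscrew bolt", "lift cover", "remove cell"] for _ in range(n_models)]

def verify(plan, precedence):
    """(iii) Verifier: every (a, b) pair means action a must precede action b."""
    pos = {action: i for i, action in enumerate(plan)}
    return all(a in pos and b in pos and pos[a] < pos[b] for a, b in precedence)

def consistency_filter(plan, scene_objects):
    """(iv) Deterministic filter: reject steps that mention no detected object,
    i.e. hallucinated objects."""
    return all(any(obj in step for obj in scene_objects) for step in plan)

def plan_task(detections, instruction, precedence):
    scene_text = encode_scene(detections)
    candidates = generate_candidates(scene_text, instruction)
    valid = [p for p in candidates
             if verify(p, precedence) and consistency_filter(p, detections)]
    if not valid:
        return None  # no safe, consistent plan survives
    # Majority vote over surviving candidates.
    return max(valid, key=lambda p: sum(q == p for q in valid))
```

The key design point mirrored here is that generation is unconstrained while safety comes from the downstream deterministic checks, so a hallucinated or order-violating candidate is filtered out rather than executed.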