🤖 AI Summary
LLM-based robot control relies heavily on manually engineered in-context examples and is prone to unpredictable hallucinations, hindering reliable language-to-action mapping.
Method: We propose a zero-shot natural language-to-control policy transfer framework that replaces conventional in-context learning with a combination of LLM semantic embeddings, inverse optimal control (IOC), and multi-task learning. Task similarity is quantified in embedding space using real teleoperated demonstrations, enabling end-to-end mapping from linguistic instructions to robot action policies.
Contribution/Results: The framework makes hallucinations detectable before task execution and supports few-shot adaptation. Evaluated on both simulated and real-world robotic-arm tabletop manipulation tasks, it significantly improves cross-task generalization and, critically, provides the first demonstration of robust language-driven robot control without in-context examples.
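The hallucination-detectability claim can be illustrated with a minimal sketch (not the paper's actual pipeline): if a new instruction's embedding is far from every demonstrated task's embedding, the predicted policy is unsupported by the training distribution and can be flagged *before* execution. The embeddings, tasks, and threshold below are hypothetical stand-ins for real LLM sentence embeddings.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def flag_out_of_distribution(query_emb, task_embs, threshold=0.8):
    """Return (best similarity, flag): flag is True if the query instruction
    is dissimilar to every demonstrated task, i.e. a likely hallucination."""
    best = max(cosine_sim(query_emb, t) for t in task_embs)
    return best, best < threshold

# Hypothetical 4-d embeddings standing in for real LLM instruction embeddings.
demo_tasks = [np.array([1.0, 0.0, 0.0, 0.0]),   # e.g. "pick up the red cube"
              np.array([0.9, 0.4, 0.0, 0.0])]   # e.g. "pick up the blue cube"
near = np.array([0.95, 0.2, 0.0, 0.0])          # instruction close to the demos
far  = np.array([0.0, 0.0, 1.0, 0.0])           # unrelated instruction

print(flag_out_of_distribution(near, demo_tasks))  # high similarity, not flagged
print(flag_out_of_distribution(far, demo_tasks))   # low similarity, flagged
```

Flagging happens purely from embeddings, so no robot motion is needed to detect an unsupported instruction.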
📝 Abstract
The integration of large language models (LLMs) with control systems has demonstrated significant potential in various settings, such as task completion with a robotic manipulator. A main reason for this success is the ability of LLMs to perform in-context learning, which, however, depends heavily on task examples that are well designed and closely related to the target task. Consequently, employing LLMs to formulate optimal control problems often requires task examples containing explicit mathematical expressions, designed by trained engineers. Furthermore, there is often no principled way to assess hallucinations before task execution. To address these challenges, we propose DEMONSTRATE, a novel methodology that avoids using LLMs to generate complex optimization problems and instead relies only on the embedding representations of task descriptions. To do so, we leverage tools from inverse optimal control to replace in-context prompt examples with task demonstrations, together with the concept of multitask learning, which ensures similarity between target and example tasks by construction. Since hardware demonstrations can easily be collected through teleoperation or by guiding the robot, our approach significantly reduces the reliance on engineering expertise for designing in-context examples. Furthermore, the enforced multitask structure enables learning from few demonstrations and assessing hallucinations prior to task execution. We demonstrate the effectiveness of our method through simulation and hardware experiments involving a robotic arm tasked with tabletop manipulation.
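The multitask idea in the abstract can be sketched under simplifying assumptions (this is an illustration, not the paper's implementation): suppose IOC has recovered cost parameters θᵢ for each demonstrated task, each paired with its instruction embedding eᵢ. A shared linear map W fitted across tasks then sends a *new* instruction's embedding directly to cost parameters, with no in-context examples. All dimensions and data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 demonstrated tasks, 6-d instruction embeddings,
# 3-d cost parameters per task (assumed recovered via inverse optimal control).
E = rng.normal(size=(20, 6))         # instruction embeddings e_i (rows)
W_true = rng.normal(size=(6, 3))     # unknown ground-truth embedding-to-cost map
Theta = E @ W_true                   # cost parameters theta_i from IOC

# Multitask structure: fit ONE shared linear map W by ridge regression over
# all demonstrated tasks, instead of solving each task in isolation.
lam = 1e-6
W = np.linalg.solve(E.T @ E + lam * np.eye(6), E.T @ Theta)

# Zero-shot transfer: an unseen instruction's embedding is mapped straight
# to cost parameters, which would then define the downstream control problem.
e_new = rng.normal(size=6)
theta_new = e_new @ W
```

The shared map is what enables few-shot adaptation: adding one new demonstration only refines W rather than requiring a hand-engineered example for the new task.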