From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control

📅 2024-05-08

🏛️ IEEE/RSJ International Conference on Intelligent Robots and Systems

📈 Citations: 15

✨ Influential: 1

🤖 AI Summary

In hierarchical robot control, high-level task planners and low-level policies need a well-defined interface. Natural language, the interface that LLMs make available, has two limitations: it cannot easily express tasks that lack a natural verbal decomposition (e.g., a dance routine), and it makes end-to-end finetuning on embodied data difficult because of domain shift and catastrophic forgetting. Method: Latent Codes as Bridges (LCB) is a hierarchical architecture in which a learnable latent code connects the LLM to the low-level policy. The code lets the LLM communicate goals that language alone does not capture, and it allows end-to-end finetuning without destroying the word-token embedding space learned during pre-training. Contribution/Results: On the Language Table and CALVIN benchmarks, LCB outperforms baselines that use pure language as the interface layer, including baselines built on GPT-4V, on tasks that require reasoning and multi-step behaviors.

📝 Abstract

Hierarchical control for robotics has long been plagued by the need to have a well defined interface layer to communicate between high-level task planners and low-level policies. With the advent of LLMs, language has been emerging as a prospective interface layer. However, this has several limitations. Not all tasks can be decomposed into steps that are easily expressible in natural language (e.g. performing a dance routine). Further, it makes end-to-end finetuning on embodied data challenging due to domain shift and catastrophic forgetting. We introduce our method – Latent Codes as Bridges (LCB) – as an alternate architecture to overcome these limitations. LCB uses a learnable latent code to act as a bridge between LLMs and low-level policies. This enables LLMs to flexibly communicate goals in the task plan without being entirely constrained by language limitations. Additionally, it enables end-to-end finetuning without destroying the embedding space of word tokens learned during pre-training. Through experiments on Language Table and Calvin, two common language based benchmarks for embodied agents, we find that LCB outperforms baselines (including those w/ GPT-4V) that leverage pure language as the interface layer on tasks that require reasoning and multi-step behaviors.

Problem

Research questions and friction points this paper is trying to address.

Hierarchical robot control lacks a flexible interface between high-level planners and low-level policies

Pure language interfaces cannot express every task decomposition and make end-to-end finetuning difficult

Whether a learnable latent code can replace pure language as the bridge between LLMs and low-level policies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable latent codes bridge LLMs and policies

Enables flexible goal communication beyond language

Supports end-to-end finetuning without destroying the pre-trained word-token embedding space
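
The bridging idea listed above can be illustrated with a toy sketch. This is not the paper's implementation: the summary does not say how the latent code is produced or consumed, so everything below is an assumption made for illustration. That covers the stand-in `fake_llm_hidden_state`, the linear `LatentBridge` projection, the `LowLevelPolicy` weights, and all dimensions. The only point is the data flow: the high-level model emits a vector rather than a sentence, and the low-level policy conditions on that vector together with its observation.

```python
import math
import random

random.seed(0)

# All sizes are hypothetical and chosen only to keep the example small.
HIDDEN_DIM = 8   # stand-in LLM hidden-state size
LATENT_DIM = 4   # latent code size
OBS_DIM = 3      # robot observation size
ACTION_DIM = 2   # action vector size


def random_matrix(rows, cols):
    """Small random weight matrix as a list of rows."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product for list-based matrices."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fake_llm_hidden_state(instruction):
    """Hypothetical stand-in for an LLM: deterministic text -> vector map."""
    state = [0.0] * HIDDEN_DIM
    for i, ch in enumerate(instruction):
        state[i % HIDDEN_DIM] += ord(ch) / 1000.0
    return [math.tanh(v) for v in state]


class LatentBridge:
    """Assumed learnable projection from LLM hidden state to latent code."""

    def __init__(self):
        self.proj = random_matrix(LATENT_DIM, HIDDEN_DIM)

    def __call__(self, hidden):
        return matvec(self.proj, hidden)


class LowLevelPolicy:
    """Assumed policy: acts on the observation concatenated with the latent."""

    def __init__(self):
        self.weights = random_matrix(ACTION_DIM, OBS_DIM + LATENT_DIM)

    def act(self, observation, latent):
        return [math.tanh(a) for a in matvec(self.weights, observation + latent)]


def control_step(instruction, observation, bridge, policy):
    """One hierarchical step: text -> hidden state -> latent code -> action."""
    hidden = fake_llm_hidden_state(instruction)
    latent = bridge(hidden)
    return latent, policy.act(observation, latent)


bridge = LatentBridge()
policy = LowLevelPolicy()
observation = [0.1, -0.2, 0.3]
latent, action = control_step("push the red block left", observation, bridge, policy)
other_latent, _ = control_step("push the blue block right", observation, bridge, policy)
```

In a trainable version of this sketch, gradients from the policy's loss would flow back through the latent code into the bridge, which is one way to read the claim that the interface can be finetuned end to end without overwriting word-token embeddings.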

🔎 Similar Papers

LGR2: Language Guided Reward Relabeling for Accelerating Hierarchical Reinforcement Learning