🤖 AI Summary
In hierarchical robot control, the interface between high-level task planners and low-level policies is a persistent bottleneck. Natural language, the interface favored by recent LLM-based approaches, cannot easily express some tasks (e.g., dance motions), and it makes end-to-end fine-tuning on embodied data difficult because of domain shift and catastrophic forgetting. Method: We propose Latent Codes as Bridges (LCB), a hierarchical control architecture in which a learnable latent code connects the LLM to the low-level policy. The latent code lets the LLM communicate goals that are hard to put into words, and it allows end-to-end fine-tuning without destroying the word-token embedding space learned during pre-training. Contribution/Results: On the Language Table and CALVIN benchmarks, LCB outperforms baselines that use pure language as the interface layer, including baselines built on GPT-4V, on tasks that require reasoning and multi-step behaviors.
📝 Abstract
Hierarchical control for robotics has long been plagued by the need for a well-defined interface layer to communicate between high-level task planners and low-level policies. With the advent of LLMs, language has been emerging as a prospective interface layer. However, language as an interface has several limitations. Not all tasks can be decomposed into steps that are easily expressible in natural language (e.g., performing a dance routine). Further, it makes end-to-end finetuning on embodied data challenging due to domain shift and catastrophic forgetting. We introduce our method, Latent Codes as Bridges (LCB), as an alternative architecture to overcome these limitations. LCB uses a learnable latent code to act as a bridge between LLMs and low-level policies. This enables LLMs to flexibly communicate goals in the task plan without being entirely constrained by language limitations. Additionally, it enables end-to-end finetuning without destroying the embedding space of word tokens learned during pre-training. Through experiments on Language Table and CALVIN, two common language-based benchmarks for embodied agents, we find that LCB outperforms baselines (including those with GPT-4V) that leverage pure language as the interface layer on tasks that require reasoning and multi-step behaviors.
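To make the bridge idea concrete, here is a toy sketch in plain Python. It is not the LCB implementation: the three-dimensional embeddings, the linear `bridge` matrix, the scalar linear policy, and the squared-error loss are all illustrative assumptions. It only shows the pattern the abstract describes, in which the policy's training signal updates a learnable latent interface while the pre-trained word embeddings stay untouched.

```python
# Toy sketch only: every name, size, and value here is a hypothetical stand-in.
DIM = 3

# Stand-in for pre-trained word embeddings; never updated below.
WORD_EMBEDDINGS = {
    "push": (0.5, -0.2, 0.1),
    "red": (0.0, 0.4, -0.3),
    "block": (0.3, 0.1, 0.2),
}


def pooled_embedding(instruction):
    """Average the frozen word embeddings of a whitespace-separated instruction."""
    vectors = [WORD_EMBEDDINGS[word] for word in instruction.split()]
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]


def latent_code(bridge, instruction):
    """Map the pooled language feature to a latent code with a learnable matrix."""
    h = pooled_embedding(instruction)
    return [sum(bridge[i][j] * h[j] for j in range(DIM)) for i in range(DIM)]


def policy(latent, observation):
    """Toy low-level policy: a scalar action from the latent code and observation."""
    return sum(z * o for z, o in zip(latent, observation))


def finetune_step(bridge, instruction, observation, target, lr=0.5):
    """One gradient step on 0.5 * (action - target)**2 with respect to the bridge only."""
    h = pooled_embedding(instruction)
    error = policy(latent_code(bridge, instruction), observation) - target
    # d(action) / d(bridge[i][j]) = observation[i] * h[j]
    return [
        [bridge[i][j] - lr * error * observation[i] * h[j] for j in range(DIM)]
        for i in range(DIM)
    ]


def train(steps=200):
    """Fit the bridge on one toy example and return it with the final absolute error."""
    bridge = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
    instruction, observation, target = "push red block", (1.0, 0.5, -1.0), 0.8
    for _ in range(steps):
        bridge = finetune_step(bridge, instruction, observation, target)
    return bridge, abs(policy(latent_code(bridge, instruction), observation) - target)
```

Calling `train()` returns the fitted bridge matrix and the remaining action error. Because gradients are applied only to `bridge`, `WORD_EMBEDDINGS` is identical before and after training, which is the property the abstract attributes to using a latent code rather than language tokens as the interface.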