🤖 AI Summary
To address low sample efficiency and weak compositional generalization in language-instructed reinforcement learning agents for multi-task settings, this paper proposes CERLLA, a framework that integrates compositional policy representations with a semantic parser trained using reinforcement learning and in-context learning to achieve compositional generalization to novel tasks. Methodologically, it combines reinforcement learning, semantic parsing, in-context learning, and function approximation to map natural language instructions to executable policies. Evaluated on 162 tasks designed to test compositional generalization, CERLLA reaches a 92% success rate, equal to an oracle policy's upper-bound performance, whereas the non-compositional baseline reaches only 80% with the same number of environment steps; CERLLA also learns in fewer steps. Its core contribution is a unified architecture that jointly enables structured semantic understanding and policy composition for language–action alignment in embodied agents.
📝 Abstract
Combining reinforcement learning with language grounding is challenging because the agent must explore the environment while simultaneously learning multiple language-conditioned tasks. To address this, we introduce a novel method: the compositionally-enabled reinforcement learning language agent (CERLLA). Our method reduces the sample complexity of tasks specified with language by leveraging compositional policy representations and a semantic parser trained using reinforcement learning and in-context learning. We evaluate our approach in an environment requiring function approximation and demonstrate compositional generalization to novel tasks. Our method significantly outperforms the previous best non-compositional baseline in terms of sample complexity on 162 tasks designed to test compositional generalization. Our model attains a higher success rate and learns in fewer steps than the non-compositional baseline: it reaches a success rate equal to an oracle policy's upper-bound performance of 92%, whereas with the same number of environment steps the baseline reaches a success rate of only 80%.