🤖 AI Summary
Large language models (LLMs) excel at natural-language reasoning but are typically confined to textual action spaces: interactions with external environments must be emitted as text in predefined formats and mapped to environment actions by hand-crafted external parsers, tightly coupling reasoning with control. This work proposes an Expanded Action space (ExpA) that internalizes external environment operations, such as symbolic computation and simulator invocation, as native model actions, decoupling language understanding from control execution and enabling dynamic environment switching and multi-turn autonomous decision-making. The model reasons in the default language environment, can trigger routing actions to switch to an external environment, invokes environment-specific actions there, receives feedback, and may route back to language. To promote effective exploration of this expanded action space, the accompanying ExpA Reinforcement Learning (EARL) framework trains the model with counterfactual policy optimization. On tasks requiring multi-turn interaction and contingent planning, EARL substantially outperforms strong vocabulary-constrained baselines: it performs robustly on calculator-based multi-task learning and, on the partially observed sorting problem, achieves perfect Sort-4 accuracy while self-discovering an efficient algorithm competitive with classical designs.
📝 Abstract
Large Language Models (LLMs) are powerful reasoners in natural language, but their actions are typically confined to outputting vocabulary tokens. As a result, interactions with external environments -- such as symbolic operators or simulators -- must be expressed through text in predefined formats, parsed, and routed to external interfaces. This overloads the model's language with both reasoning and control duties, and requires a hand-crafted parser, external to the LLM. To address this, we decouple environment interactions from language by internalizing them in an Expanded Action space (ExpA), beyond the vocabulary. The model starts reasoning in the default language environment, but may trigger routing actions and switch to an external environment at any time. From there, the model can only invoke environment-specific actions, receive feedback from the environment, and potentially route back to language as a result. To promote effective exploration of the expanded action space and new environments, we introduce ExpA Reinforcement Learning (EARL) with counterfactual policy optimization. On tasks requiring multi-turn interactions and contingent planning, EARL outperforms strong baselines with vocabulary-constrained actions. It performs robustly across calculator-based multi-task learning and, in the partially observed sorting problem, achieves perfect Sort-4 accuracy while self-discovering an efficient algorithm competitive with classical designs.
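The interaction pattern the abstract describes can be made concrete with a minimal sketch. This is not the paper's implementation: the policy here is a scripted stand-in for the model, and `Calculator`, the `("ROUTE", ...)` action encoding, and the context representation are all illustrative assumptions. It only shows the control flow: the agent starts in the language environment, a routing action switches the active environment, environment-specific actions produce feedback appended to the context, and a second routing action returns to language.

```python
class Calculator:
    """Toy external environment standing in for a symbolic operator."""
    def step(self, expr):
        # Illustrative only: evaluate an arithmetic expression as feedback.
        return eval(expr, {"__builtins__": {}})

def expa_loop(policy, environments, prompt, max_steps=20):
    """Sketch of an expanded-action control loop (assumed interfaces)."""
    context = list(prompt)
    env = None  # None = the default language environment
    for _ in range(max_steps):
        action = policy(context, env)
        if action[0] == "ROUTE":
            # Routing action: switch environments; routing to None
            # returns control to the language environment.
            env = environments.get(action[1])
        elif env is not None:
            # Inside an external environment only environment-specific
            # actions are available; feedback flows back into the context.
            feedback = env.step(action[1])
            context.append(("FEEDBACK", feedback))
        elif action[0] == "TOKEN":
            # Ordinary vocabulary action in the language environment.
            context.append(("TOKEN", action[1]))
            if action[1] == "<eos>":
                break
    return context

# Scripted "policy" answering "What is 2+3?" via the calculator environment.
script = [("ROUTE", "calc"), ("CALC", "2+3"), ("ROUTE", None),
          ("TOKEN", "5"), ("TOKEN", "<eos>")]
policy = lambda context, env: script.pop(0)
out = expa_loop(policy, {"calc": Calculator()},
                [("TOKEN", "What is 2+3?")])
```

In a trained model these routing and environment actions would be sampled from an action head alongside vocabulary tokens, which is what removes the need for an external parser over the model's text output.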