🤖 AI Summary
Current large language model agents predominantly rely on a sequential “reasoning–acting” paradigm, which suffers from limited exploration capacity and insufficient environmental understanding. This work proposes a novel paradigm enabling parallel interaction across multiple environments and cross-trajectory experience sharing. To support this framework, we introduce the DPEPO reinforcement learning algorithm, which integrates supervised fine-tuning with a hierarchical reward mechanism—comprising trajectory success, action diversity, and state-transition diversity rewards—to jointly enhance exploration breadth at both trajectory and step levels while mitigating behavioral redundancy. Evaluated on ALFWorld and ScienceWorld, our approach achieves state-of-the-art success rates while maintaining execution efficiency comparable to strong sequential baselines.
📝 Abstract
Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved strong performance on many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. DPEPO has two stages: an initial supervised fine-tuning (SFT) stage that imparts basic parallel reasoning and action generation, followed by an RL stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two step-level rewards, the Diverse Action Reward and the Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state-of-the-art (SOTA) success rates while maintaining efficiency comparable to strong sequential baselines. Code is available at https://github.com/LePanda026/Code-for-DPEPO
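The abstract names the three reward components but does not give their formulas, so the following is only an illustrative sketch of how such a hierarchical reward could be combined, not the DPEPO implementation. The function names, the distinct-fraction diversity measure, and the weights `w_action`, `w_state`, and `w_step` are all assumptions introduced here for illustration.

```python
from typing import Hashable, Sequence


def diversity_reward(items: Sequence[Hashable]) -> float:
    """Assumed diversity measure: fraction of distinct items across the
    parallel environments at one step (1.0 = all different)."""
    if not items:
        return 0.0
    return len(set(items)) / len(items)


def step_reward(
    actions: Sequence[Hashable],
    transitions: Sequence[Hashable],
    w_action: float = 0.5,  # hypothetical weight
    w_state: float = 0.5,   # hypothetical weight
) -> float:
    """Step-level reward: weighted sum of action diversity and
    state-transition diversity. Redundant behavior scores lower."""
    return (
        w_action * diversity_reward(actions)
        + w_state * diversity_reward(transitions)
    )


def total_reward(
    successes: Sequence[bool],
    step_actions: Sequence[Sequence[Hashable]],
    step_transitions: Sequence[Sequence[Hashable]],
    w_step: float = 0.1,  # hypothetical weight on the step-level term
) -> float:
    """Trajectory-level success rate over the parallel environments plus
    the averaged step-level diversity term."""
    success = sum(successes) / len(successes) if successes else 0.0
    n_steps = len(step_actions)
    if n_steps == 0:
        return success
    step_avg = sum(
        step_reward(a, t) for a, t in zip(step_actions, step_transitions)
    ) / n_steps
    return success + w_step * step_avg


if __name__ == "__main__":
    # Two parallel environments, one step: identical actions,
    # but different (state, next_state) transitions.
    actions = [["open door", "open door"]]
    transitions = [[("hall", "kitchen"), ("hall", "garden")]]
    print(total_reward([True, False], actions, transitions))
```

In this sketch, repeating the same action in every parallel environment lowers the action term, while reaching different next states raises the transition term, which is one simple way to penalize redundancy at the step level while still rewarding task success at the trajectory level.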