DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current large language model agents predominantly rely on a sequential “reasoning–acting” paradigm, which suffers from limited exploration capacity and insufficient environmental understanding. This work proposes a paradigm in which an agent interacts with multiple environments in parallel and shares experience across trajectories. To support this paradigm, it introduces DPEPO, a two-stage reinforcement learning algorithm: supervised fine-tuning first, followed by reinforcement learning with a hierarchical reward mechanism (a trajectory-level success reward plus step-level action-diversity and state-transition-diversity rewards) that broadens exploration at both the trajectory and step levels while penalizing behavioral redundancy. Evaluated on ALFWorld and ScienceWorld, the approach achieves state-of-the-art success rates while maintaining execution efficiency comparable to strong sequential baselines.

📝 Abstract

Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. DPEPO has two stages: an initial supervised fine-tuning (SFT) stage that imparts basic parallel reasoning and action generation, followed by a reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two step-level rewards, the Diverse Action Reward and the Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state-of-the-art (SOTA) success rates while maintaining efficiency comparable to strong sequential baselines. (Code is available at https://github.com/LePanda026/Code-for-DPEPO)
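
The abstract names the three reward components but does not give their formulas. As a rough illustration only, the sketch below shows one way a trajectory-level success term could be combined with two step-level diversity terms. Everything in it is an assumption made for illustration and is not taken from the paper or its repository: the function names `distinct_fraction` and `hierarchical_reward`, the use of "fraction of distinct items" as the diversity measure, the additive combination, and the weights `w_action` and `w_transition`.

```python
from typing import Hashable, Sequence


def distinct_fraction(items: Sequence[Hashable]) -> float:
    """Fraction of distinct items among the parallel environments at one step.

    Assumed diversity measure: 1.0 means every environment did something
    different; 1/n means all n environments did the same thing.
    """
    if not items:
        return 0.0
    return len(set(items)) / len(items)


def hierarchical_reward(
    success: bool,
    actions_per_step: Sequence[Sequence[Hashable]],
    transitions_per_step: Sequence[Sequence[Hashable]],
    w_action: float = 0.1,      # hypothetical weight, not from the paper
    w_transition: float = 0.1,  # hypothetical weight, not from the paper
) -> float:
    """Combine a trajectory-level success term with two step-level terms.

    actions_per_step[t] holds the action chosen in each parallel environment
    at step t; transitions_per_step[t] holds the matching
    (state, action, next_state) tuples. Redundant actions or transitions
    lower the step-level terms.
    """
    n_steps = len(actions_per_step)
    if n_steps == 0:
        return float(success)
    action_div = sum(distinct_fraction(a) for a in actions_per_step) / n_steps
    transition_div = (
        sum(distinct_fraction(t) for t in transitions_per_step) / n_steps
    )
    return float(success) + w_action * action_div + w_transition * transition_div


# Toy rollout: three parallel environments, two steps.
actions = [
    ["look", "look", "open drawer 1"],                   # two redundant actions
    ["go to desk 1", "go to shelf 1", "go to bed 1"],    # all distinct
]
transitions = [
    [("s0", "look", "s0"), ("s0", "look", "s0"), ("s0", "open drawer 1", "s1")],
    [("s0", "go to desk 1", "s2"), ("s0", "go to shelf 1", "s3"),
     ("s1", "go to bed 1", "s4")],
]
reward = hierarchical_reward(True, actions, transitions)
```

In this toy rollout the first step repeats `look` in two environments, so it contributes less to both diversity terms than the second step, where every environment takes a different action. The actual reward definitions should be checked against the paper and the linked repository.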

Problem

Research questions and friction points this paper is trying to address.

LLM-based agents

limited exploration

environmental understanding

sequential interaction

single-environment interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

parallel exploration

reinforcement learning

diverse action reward