🤖 AI Summary
This work addresses key challenges in deploying enterprise AI agents—namely, low-quality or scarce data, complex reasoning requirements, difficulties with self-play, and the absence of reliable feedback. To overcome these limitations, we propose a lightweight, model-agnostic offline reinforcement learning framework that integrates a digital twin Markov decision process (DT-MDP) with contrastive inverse reinforcement learning. This approach efficiently recovers reward functions from trajectories of mixed quality and uses them to optimize contextual prompting for large language model (LLM) agents, thereby improving their decision-making. Evaluated on enterprise IT automation tasks, the method consistently and significantly outperforms existing baselines across multiple evaluation settings.
📝 Abstract
Despite rapid progress in AI agents for enterprise automation and decision-making, their real-world deployment and further performance gains remain constrained by limited data quality and quantity, complex real-world reasoning demands, difficulties with self-play, and the lack of reliable feedback signals. To address these challenges, we propose a lightweight, model-agnostic framework for improving LLM-based enterprise agents via offline reinforcement learning (RL). The proposed Context Engineering via DT-MDP (DT-MDP-CE) framework comprises three key components: (1) a Digital-Twin Markov Decision Process (DT-MDP), which abstracts the agent's reasoning behavior as a finite MDP; (2) a robust contrastive inverse RL procedure that, armed with the DT-MDP, efficiently estimates a well-founded reward function and induces policies from mixed-quality offline trajectories; and (3) RL-guided context engineering, which uses the policy obtained from (1) and (2) to improve the agent's decision-making behavior. As a case study, we apply the framework to a representative task in the enterprise-oriented domain of IT automation. Extensive experimental results demonstrate consistent and significant improvements over baseline agents across a wide range of evaluation settings, suggesting that the framework can generalize to other enterprise agents with similar characteristics.
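The three-component pipeline described above can be illustrated with a minimal sketch. Everything below is a toy assumption for exposition, not the paper's implementation: a 4-state, 2-action finite DT-MDP stands in for the abstracted reasoning behavior, the "contrastive inverse RL" step is reduced to a log-ratio score of how often each state-action pair appears in successful versus failed offline trajectories, and "context engineering" is a lookup from the recovered policy's preferred action to a prompt hint.

```python
import numpy as np

# Toy DT-MDP (illustrative assumption): 4 abstract reasoning states,
# 2 actions; state 3 plays the role of "task solved".
N_S, N_A = 4, 2
GAMMA = 0.9

def step(s, a):
    """Deterministic toy dynamics: action 1 advances, action 0 stays."""
    return min(s + 1, N_S - 1) if a == 1 else s

# Mixed-quality offline trajectories: (list of (state, action), outcome).
# In practice the outcome label would come from task-completion signals.
good = [([(0, 1), (1, 1), (2, 1)], 1) for _ in range(20)]
bad = [([(0, 0), (0, 0), (0, 0)], 0) for _ in range(20)]
dataset = good + bad

# Simplified contrastive reward recovery: score each (s, a) by its
# relative frequency in successful vs. failed trajectories
# (add-one smoothing to avoid log(0)).
pos = np.ones((N_S, N_A))
neg = np.ones((N_S, N_A))
for traj, label in dataset:
    for s, a in traj:
        (pos if label else neg)[s, a] += 1
reward = np.log(pos / pos.sum()) - np.log(neg / neg.sum())

# Induce a policy on the DT-MDP via value iteration on the recovered reward.
V = np.zeros(N_S)
for _ in range(100):
    Q = np.array([[reward[s, a] + GAMMA * V[step(s, a)]
                   for a in range(N_A)] for s in range(N_S)])
    V = Q.max(axis=1)
policy = Q.argmax(axis=1)

# RL-guided context engineering (hypothetical hints): translate the
# policy's preferred action in the current abstract state into a prompt
# fragment for the LLM agent.
HINTS = {0: "Re-examine the current diagnosis before acting.",
         1: "Proceed to the next remediation step."}

def context_hint(state):
    return HINTS[int(policy[state])]
```

Under these toy labels the recovered reward favors the advancing action in the early states, so `context_hint(0)` returns the "proceed" prompt; the point is only to show how the three components compose, not to reproduce the paper's estimator.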