ChatShopBuddy: Towards Reliable Conversational Shopping Agents via Reinforcement Learning

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses multi-objective coordination in conversational shopping agents deployed in real-world scenarios, where product accuracy, persuasiveness, response quality, and tool-use efficiency must be optimized jointly. The authors propose Hierarchical Reward Modeling (HRM) and Dynamic Contrastive Policy Optimization (DCPO), along with SmartShopBench, the first hierarchical evaluation benchmark tailored to this task. HRM integrates multidimensional reward signals through conditional gating, while DCPO improves reasoning efficiency and policy stability via dynamic trajectory selection. Experiments show that the resulting ChatShopBuddy system outperforms larger general-purpose reasoning models across multiple metrics with greater robustness, supporting its practical effectiveness in real-world applications.
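The summary above says HRM combines objective, subjective, outcome, and process rewards through conditional gating that reflects their logical dependencies. The paper does not spell out the gating rule, so the sketch below is only one plausible reading: the objective signal (product correctness) acts as a hard gate, and the softer signals contribute only once it is satisfied. The function name, weights, and the specific gate are illustrative assumptions, not the authors' formula.

```python
def hierarchical_reward(product_correct: bool,
                        persuasiveness: float,
                        response_quality: float,
                        tool_efficiency: float,
                        w_subj: float = 0.4,
                        w_outcome: float = 0.4,
                        w_process: float = 0.2) -> float:
    """Hypothetical hierarchical reward with conditional gating.

    Objective correctness gates everything else: recommending the
    wrong product zeroes the reward regardless of how persuasive or
    well-written the response is. When the gate passes, subjective,
    outcome, and process signals (each assumed to be in [0, 1]) are
    added on top of a base reward for correctness.
    """
    if not product_correct:  # gate: wrong product nullifies soft rewards
        return 0.0
    return (1.0
            + w_subj * persuasiveness
            + w_outcome * response_quality
            + w_process * tool_efficiency)
```

Under this reading, the gate is what makes the reward "hierarchical" rather than a flat weighted sum: a flat sum would let a fluent but incorrect answer still score well, which is exactly the failure mode the gating is meant to prevent.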

📝 Abstract
Conversational shopping agents represent a critical consumer-facing application of Large Language Model (LLM)-powered agents, yet how to effectively apply post-training Reinforcement Learning (RL) to optimize such agents remains underexplored. This work investigates RL-based optimization for shopping agents in real-world scenarios, where agents must simultaneously satisfy multiple interdependent objectives spanning objective metrics (product correctness), subjective qualities (persuasiveness), outcome rewards (final response quality), and process rewards (tool efficiency). We present a complete methodology to address this challenge. Specifically, we first construct SmartShopBench, a benchmark that captures diverse shopping intents with a hierarchical evaluation that decomposes complex quality requirements into measurable levels. Building on this evaluation framework, we design Hierarchical Reward Modeling (HRM) to structure mixed reward types through conditional gating that reflects their logical dependencies. To enable efficient training, we further propose Dynamic Contrastive Policy Optimization (DCPO), which balances response quality with operational efficiency through dynamic trajectory selection based on reward and reasoning length. Extensive experiments demonstrate that our RL-trained agent, namely ChatShopBuddy, consistently outperforms larger models relying on generic reasoning, achieving superior stability rather than merely higher peaks. Our work provides valuable guidance for applying RL to real-world conversational agents.
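The abstract describes DCPO as selecting trajectories dynamically "based on reward and reasoning length" to balance quality with efficiency. A minimal sketch of that selection step, assuming sampled trajectories are scored by reward minus a length penalty and the best and worst form a contrastive pair, might look as follows. The scoring rule, penalty coefficient, and pair construction are assumptions for illustration; the paper's actual selection criterion may differ.

```python
def select_contrastive_pair(trajectories: list[dict],
                            length_penalty: float = 0.01) -> tuple[dict, dict]:
    """Hypothetical dynamic trajectory selection for contrastive training.

    Each trajectory is a dict with a scalar 'reward' and a 'length'
    (reasoning tokens). Score = reward - penalty * length, so among
    equally rewarded rollouts the more concise one is preferred as
    the 'chosen' example, discouraging needlessly long reasoning.
    Returns (chosen, rejected): the best- and worst-scoring rollouts.
    """
    scored = sorted(
        trajectories,
        key=lambda t: t["reward"] - length_penalty * t["length"],
        reverse=True,
    )
    return scored[0], scored[-1]


# Example: two correct rollouts of different lengths plus a poor one.
rollouts = [
    {"reward": 1.0, "length": 50},    # correct and concise
    {"reward": 1.0, "length": 200},   # correct but verbose
    {"reward": 0.2, "length": 100},   # low-quality response
]
chosen, rejected = select_contrastive_pair(rollouts)
```

The length term is what ties "response quality" to "operational efficiency" in this sketch: without it, the two equally rewarded rollouts would be interchangeable, and the policy would have no pressure toward shorter reasoning.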
Problem

Research questions and friction points this paper is trying to address.

Conversational Shopping Agents
Reinforcement Learning
Multi-objective Optimization
Large Language Models
Reward Modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Reward Modeling
Dynamic Contrastive Policy Optimization
SmartShopBench
Reinforcement Learning
Conversational Shopping Agents