TaoSR-AGRL: Adaptive Guided Reinforcement Learning Framework for E-commerce Search Relevance

📅 2025-10-09
🤖 AI Summary
Problem: E-commerce search struggles to model relevance for long-tail and complex query-item pairs, where conventional methods show weak multi-step reasoning. Reinforcement learning approaches such as GRPO also converge slowly because their terminal rewards are sparse. Method: The paper proposes two mechanisms. Rule-aware reward shaping decomposes the final relevance judgment into dense, structured, stepwise reasoning rewards aligned with domain-specific rules. Adaptive guided replay identifies low-accuracy rollouts during training and injects ground-truth guidance to correct erroneous reasoning paths. The framework jointly trains large language models end-to-end using supervised fine-tuning (SFT), direct preference optimization (DPO), and group relative policy optimization (GRPO), with explicit rule embedding. Contribution/Results: In offline experiments, the method consistently outperforms DPO and GRPO baselines, with higher relevance accuracy, better rule adherence, and more stable training. The method has been deployed in Taobao’s primary search engine, serving hundreds of millions of users.

📝 Abstract
Query-product relevance prediction is fundamental to e-commerce search and has become even more critical in the era of AI-powered shopping, where semantic understanding and complex reasoning directly shape the user experience and business conversion. Large Language Models (LLMs) enable generative, reasoning-based approaches, typically aligned via supervised fine-tuning (SFT) or preference optimization methods like Direct Preference Optimization (DPO). However, the increasing complexity of business rules and user queries exposes the inability of existing methods to endow models with robust reasoning capacity for long-tail and challenging cases. Efforts to address this via reinforcement learning strategies like Group Relative Policy Optimization (GRPO) often suffer from sparse terminal rewards, offering insufficient guidance for multi-step reasoning and slowing convergence. To address these challenges, we propose TaoSR-AGRL, an Adaptive Guided Reinforcement Learning framework for LLM-based relevance prediction in Taobao Search Relevance. TaoSR-AGRL introduces two key innovations: (1) Rule-aware Reward Shaping, which decomposes the final relevance judgment into dense, structured rewards aligned with domain-specific relevance criteria; and (2) Adaptive Guided Replay, which identifies low-accuracy rollouts during training and injects targeted ground-truth guidance to steer the policy away from stagnant, rule-violating reasoning patterns toward compliant trajectories. TaoSR-AGRL was evaluated on large-scale real-world datasets and through online side-by-side human evaluations on Taobao Search. It consistently outperforms DPO and standard GRPO baselines in offline experiments, improving relevance accuracy, rule adherence, and training stability. The model trained with TaoSR-AGRL has been successfully deployed in the main search scenario on Taobao, serving hundreds of millions of users.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM reasoning for e-commerce search relevance
Addressing sparse rewards in reinforcement learning for relevance
Improving rule adherence in query-product semantic matching
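
The sparse-reward friction point can be illustrated with the group-relative advantage that GRPO-style training relies on, where each rollout's reward is normalized against the other rollouts sampled for the same query-item pair. The sketch below is a minimal illustration, not the paper's implementation: the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions. It shows that when every rollout in a group receives the same terminal reward, the advantages collapse to zero and the policy gets no gradient signal, whereas partial credit for intermediate steps restores a usable ranking.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style).

    `eps` is an assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hard query: every rollout ends with the wrong label,
# so the binary terminal reward is 0 for the whole group.
sparse = [0.0, 0.0, 0.0, 0.0]
sparse_adv = group_relative_advantages(sparse)

# Same rollouts with partial credit for intermediate steps
# (illustrative values, not taken from the paper).
dense = [0.0, 0.25, 0.5, 0.25]
dense_adv = group_relative_advantages(dense)

print(sparse_adv)  # every entry is 0.0: nothing to learn from
print(dense_adv)   # rollouts are now ranked relative to each other
```

With the dense rewards, the rollout that got the most intermediate steps right receives a positive advantage even though its final label is still wrong, which is the kind of guidance a terminal-only reward cannot provide.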
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rule-aware reward shaping aligns with domain criteria
Adaptive guided replay injects ground-truth guidance
Framework improves training stability and rule adherence
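
The two ideas above can be sketched together under loud assumptions. The abstract does not specify the relevance criteria, the reward weights, the accuracy threshold, or how guidance is injected, so everything named here (`Rollout`, `shaped_reward`, `needs_guidance`, `guided_prompt`, the `final_weight` and `threshold` values, and the label strings) is hypothetical and chosen only to make the mechanics concrete.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    step_labels: dict  # hypothetical intermediate judgments per criterion
    final_label: str   # final relevance label produced by the model

def shaped_reward(rollout, gold_steps, gold_label, final_weight=0.5):
    """Blend final-label correctness with per-criterion step correctness."""
    hits = [rollout.step_labels.get(k) == v for k, v in gold_steps.items()]
    step_score = sum(hits) / len(hits) if hits else 0.0
    final_score = 1.0 if rollout.final_label == gold_label else 0.0
    return final_weight * final_score + (1.0 - final_weight) * step_score

def needs_guidance(rollouts, gold_label, threshold=0.25):
    """Flag a prompt whose rollout group is mostly wrong on the final label."""
    correct = sum(r.final_label == gold_label for r in rollouts)
    return correct / len(rollouts) < threshold

def guided_prompt(prompt, gold_steps):
    """Append ground-truth step judgments as a hint before re-sampling."""
    hints = "; ".join(f"{k}: {v}" for k, v in gold_steps.items())
    return f"{prompt}\n[Guidance] {hints}"

# Illustrative example: criteria names and labels are assumptions.
gold_steps = {"category": "match", "attribute": "mismatch"}
gold_label = "partially relevant"
group = [Rollout({"category": "match", "attribute": "match"}, "relevant")
         for _ in range(4)]

reward = shaped_reward(group[0], gold_steps, gold_label)
replay = needs_guidance(group, gold_label)
prompt = guided_prompt("query: red cotton dress | item: red silk dress",
                       gold_steps)
print(reward, replay)
print(prompt)
```

In this toy group every rollout gets the final label wrong but one of two criteria right, so the shaped reward gives partial credit while the group is still flagged for a guided re-sample. A real system would need a policy for when to withdraw the guidance so the model does not come to depend on it.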
Jianhui Yang
Tsinghua University, Beijing, China
Yiming Jin
University of California, San Diego
Pengkun Jiao
Fudan University, Shanghai, China
Chenhe Dong
Alibaba
Deep Learning, NLP
Zerui Huang
Taobao & Tmall Group of Alibaba, Hangzhou, China
Shaowei Yao
Alibaba Group & Peking University
Xiaojiang Zhou
Taobao & Tmall Group of Alibaba, Beijing, China
Dan Ou
Taobao & Tmall Group of Alibaba, Hangzhou, China
Haihong Tang
Alibaba-inc
Information Retrieval, Recommendation Systems, Natural Language Processing, Image & Video Processing and Understanding, Data Mining