TaoSR-SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance

📅 2025-10-09
🤖 AI Summary
To address weak generalization on long-tail queries, lack of controllability in reasoning, and the absence of fine-grained supervision in e-commerce search relevance, this paper proposes TaoSR-SHE, a Stepwise Hybrid Examination reinforcement learning framework. Its core is Stepwise Reward Policy Optimization (SRPO), which combines a generative stepwise reward model with a human-annotated offline verifier to deliver step-level feedback on the reasoning chain. The framework also uses diversified data filtering to mitigate policy entropy collapse and multi-stage curriculum learning to build capability progressively. Experiments on real-world e-commerce search data show that the method outperforms supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO), improving relevance-prediction accuracy and reasoning quality while also enhancing interpretability and robustness.

📝 Abstract
Query-product relevance analysis is a foundational technology in e-commerce search engines and has become increasingly important in AI-driven e-commerce. The recent emergence of large language models (LLMs), particularly their chain-of-thought (CoT) reasoning capabilities, offers promising opportunities for developing relevance systems that are both more interpretable and more robust. However, existing training paradigms have notable limitations: SFT and DPO suffer from poor generalization on long-tail queries and from a lack of fine-grained, stepwise supervision to enforce rule-aligned reasoning. In contrast, reinforcement learning with verification rewards (RLVR) suffers from sparse feedback, which provides insufficient signal to correct erroneous intermediate steps, thereby undermining logical consistency and limiting performance in complex inference scenarios. To address these challenges, we introduce the Stepwise Hybrid Examination Reinforcement Learning framework for Taobao Search Relevance (TaoSR-SHE). At its core is Stepwise Reward Policy Optimization (SRPO), a reinforcement learning algorithm that leverages step-level rewards generated by a hybrid of a high-quality generative stepwise reward model and a human-annotated offline verifier, prioritizing learning from critical correct and incorrect reasoning steps. TaoSR-SHE further incorporates two key techniques: diversified data filtering to encourage exploration across varied reasoning paths and mitigate policy entropy collapse, and multi-stage curriculum learning to foster progressive capability growth. Extensive experiments on real-world search benchmarks show that TaoSR-SHE improves both reasoning quality and relevance-prediction accuracy in large-scale e-commerce settings, outperforming SFT, DPO, GRPO, and other baselines, while also enhancing interpretability and robustness.
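The hybrid step-level reward described in the abstract can be pictured with a small sketch. The abstract does not specify how the two reward sources are combined, so everything here is an assumption for illustration: the rule that a human-annotated offline verdict overrides the generative reward model, the `[-1, 1]` reward range, and the hypothetical names `Step`, `hybrid_step_rewards`, `offline_labels`, and `reward_model`. It is not the authors' SRPO implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    text: str  # one reasoning step from the policy's chain of thought


def hybrid_step_rewards(
    query_id: str,
    steps: List[Step],
    offline_labels: Dict[Tuple[str, int], bool],
    reward_model: Callable[[str, List[str]], float],
) -> List[float]:
    """Return one reward in [-1, 1] per reasoning step.

    Assumption (not from the paper): when a human-annotated verdict exists
    for a step, it overrides the generative reward model for that step.
    """
    rewards: List[float] = []
    history: List[str] = []
    for i, step in enumerate(steps):
        verdict = offline_labels.get((query_id, i))
        if verdict is not None:
            # Offline verifier: hard +1 / -1 signal from human annotation.
            r = 1.0 if verdict else -1.0
        else:
            # Fallback: score from a (hypothetical) generative reward model.
            r = reward_model(step.text, history)
        rewards.append(max(-1.0, min(1.0, r)))  # clip to [-1, 1]
        history.append(step.text)
    return rewards


def toy_reward_model(step_text: str, history: List[str]) -> float:
    # Stand-in scorer used only to make the sketch runnable.
    return 0.5 if "brand" in step_text else -0.5


steps = [
    Step("query asks for brand X"),
    Step("item is in category Y"),
    Step("so the item is relevant"),
]
labels = {("q1", 2): False}  # annotators marked the final step as wrong
print(hybrid_step_rewards("q1", steps, labels, toy_reward_model))
# → [0.5, -0.5, -1.0]
```

Unlike a single outcome reward, each step receives its own signal, so an erroneous intermediate step can be penalized even when the steps around it are sound.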
Problem

Research questions and friction points this paper is trying to address.

Addresses poor generalization on long-tail e-commerce queries
Provides fine-grained stepwise supervision for rule-aligned reasoning
Mitigates sparse reward feedback in reinforcement learning for complex inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stepwise Reward Policy Optimization with hybrid rewards
Diversified data filtering to prevent entropy collapse
Multi-stage curriculum learning for progressive capability growth
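The summary does not say how diversified data filtering is carried out. One possible reading, shown purely as an illustration, is to keep only training queries whose sampled reasoning paths are distinct and disagree on the final label, so that the policy keeps seeing varied rollouts. The function name `keep_diverse_groups` and the rollout keys `"reasoning"` and `"label"` are hypothetical, and the filtering rule itself is an assumption rather than the paper's procedure.

```python
from typing import Dict, List


def keep_diverse_groups(rollouts: Dict[str, List[dict]]) -> Dict[str, List[dict]]:
    """Keep queries whose sampled rollouts disagree on the final label.

    Each rollout is a dict with hypothetical keys "reasoning" and "label".
    This rule is an illustrative assumption, not the paper's method.
    """
    kept: Dict[str, List[dict]] = {}
    for query, samples in rollouts.items():
        # Drop verbatim duplicate reasoning chains, preserving order.
        seen = set()
        unique = []
        for s in samples:
            if s["reasoning"] not in seen:
                seen.add(s["reasoning"])
                unique.append(s)
        # A group in which every rollout reaches the same label
        # offers little variety to learn from, so it is skipped.
        if len({s["label"] for s in unique}) > 1:
            kept[query] = unique
    return kept


data = {
    "q1": [
        {"reasoning": "a", "label": 1},
        {"reasoning": "a", "label": 1},
        {"reasoning": "b", "label": 0},
    ],
    "q2": [
        {"reasoning": "c", "label": 1},
        {"reasoning": "d", "label": 1},
    ],
}
result = keep_diverse_groups(data)
print(list(result))  # → ['q1']
```

Here `q2` is dropped because both rollouts agree, while `q1` is kept with its duplicate reasoning chain removed.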
Pengkun Jiao
Fudan University, Shanghai, China
Yiming Jin
University of California, San Diego
Jianhui Yang
Tsinghua University, Beijing, China
Chenhe Dong
Alibaba
Deep Learning, NLP
Zerui Huang
Taobao & Tmall Group of Alibaba, Hangzhou, China
Shaowei Yao
Alibaba Group & Peking University
Xiaojiang Zhou
Taobao & Tmall Group of Alibaba, Beijing, China
Dan Ou
Taobao & Tmall Group of Alibaba, Hangzhou, China
Haihong Tang
Alibaba-inc
Information retrieval, Recommendation System, Natural language processing, Image & Video Processing and understanding, Data mining