🤖 AI Summary
Query-item relevance prediction in e-commerce search faces two key challenges: (1) BERT-based models lack sophisticated reasoning capabilities, while (2) directly deploying large language models (LLMs) suffers from error accumulation in chain-of-thought (CoT) reasoning, discriminative hallucination, and failure to meet stringent online latency requirements. To address these, we propose a three-stage end-to-end framework: supervised fine-tuning on optimized CoT data, offline Pass@N sampling with Direct Preference Optimization (DPO), and difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO), combined with cumulative-probability-based partitioned deployment. Our approach effectively mitigates error propagation and hallucination. Extensive offline evaluations demonstrate consistent superiority over strong baselines, and online side-by-side human evaluation shows significant improvement. The results validate the effectiveness of our LLM-augmented relevance modeling, achieving high accuracy, low latency, and production readiness.
📝 Abstract
Query-product relevance prediction is a core task in e-commerce search. BERT-based models excel at semantic matching but lack complex reasoning capabilities. While Large Language Models (LLMs) have been explored for this task, most approaches still rely on discriminative fine-tuning or distill LLMs into smaller models for deployment. We propose a framework to directly deploy LLMs for this task, addressing key challenges: Chain-of-Thought (CoT) error accumulation, discriminative hallucination, and deployment feasibility. Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) Offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3) Difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO) to mitigate discriminative hallucination. Additionally, post-CoT processing and a cumulative probability-based partitioning method enable efficient online deployment. TaoSR1 significantly outperforms baselines on offline datasets and achieves substantial gains in online side-by-side human evaluations, introducing a novel paradigm for applying CoT reasoning to relevance classification.
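To make stage (2) concrete, here is a minimal sketch of how pass@N sampling could feed DPO preference-pair construction, under assumptions not spelled out in the abstract: we assume the model emits a CoT string plus a final relevance label, and that pairs are formed only when the N samples for an example contain both a correct-label and an incorrect-label generation. The `model` interface and field names are hypothetical, not the paper's actual implementation.

```python
import random


def sample_responses(model, prompt, n):
    # Hypothetical interface: model(prompt) -> (cot_text, predicted_label).
    # In practice this would be N stochastic decodes (temperature > 0).
    return [model(prompt) for _ in range(n)]


def build_dpo_pairs(model, dataset, n=8):
    """Sample N CoT responses per (prompt, gold_label) example and pair a
    correct-label generation (chosen) with an incorrect one (rejected).

    Examples where all N samples agree (all correct or all wrong) yield
    no contrastive signal and are skipped.
    """
    pairs = []
    for prompt, gold in dataset:
        responses = sample_responses(model, prompt, n)
        correct = [r for r in responses if r[1] == gold]
        wrong = [r for r in responses if r[1] != gold]
        if correct and wrong:
            pairs.append({
                "prompt": prompt,
                "chosen": random.choice(correct)[0],   # CoT with correct label
                "rejected": random.choice(wrong)[0],   # CoT with wrong label
            })
    return pairs
```

The resulting `{"prompt", "chosen", "rejected"}` records are the standard input shape for DPO training; the skipping rule is one plausible reading of how pass@N filtering would select examples worth contrasting.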