Aligning Large Language Models with Searcher Preferences

πŸ“… 2026-03-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work proposes SearchLLM to address key challenges in open-ended generative search: robustness to noisy retrieval, safety and compliance, and alignment with user intent. It introduces the first large language model designed specifically for open-ended generative search, featuring a hierarchical, multi-dimensional reward mechanism that separates hard bottom-line constraints from behavior-optimization objectives. The reward model evaluates responses conditioned on the user query, session history, and retrieved evidence, and a Gated Aggregation Strategy derives the training reward used to optimize SearchLLM with Group Relative Policy Optimization (GRPO). Safety and user experience are jointly optimized through an LLM-based evaluator calibrated via rule-based verification and human annotation. Deployed in RedNote's AI search entry, SearchLLM achieved a 1.03% increase in Valid Consumption Rate and a 2.81% reduction in Re-search Rate, while meeting stringent safety and reliability standards.

πŸ“ Abstract
The paradigm shift from item-centric ranking to answer-centric synthesis is redefining the role of search engines. While recent industrial progress has applied generative techniques to closed-set item ranking in e-commerce, research and deployment of open-ended generative search on large content platforms remain limited. This setting introduces challenges, including robustness to noisy retrieval, non-negotiable safety guarantees, and alignment with diverse user needs. In this work, we introduce SearchLLM, the first large language model (LLM) for open-ended generative search. We design a hierarchical, multi-dimensional reward system that separates bottom-line constraints, including factual grounding, basic answer quality and format compliance, from behavior optimization objectives that promote robustness to noisy retrieval and alignment with user needs. Concretely, our reward model evaluates responses conditioned on the user query, session history, and retrieved evidence set, combining rule-based checks with human-calibrated LLM judges to produce an interpretable score vector over these dimensions. We introduce a Gated Aggregation Strategy to derive the training reward for optimizing SearchLLM with Group Relative Policy Optimization (GRPO). We deploy SearchLLM in the AI search entry of RedNote. Offline evaluations and online A/B tests show improved generation quality and user engagement, increasing Valid Consumption Rate by 1.03% and reducing Re-search Rate by 2.81%, while upholding strict safety and reliability standards.
Problem

Research questions and friction points this paper is trying to address.

generative search
searcher preferences
noisy retrieval
safety guarantees
user alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

SearchLLM
generative search
multi-dimensional reward system
Gated Aggregation Strategy
Group Relative Policy Optimization
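Group Relative Policy Optimization, listed above, scores each sampled response relative to the other responses in its sampling group, normalizing rewards by the group mean and standard deviation instead of using a learned critic. A minimal sketch of that advantage computation (variable names are illustrative):

```python
from statistics import mean, stdev

def group_relative_advantages(group_rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its group: (r - mean) / (std + eps).

    `group_rewards` holds the scalar rewards of all responses sampled
    for one query; `eps` guards against a zero standard deviation.
    """
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```

Responses rewarded above the group average get positive advantages and are reinforced; below-average ones get negative advantages, so the policy update needs no value model.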