Aligning Large Language Models with Searcher Preferences

πŸ“… 2026-03-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work proposes SearchLLM to address key challenges in open-ended generative search: robustness to noisy retrieval, safety and compliance, and alignment with user intent. It introduces the first large language model designed specifically for open-ended generative search, featuring a hierarchical, multi-dimensional reward mechanism that separates hard bottom-line constraints from behavior-optimization objectives. The reward model evaluates responses conditioned on the user query, session history, and retrieved evidence, and a Gated Aggregation Strategy derives the training reward used to optimize SearchLLM with Group Relative Policy Optimization (GRPO). Safety and user experience are jointly optimized through an LLM-based evaluator calibrated via rule-based verification and human annotation. Deployed in RedNote's AI search entry, SearchLLM achieved a 1.03% increase in Valid Consumption Rate and a 2.81% reduction in Re-search Rate, while meeting stringent safety and reliability standards.

πŸ“ Abstract
The paradigm shift from item-centric ranking to answer-centric synthesis is redefining the role of search engines. While recent industrial progress has applied generative techniques to closed-set item ranking in e-commerce, research and deployment of open-ended generative search on large content platforms remain limited. This setting introduces challenges, including robustness to noisy retrieval, non-negotiable safety guarantees, and alignment with diverse user needs. In this work, we introduce SearchLLM, the first large language model (LLM) for open-ended generative search. We design a hierarchical, multi-dimensional reward system that separates bottom-line constraints, including factual grounding, basic answer quality and format compliance, from behavior optimization objectives that promote robustness to noisy retrieval and alignment with user needs. Concretely, our reward model evaluates responses conditioned on the user query, session history, and retrieved evidence set, combining rule-based checks with human-calibrated LLM judges to produce an interpretable score vector over these dimensions. We introduce a Gated Aggregation Strategy to derive the training reward for optimizing SearchLLM with Group Relative Policy Optimization (GRPO). We deploy SearchLLM in the AI search entry of RedNote. Offline evaluations and online A/B tests show improved generation quality and user engagement, increasing Valid Consumption Rate by 1.03% and reducing Re-search Rate by 2.81%, while upholding strict safety and reliability standards.
Problem

Research questions and friction points this paper is trying to address.

generative search
searcher preferences
noisy retrieval
safety guarantees
user alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

SearchLLM
generative search
multi-dimensional reward system
Gated Aggregation Strategy
Group Relative Policy Optimization
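Group Relative Policy Optimization, listed above, scores each sampled response relative to the other responses in its sampling group, normalizing rewards by the group mean and standard deviation instead of using a learned critic. A minimal sketch of that advantage computation (variable names are illustrative):

```python
from statistics import mean, stdev

def group_relative_advantages(group_rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its group: (r - mean) / (std + eps).

    `group_rewards` holds the scalar rewards of all responses sampled
    for one query; `eps` guards against a zero standard deviation.
    """
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```

Responses rewarded above the group average get positive advantages and are reinforced; below-average ones get negative advantages, so the policy update needs no value model.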