Answer First, Reason Later: Aligning Search Relevance via Mode-Balanced Reinforcement Learning

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an "Answer-First, Reason-Later" (AFRL) paradigm to simultaneously achieve millisecond-level response latency, interpretable inference, and effective relevance modeling. The model outputs a relevance score at the first token and subsequently generates a structured explanation. By integrating supervised fine-tuning (SFT) with Stepwise-GRPO reinforcement learning and introducing an SFT auxiliary loss to mitigate mode collapse, the approach ensures stable training. Data quality is further enhanced through automated instruction evolution and multi-stage curriculum learning. Guided by an information-theoretic analysis using KL divergence, knowledge distillation efficiently transfers expert-level reasoning capabilities from a 32B teacher model to a compact 0.6B student model. This enables the small model to retain deep reasoning capacity while meeting stringent low-latency deployment requirements, achieving state-of-the-art performance.
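The answer-first step described above can be sketched in a few lines: the relevance grade is read off the logits of the *first* decoded token, so online serving can return immediately while the explanation tokens stream afterwards. The grade-to-token mapping and function names below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def first_token_relevance(logits, label_token_ids):
    """Read the relevance score from the FIRST decoded token.

    `logits` is the model's logit vector at the first output position;
    `label_token_ids` maps relevance grades (e.g. 0/1/2) to the token ids
    that encode them. Both names are hypothetical, for illustration only.
    """
    grades = list(label_token_ids.keys())
    probs = softmax(logits[list(label_token_ids.values())])
    best = int(np.argmax(probs))
    # Online serving can return here at millisecond latency; the structured
    # explanation is generated afterwards (or skipped entirely at inference).
    return grades[best], float(probs[best])

# Toy example: a 10-token vocabulary where ids 3, 5, 7 encode grades 0, 1, 2.
logits = np.array([0.1, 0.2, 0.0, 1.0, 0.0, 2.5, 0.0, 0.3, 0.0, 0.0])
grade, p = first_token_relevance(logits, {0: 3, 1: 5, 2: 7})
# grade is 1 here, since token id 5 carries the highest logit among the labels.
```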

📝 Abstract
Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry. To satisfy the millisecond-level response requirements of online systems while retaining the interpretable reasoning traces of Large Language Models (LLMs), we propose a novel Answer-First, Reason Later (AFRL) paradigm. This paradigm requires the model to output the definitive relevance score in the very first token, followed by a structured logical explanation. Inspired by the success of reasoning models, we adopt a "Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)" pipeline to achieve AFRL. However, directly applying existing RL training often leads to mode collapse in the search relevance task, where the model forgets complex long-tail rules in pursuit of high rewards. From an information-theoretic perspective, RL inherently minimizes the Reverse KL divergence, which tends to seek probability peaks (mode-seeking) and is prone to "reward hacking." SFT, on the other hand, minimizes the Forward KL divergence, forcing the model to cover the data distribution (mode-covering) and effectively anchoring expert rules. Based on this insight, we propose a Mode-Balanced Optimization strategy, incorporating an SFT auxiliary loss into Stepwise-GRPO training to balance these two properties. Furthermore, we construct an automated instruction evolution system and a multi-stage curriculum to ensure expert-level data quality. Extensive experiments demonstrate that our 32B teacher model achieves state-of-the-art performance. Moreover, the AFRL architecture enables efficient knowledge distillation, successfully transferring expert-level logic to a 0.6B model, thereby reconciling reasoning depth with deployment latency.
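The mode-seeking vs. mode-covering asymmetry the abstract invokes can be reproduced with a toy numpy experiment: fit a single unimodal distribution to a bimodal target by minimizing each divergence. Minimizing the reverse KL locks the fit onto one peak (as RL-style objectives do), while minimizing the forward KL spreads it across both (as SFT does). The specific distributions and grid here are illustrative, not from the paper.

```python
import numpy as np

x = np.linspace(-4.0, 4.0, 801)

def normal(mu, sigma):
    """Discretized Gaussian bump on the grid, normalized to sum to 1."""
    d = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return d / d.sum()

def kl(p, q):
    """KL(p || q) for discrete distributions (natural log)."""
    return float(np.sum(p * np.log(p / q)))

# "Expert" distribution: two rule clusters (a bimodal target).
p = 0.5 * normal(-2.0, 0.4) + 0.5 * normal(2.0, 0.4)

# Constrained student family: a single unimodal bump; only its mean moves.
mus = np.linspace(-3.0, 3.0, 121)
rev = [kl(normal(m, 0.5), p) for m in mus]   # reverse KL: RL-like objective
fwd = [kl(p, normal(m, 0.5)) for m in mus]   # forward KL: SFT-like objective

mu_rev = mus[int(np.argmin(rev))]   # locks onto one peak (mode-seeking)
mu_fwd = mus[int(np.argmin(fwd))]   # sits between peaks (mode-covering)
```

Placing the student's mass in the low-density valley is catastrophic under reverse KL (it pays for its own mass wherever the target is near zero), so the reverse-KL optimum sits on one of the two peaks; the forward-KL optimum lands at the target's mean, covering both. This is exactly the trade-off the SFT auxiliary loss is meant to balance inside Stepwise-GRPO.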
Problem

Research questions and friction points this paper is trying to address.

search relevance
latency
reasoning
mode collapse
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Answer-First Reason Later
Mode-Balanced Reinforcement Learning
Reverse KL Divergence
Knowledge Distillation
Search Relevance
Shijie Zhang
Qwen Applications Business Group, Alibaba Group; Peking University
Xiang Guo
Yale
Rujun Guo
Qwen Applications Business Group, Alibaba Group
Shaoyu Liu
Qwen Applications Business Group, Alibaba Group
Xiaozhao Wang
Qwen Applications Business Group, Alibaba Group
Guanjun Jiang
Qwen Applications Business Group, Alibaba Group
Kevin Zhang
Peking University