🤖 AI Summary
This work proposes an "Answer-First, Reason Later" (AFRL) paradigm to simultaneously achieve millisecond-level response latency, interpretable inference, and effective relevance modeling. The model outputs a relevance score at the first token and subsequently generates a structured explanation. By integrating supervised fine-tuning (SFT) with Stepwise-GRPO reinforcement learning and introducing an SFT auxiliary loss to mitigate mode collapse, the approach ensures stable training. Data quality is further enhanced through automated instruction evolution and multi-stage curriculum learning. Guided by an information-theoretic analysis using KL divergence, knowledge distillation efficiently transfers expert-level reasoning capabilities from a 32B teacher model, which achieves state-of-the-art performance, to a compact 0.6B student model. This enables the small model to retain deep reasoning capacity while meeting stringent low-latency deployment requirements.
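The mode-balanced objective described above (a GRPO-style RL loss plus an SFT auxiliary term) can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the function names (`grpo_loss`, `sft_loss`, `mode_balanced_loss`) and the weighting coefficient `lam` are hypothetical, and the group-relative advantage computation of GRPO is reduced to a plain advantage-weighted policy-gradient term.

```python
import numpy as np

def sft_loss(logits, target_ids):
    """Token-level cross-entropy against expert traces.

    Minimizing this is equivalent to minimizing the forward KL
    divergence from the data distribution (mode-covering).
    """
    # Numerically stable softmax over the vocabulary axis.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(target_ids)), target_ids]))

def grpo_loss(logps, advantages):
    """Simplified advantage-weighted policy-gradient term (mode-seeking).

    A real GRPO implementation would normalize advantages within a
    group of sampled completions; that step is omitted here.
    """
    return -np.mean(logps * advantages)

def mode_balanced_loss(logps, advantages, logits, target_ids, lam=0.1):
    """RL objective anchored by an SFT auxiliary loss (`lam` is hypothetical)."""
    return grpo_loss(logps, advantages) + lam * sft_loss(logits, target_ids)
```

With `lam = 0` the objective reduces to the pure RL term; increasing `lam` pulls the policy back toward the expert distribution, which is the intended safeguard against mode collapse.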
📝 Abstract
Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry. To satisfy the millisecond-level response requirements of online systems while retaining the interpretable reasoning traces of Large Language Models (LLMs), we propose a novel \textbf{Answer-First, Reason Later (AFRL)} paradigm. This paradigm requires the model to output the definitive relevance score in the very first token, followed by a structured logical explanation. Inspired by the success of reasoning models, we adopt a "Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)" pipeline to realize AFRL. However, directly applying existing RL training often leads to \textbf{mode collapse} in the search relevance task, where the model forgets complex long-tail rules in pursuit of high rewards. From an information-theoretic perspective, RL inherently minimizes the \textbf{reverse KL divergence}, which seeks probability peaks (mode-seeking) and is prone to "reward hacking." SFT, on the other hand, minimizes the \textbf{forward KL divergence}, forcing the model to cover the data distribution (mode-covering) and effectively anchoring expert rules. Based on this insight, we propose a \textbf{Mode-Balanced Optimization} strategy that incorporates an SFT auxiliary loss into Stepwise-GRPO training to balance these two properties. Furthermore, we construct an automated instruction-evolution system and a multi-stage curriculum to ensure expert-level data quality. Extensive experiments demonstrate that our 32B teacher model achieves state-of-the-art performance. Moreover, the AFRL architecture enables efficient knowledge distillation, successfully transferring expert-level logic to a 0.6B model and thereby reconciling reasoning depth with deployment latency.
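The forward/reverse KL asymmetry invoked in the abstract can be written explicitly. These are the standard definitions; writing $p$ for the expert (data) distribution and $q_\theta$ for the policy being trained is our notational assumption, and the exact correspondence to the paper's training objective is only sketched here:

$$
D_{\mathrm{KL}}(p \,\|\, q_\theta) = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q_\theta(x)}\right] \quad \text{(forward KL, mode-covering; minimized by SFT)}
$$

$$
D_{\mathrm{KL}}(q_\theta \,\|\, p) = \mathbb{E}_{x \sim q_\theta}\!\left[\log \frac{q_\theta(x)}{p(x)}\right] \quad \text{(reverse KL, mode-seeking; implicitly minimized by RL)}
$$

The forward direction penalizes $q_\theta$ heavily wherever $p$ places mass that $q_\theta$ misses, so the model must cover long-tail expert rules; the reverse direction only samples from $q_\theta$ itself, so the policy can concentrate on a few high-reward modes and drop the tail.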