SAGE: Scalable AI Governance & Evaluation

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of detecting high-impact relevance failures in large-scale search systems, where limited human annotation resources render conventional user-behavior-based or sparse-review approaches insufficient. To this end, we propose SAGE, a novel framework that introduces a co-evolutionary calibration mechanism among policies, precedents, and large language model (LLM)-based proxy judges. This mechanism transforms subjective relevance judgments into actionable, multi-dimensional scoring criteria and enables cost-effective, scalable deployment through teacher–student distillation. By systematically resolving semantic ambiguities, SAGE generates high-quality, scalable evaluation signals. Deployment in LinkedIn’s search system demonstrates that SAGE effectively identifies model regressions undetectable by traditional metrics, resulting in a 0.25% increase in daily active users.
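The summary's core idea — transforming subjective relevance judgment into multi-dimensional scoring criteria — can be pictured as a weighted aggregation of per-dimension scores from a proxy judge. The sketch below is purely illustrative: the dimension names, scale, and weights are hypothetical, not taken from the paper.

```python
# Illustrative sketch of multi-dimensional rubric scoring, as an LLM-based
# proxy judge might emit. Dimensions and weights here are hypothetical.
from dataclasses import dataclass

@dataclass
class RubricScore:
    dimension: str
    score: int      # e.g. 1 (poor) .. 5 (excellent)
    weight: float

def aggregate(scores: list[RubricScore]) -> float:
    """Weighted average of per-dimension judge scores."""
    total_weight = sum(s.weight for s in scores)
    return sum(s.score * s.weight for s in scores) / total_weight

scores = [
    RubricScore("intent_match", 4, 0.5),
    RubricScore("result_quality", 3, 0.3),
    RubricScore("freshness", 5, 0.2),
]
print(aggregate(scores))  # 3.9
```

Scoring each dimension separately (rather than asking the judge for a single holistic grade) is what makes the criteria actionable: a regression can be attributed to the dimension whose score dropped.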

📝 Abstract
Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained human oversight and the high-throughput requirements of production systems. While traditional approaches rely on engagement proxies or sparse manual review, these methods often fail to capture the full scope of high-impact relevance failures. We present SAGE (Scalable AI Governance & Evaluation), a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. At the core of SAGE is a bidirectional calibration loop in which a natural-language Policy, curated Precedent, and an LLM Surrogate Judge co-evolve. SAGE systematically resolves semantic ambiguities and misalignments, transforming subjective relevance judgment into an executable, multi-dimensional rubric with near human-level agreement. To bridge the gap between frontier-model reasoning and industrial-scale inference, we apply teacher-student distillation to transfer high-fidelity judgments into compact student surrogates at 92× lower cost. Deployed within LinkedIn Search ecosystems, SAGE guided model iteration through simulation-driven development, distilling policy-aligned models for online serving and enabling rapid offline evaluation. In production, it powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics. Collectively, these drove a 0.25% lift in LinkedIn daily active users.
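The teacher-student distillation the abstract describes can be pictured as label transfer: an expensive frontier-model judge labels unreviewed query-document pairs, and a compact student is then trained on those labels so that bulk inference runs at a fraction of the cost. The sketch below is a hypothetical illustration only; `teacher_judge` is a toy stand-in for the frontier judge, and no actual student training is shown.

```python
# Hypothetical sketch of the label-transfer step in teacher-student
# distillation. The teacher is expensive but accurate; its labels become
# the student's training data.
def teacher_judge(query: str, doc: str) -> int:
    # Toy stand-in for a frontier-model rubric judgment (1 = relevant).
    return 1 if query.lower() in doc.lower() else 0

def distill(pairs):
    """Label unreviewed (query, doc) traffic with the teacher's judgments."""
    return [(q, d, teacher_judge(q, d)) for q, d in pairs]

pairs = [
    ("data engineer", "Data Engineer at Acme"),
    ("nurse", "Software Intern"),
]
labeled = distill(pairs)
print(labeled)  # labels: 1 for the first pair, 0 for the second
```

In a production setting the labeled set would be used to fine-tune a small student model, which then serves high-throughput offline and online evaluation in place of the teacher.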
Problem

Research questions and friction points this paper is trying to address.

relevance evaluation
AI governance
human oversight
large-scale search systems
evaluation gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable AI Governance
LLM Surrogate Judge
Policy-Precedent Calibration
Teacher-Student Distillation
Relevance Evaluation
Benjamin Le
LinkedIn
Xueying Lu
LinkedIn
Nick Stern
LinkedIn
Wenqiong Liu
LinkedIn
Igor Lapchuk
LinkedIn
Xiang Li
LinkedIn
Baofen Zheng
LinkedIn
Kevin Rosenberg
LinkedIn
Jiewen Huang
Yale University
Database, Cloud Computing
Zhe Zhang
LinkedIn
Abraham Cabangbang
LinkedIn
Satej Milind Wagle
LinkedIn
Jianqiang Shen
Palo Alto Research Center (PARC)
Artificial Intelligence, Machine Learning, Intelligent Systems, Speech Recognition, Natural Language Processing
Raghavan Muthuregunathan
LinkedIn
Abhinav Gupta
Research Scientist, Google DeepMind
Reinforcement Learning, Natural Language Processing, Machine Learning, Artificial Intelligence
Mathew Teoh
LinkedIn
Andrew Kirk
UKAEA
Fusion
Thomas Kwan
LinkedIn
Jingwei Wu
LinkedIn
Wenjing Zhang
LinkedIn