Taming Variability: Randomized and Bootstrapped Conformal Risk Control for LLMs

📅 2025-09-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the hallucination and unreliable-evaluation issues arising from the inherent stochasticity of large language model (LLM) outputs. We propose a user-controllable risk-constrained framework centered on Conformal Risk Control (CRC), implemented at the API layer via two complementary mechanisms: Batched Bootstrap CRC (BB-CRC) and Randomized Batched Weighted-Average CRC (RBWA-CRC). Without requiring labeled data or model-specific assumptions, CRC transforms LLM output uncertainty into statistically reliable decisions satisfying user-specified risk bounds (e.g., hallucination rate ≤ 5%). Innovatively, we leverage Gram matrix geometry for unsupervised semantic quantification, enabling interpretable risk signals and metrics. Extensive evaluation across four benchmark datasets demonstrates substantial improvements in factual accuracy and evaluator consistency, alongside enhanced threshold stability and reduced computational overhead. To our knowledge, this is the first framework offering statistical rigor, model-agnosticism, and engineering practicality for LLM output risk management.

📝 Abstract
We transform the randomness of LLMs into precise assurances using an actuator at the API interface that applies a user-defined risk constraint in finite samples via Conformal Risk Control (CRC). This label-free and model-agnostic actuator manages ship/abstain/escalate actions based solely on a scalar score from opaque outputs. We enhance CRC's computational efficiency and robustness through Batched Bootstrap CRC (BB-CRC) and Randomized Batched Weighted-Average CRC (RBWA-CRC), reducing calibration calls and stabilizing thresholds while maintaining statistical validity. Additionally, we present a semantic quantification method grounded in Gram matrix geometry, resulting in interpretable signal and metric design. Together these pieces deliver principled randomness control for LLM hallucination mitigation and LLM-as-judge reliability. Our framework is assessed using four datasets, demonstrating its efficacy in enhancing factual accuracy and measuring LLM-as-judge performance, yielding a simplified and computationally efficient control layer that converts variability into statistical validity.
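The abstract's ship/abstain decision rule can be illustrated with a generic CRC calibration step: pick the most permissive threshold whose conformally corrected calibration risk stays below the user's bound. This is a minimal sketch of standard conformal risk control, not the paper's BB-CRC or RBWA-CRC variants; the function name, the binary hallucination loss, and the scanning strategy are illustrative assumptions.

```python
import numpy as np

def crc_threshold(cal_scores, cal_errors, alpha):
    """Smallest (most permissive) threshold lambda such that the
    conformally corrected hallucination risk on a calibration set
    stays <= alpha. An answer is 'shipped' when its confidence score
    >= lambda; the loss for example i is 1 if a hallucinated answer
    is shipped, else 0. (Generic CRC sketch, not the paper's variants.)"""
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_errors = np.asarray(cal_errors, dtype=bool)
    n = len(cal_scores)
    for lam in np.sort(np.unique(cal_scores)):  # low lambda ships more
        shipped = cal_scores >= lam
        r_hat = np.mean(shipped & cal_errors)   # empirical risk at lam
        # CRC finite-sample bound with losses in [0, 1]:
        # (n/(n+1)) * r_hat + 1/(n+1) <= alpha implies E[loss] <= alpha.
        if (n / (n + 1)) * r_hat + 1.0 / (n + 1) <= alpha:
            return float(lam)
    return None  # no threshold meets the bound: abstain/escalate everything
```

Because the loss is nonincreasing in the threshold, the first qualifying value in the ascending scan is the smallest valid one; at deployment, answers scoring below it are abstained or escalated rather than shipped.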
Problem

Research questions and friction points this paper is trying to address.

Control LLM randomness for precise risk assurance
Enhance computational efficiency of conformal risk control
Mitigate hallucinations and improve LLM-as-judge reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Randomized conformal risk control via API actuator
Batched bootstrap method enhances computational efficiency
Semantic quantification using Gram matrix geometry
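The Gram-matrix idea can be sketched as an unsupervised dispersion score over several sampled responses: embed the samples, form their cosine Gram matrix, and summarize how spread out the spectrum is. This is a generic illustration of a Gram-geometry signal; the paper's exact statistic, the entropy summary, and the function name are assumptions.

```python
import numpy as np

def semantic_dispersion(embeddings):
    """Unsupervised semantic-consistency signal from the Gram matrix of
    k sampled responses. Near 0 when the samples agree semantically,
    up to log(k) when they are mutually orthogonal; high dispersion is
    a cheap hallucination proxy. (Illustrative sketch only.)"""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    G = E @ E.T                                        # k x k cosine Gram matrix
    evals = np.clip(np.linalg.eigvalsh(G), 0.0, None)  # PSD spectrum
    p = evals / evals.sum()                            # spectral distribution
    p = p[p > 1e-12]
    # Entropy of the Gram spectrum: concentrated spectrum = agreement.
    return float(-(p * np.log(p)).sum())
```

A scalar like this can then feed the CRC layer directly, since conformal calibration only needs a score, not labels or model internals.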
Lingyou Pang
Department of Statistics, University of California, Davis
Lei Huang
Department of Statistics, University of California, Davis
Jianyu Lin
Sr. Machine Learning Engineer @ Intuitive Surgical, Inc.
Computer Vision · Machine Learning · Medical Image Analysis
Tianyu Wang
Department of Applied Mathematics and Statistics, Johns Hopkins University
Alexander Aue
Department of Statistics, University of California, Davis
Carey E. Priebe
Professor of Applied Mathematics and Statistics, Johns Hopkins University
statistical inference for high-dimensional and graph data