🤖 AI Summary
This work addresses the hallucination and unreliable-evaluation issues arising from the inherent stochasticity of large language model (LLM) outputs. We propose a user-controllable risk-constrained framework centered on Conformal Risk Control (CRC), implemented at the API layer via two complementary mechanisms: Batched Bootstrap CRC (BB-CRC) and Randomized Batched Weighted-Average CRC (RBWA-CRC). Without requiring labeled data or model-specific assumptions, CRC transforms LLM output uncertainty into statistically reliable decisions satisfying user-specified risk bounds (e.g., hallucination rate ≤ 5%). We further leverage Gram matrix geometry for unsupervised semantic quantification, enabling interpretable risk signals and metrics. Extensive evaluation across four benchmark datasets demonstrates substantial improvements in factual accuracy and evaluator consistency, alongside enhanced threshold stability and reduced computational overhead. To our knowledge, this is the first framework offering statistical rigor, model-agnosticism, and engineering practicality for LLM output risk management.
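The user-specified risk bound works roughly as follows: on a calibration set of scored outputs, pick the smallest confidence threshold whose finite-sample-corrected empirical risk stays below the bound, then ship or abstain at that threshold. The sketch below is a minimal illustration of conformal risk control for a binary hallucination loss bounded by 1, not the paper's exact procedure; the function names and the ship/abstain rule are our own:

```python
import numpy as np

def calibrate_crc_threshold(cal_conf, cal_loss, lambdas, alpha):
    """Conformal Risk Control calibration for a ship/abstain rule.

    cal_conf : confidence scores for the n calibration outputs
    cal_loss : 1.0 if the corresponding output was hallucinated, else 0.0
    lambdas  : candidate thresholds; we ship an output iff conf >= lambda
    alpha    : user-specified risk bound, e.g. 0.05

    Returns the smallest lambda whose conformalized risk bound holds,
    or None if even maximal abstention cannot certify the bound.
    """
    n = len(cal_conf)
    cal_conf = np.asarray(cal_conf, dtype=float)
    cal_loss = np.asarray(cal_loss, dtype=float)
    for lam in np.sort(np.asarray(lambdas, dtype=float)):
        # Empirical risk: hallucination rate among shipped outputs,
        # averaged over the whole calibration set (loss is 0 on abstention).
        risk = float(np.mean(cal_loss * (cal_conf >= lam)))
        # Finite-sample correction for losses bounded by B = 1.
        if (n / (n + 1)) * risk + 1.0 / (n + 1) <= alpha:
            return float(lam)
    return None

def act(conf, lam):
    """API-layer actuator: ship confident outputs, abstain otherwise."""
    return "ship" if conf >= lam else "abstain"
```

The BB-CRC and RBWA-CRC variants in the summary would then, respectively, bootstrap this calibration over batches and average randomized thresholds to reduce calibration calls and stabilize the chosen lambda.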
📝 Abstract
We transform the randomness of LLMs into precise assurances using an actuator at the API interface that enforces a user-defined risk constraint in finite samples via Conformal Risk Control (CRC). This label-free, model-agnostic actuator manages ship/abstain/escalate actions based solely on a scalar score computed from opaque outputs. We enhance CRC's computational efficiency and robustness through Batched Bootstrap CRC (BB-CRC) and Randomized Batched Weighted-Average CRC (RBWA-CRC), reducing calibration calls and stabilizing thresholds while maintaining statistical validity. Additionally, we present a semantic quantification method grounded in Gram matrix geometry, yielding interpretable signals and metrics. Together these pieces deliver principled randomness control for LLM hallucination mitigation and LLM-as-judge reliability. Our framework is assessed on four datasets, demonstrating its efficacy in enhancing factual accuracy and measuring LLM-as-judge performance, and yielding a simple, computationally efficient control layer that converts variability into statistical validity.
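One plausible instantiation of the Gram-matrix semantic signal is to sample several responses to the same prompt, embed them, and measure how spread out the samples are via the spectrum of their Gram matrix; tightly clustered samples suggest a stable answer, while dispersed samples suggest hallucination risk. This is our own illustrative sketch under those assumptions, not the paper's exact signal:

```python
import numpy as np

def semantic_dispersion(embeddings, eps=1e-3):
    """Gram-matrix dispersion score for k sampled responses.

    embeddings : (k, d) array of embeddings, one row per sampled response
    Returns a scalar: near log(eps) (very negative) when the samples are
    semantically near-identical, approaching log(1 + eps) ~ 0 when they
    are mutually orthogonal (maximally dispersed).
    """
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    K = E @ E.T                                       # k x k Gram matrix
    k = K.shape[0]
    # Regularized mean log-eigenvalue: collapsed spectra (duplicate
    # samples) drag the score down; spread spectra push it toward 0.
    eigvals = np.linalg.eigvalsh(K + eps * np.eye(k))
    return float(np.mean(np.log(eigvals)))
```

A score like this could serve as the scalar input to the CRC actuator described above: calibrate a dispersion threshold, then abstain or escalate on prompts whose sampled responses disagree too much.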