🤖 AI Summary
This work addresses the hallucination and unreliable-evaluation issues arising from the inherent stochasticity of large language model (LLM) outputs. We propose a user-controllable risk-constrained framework centered on Conformal Risk Control (CRC), implemented at the API layer via two complementary mechanisms: Batched Bootstrap CRC (BB-CRC) and Randomized Batched Weighted-Average CRC (RBWA-CRC). Without requiring labeled data or model-specific assumptions, CRC transforms LLM output uncertainty into statistically reliable decisions satisfying user-specified risk bounds (e.g., hallucination rate ≤ 5%). We further leverage Gram matrix geometry for unsupervised semantic quantification, enabling interpretable risk signals and metrics. Extensive evaluation across four benchmark datasets demonstrates substantial improvements in factual accuracy and evaluator consistency, alongside enhanced threshold stability and reduced computational overhead. To our knowledge, this is the first framework offering statistical rigor, model-agnosticism, and engineering practicality for LLM output risk management.
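The user-specified risk bound works roughly as follows: on a calibration set of scored outputs, pick the smallest confidence threshold whose finite-sample-corrected empirical risk stays below the bound, then ship or abstain at that threshold. The sketch below is a minimal illustration of conformal risk control for a binary hallucination loss bounded by 1, not the paper's exact procedure; the function names and the ship/abstain rule are our own:

```python
import numpy as np

def calibrate_crc_threshold(cal_conf, cal_loss, lambdas, alpha):
    """Conformal Risk Control calibration for a ship/abstain rule.

    cal_conf : confidence scores for the n calibration outputs
    cal_loss : 1.0 if the corresponding output was hallucinated, else 0.0
    lambdas  : candidate thresholds; we ship an output iff conf >= lambda
    alpha    : user-specified risk bound, e.g. 0.05

    Returns the smallest lambda whose conformalized risk bound holds,
    or None if even maximal abstention cannot certify the bound.
    """
    n = len(cal_conf)
    cal_conf = np.asarray(cal_conf, dtype=float)
    cal_loss = np.asarray(cal_loss, dtype=float)
    for lam in np.sort(np.asarray(lambdas, dtype=float)):
        # Empirical risk: hallucination rate among shipped outputs,
        # averaged over the whole calibration set (loss is 0 on abstention).
        risk = float(np.mean(cal_loss * (cal_conf >= lam)))
        # Finite-sample correction for losses bounded by B = 1.
        if (n / (n + 1)) * risk + 1.0 / (n + 1) <= alpha:
            return float(lam)
    return None

def act(conf, lam):
    """API-layer actuator: ship confident outputs, abstain otherwise."""
    return "ship" if conf >= lam else "abstain"
```

The BB-CRC and RBWA-CRC variants in the summary would then, respectively, bootstrap this calibration over batches and average randomized thresholds to reduce calibration calls and stabilize the chosen lambda.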
📝 Abstract
We transform the randomness of LLMs into precise assurances using an actuator at the API interface that enforces a user-defined risk constraint in finite samples via Conformal Risk Control (CRC). This label-free, model-agnostic actuator manages ship/abstain/escalate actions based solely on a scalar score computed from opaque outputs. We enhance CRC's computational efficiency and robustness through Batched Bootstrap CRC (BB-CRC) and Randomized Batched Weighted-Average CRC (RBWA-CRC), reducing calibration calls and stabilizing thresholds while maintaining statistical validity. Additionally, we present a semantic quantification method grounded in Gram matrix geometry, yielding interpretable signals and metrics. Together these pieces deliver principled randomness control for LLM hallucination mitigation and LLM-as-judge reliability. Our framework is assessed on four datasets, demonstrating its efficacy in enhancing factual accuracy and measuring LLM-as-judge performance, and yielding a simple, computationally efficient control layer that converts variability into statistical validity.
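One plausible instantiation of the Gram-matrix semantic signal is to sample several responses to the same prompt, embed them, and measure how spread out the samples are via the spectrum of their Gram matrix; tightly clustered samples suggest a stable answer, while dispersed samples suggest hallucination risk. This is our own illustrative sketch under those assumptions, not the paper's exact signal:

```python
import numpy as np

def semantic_dispersion(embeddings, eps=1e-3):
    """Gram-matrix dispersion score for k sampled responses.

    embeddings : (k, d) array of embeddings, one row per sampled response
    Returns a scalar: near log(eps) (very negative) when the samples are
    semantically near-identical, approaching log(1 + eps) ~ 0 when they
    are mutually orthogonal (maximally dispersed).
    """
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    K = E @ E.T                                       # k x k Gram matrix
    k = K.shape[0]
    # Regularized mean log-eigenvalue: collapsed spectra (duplicate
    # samples) drag the score down; spread spectra push it toward 0.
    eigvals = np.linalg.eigvalsh(K + eps * np.eye(k))
    return float(np.mean(np.log(eigvals)))
```

A score like this could serve as the scalar input to the CRC actuator described above: calibrate a dispersion threshold, then abstain or escalate on prompts whose sampled responses disagree too much.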