Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

📅 2026-05-18

🤖 AI Summary
This work targets per-round selective risk control with pathwise validity, without pooling across deployments or relying on long-run averages, for specialist LLMs that are fine-tuned via RLVR on operator-local data and deployed online in regulated organizations. It introduces Conformal Selective Acting (CSA), a model-agnostic deployment wrapper that requires no modification to the underlying model. CSA maintains one Ville-type e-process per threshold on a Bonferroni grid and, assuming predictable updates and isotonic-calibrated monotone risk, guarantees anytime-pathwise selective risk control. Experiments cover eight specialist benchmarks (480 streams), sixteen adversarial distribution-shift cells (160 streams), and five live Expert-Iteration RLVR cells (10,300 rounds); CSA is the only one of the ten compared methods that satisfies both pathwise validity and non-refusing deployment on every cell.
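As background on the "Ville-type e-process on a Bonferroni grid" mentioned above, the two standard ingredients can be written out as follows. This is a textbook-level sketch, not the paper's exact construction; the symbols $E_t^{(k)}$, $K$, and $\delta$ are notation assumed here for illustration.

```latex
% Background sketch (standard results; E_t^{(k)}, K, \delta are assumed notation).
% For each grid threshold k = 1, ..., K, let (E_t^{(k)})_{t \ge 0} be a nonnegative
% supermartingale with E_0^{(k)} = 1 under the null
% "threshold k violates the risk budget" (call such a k unsafe).
\begin{align*}
  \text{Ville:}\quad
    & \Pr\!\Big(\exists\, t \ge 0 : E_t^{(k)} \ge \tfrac{K}{\delta}\Big) \le \tfrac{\delta}{K}
      \quad \text{for each unsafe threshold } k, \\
  \text{Bonferroni:}\quad
    & \Pr\!\Big(\exists\, k \text{ unsafe},\ \exists\, t \ge 0 : E_t^{(k)} \ge \tfrac{K}{\delta}\Big)
      \le \sum_{k=1}^{K} \tfrac{\delta}{K} = \delta .
\end{align*}
```

Because the bound holds simultaneously over all rounds $t$, a threshold certified by crossing $K/\delta$ can be checked at every round without paying for repeated looks.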
📝 Abstract
A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with per-deployment error budget $\alpha$. The operator needs a safety certificate for this deployment's stream at every round: no pooling across deployments, no waiting for a long-run average. Existing wrappers cannot deliver this on adaptive, online-updated streams: offline conformal-risk methods require exchangeability; online-conformal methods bound only long-run averages; non-exchangeable extensions are marginally valid; and the closest anytime wrapper, A-RCPS, controls marginal rather than selective risk. Using a (test statistic, validity guarantee, deployment rule) framework, we identify one empty cell forced by deployment requirements: e-process per threshold, selective risk, anytime-pathwise validity, max-certified-threshold rule. Conformal Selective Acting (CSA) fills it as a per-round wrapper maintaining a Ville-type e-process per threshold on a Bonferroni grid, evaluated against the RLVR filtration. Under predictable updates and isotonic-calibrated monotone risk we prove (i) an anytime-pathwise selective-risk bound $R_T^{\mathrm{act}} \le \alpha + O(N_T^{-1/2})$, (ii) rate-optimal certification matching $\Theta(\bar{\eta}^{-2}\log(1/\delta))$, and (iii) a horizon-independent release-rate gap. Across eight specialist benchmarks ($480$ streams), sixteen adversarial distribution-shift cells ($160$ streams), and five live Expert-Iteration RLVR cells with online LoRA over four base models in three architecture families ($10{,}300$ rounds), CSA is the only method among ten compared that satisfies pathwise validity and non-refusing deployment on every cell. We do not propose a new LLM, training algorithm, or policy class; CSA is the deployment-side complement, orthogonal to the model, for operators who cannot use a frontier API.
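To make the "e-process per threshold on a Bonferroni grid" and "max-certified-threshold rule" concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm. The class name `ThresholdEProcessGrid`, the fixed-fraction bet, the convention that the model acts when a calibrated risk score is at most the threshold, and the assumption that a loss in [0, 1] is observed on every round are all hypothetical choices made for illustration.

```python
import math
import random


class ThresholdEProcessGrid:
    """Toy betting e-processes, one per threshold, tested at level delta / K each.

    Assumed convention (not from the paper): the model acts at threshold lam
    when risk_score <= lam, so a larger threshold is more permissive.
    """

    def __init__(self, thresholds, alpha, delta, bet=0.5):
        if not 0.0 < alpha < 1.0:
            raise ValueError("alpha must lie in (0, 1)")
        if not 0.0 < delta < 1.0:
            raise ValueError("delta must lie in (0, 1)")
        # bet < 1 / (1 - alpha) keeps every wealth factor strictly positive.
        if not 0.0 <= bet < 1.0 / (1.0 - alpha):
            raise ValueError("bet must lie in [0, 1 / (1 - alpha))")
        self.thresholds = sorted(thresholds)
        self.alpha = alpha
        self.bet = bet
        # Bonferroni: each of the K thresholds must reach wealth K / delta.
        self.log_bar = math.log(len(self.thresholds) / delta)
        self.log_wealth = [0.0] * len(self.thresholds)
        self.certified = [False] * len(self.thresholds)

    def update(self, risk_score, loss):
        """Fold one round into every threshold that would have acted on it."""
        for k, lam in enumerate(self.thresholds):
            if risk_score <= lam:
                # Expected factor is at most 1 whenever the conditional
                # expected loss on acted rounds is at least alpha.
                factor = 1.0 + self.bet * (self.alpha - loss)
                self.log_wealth[k] += math.log(factor)
                if self.log_wealth[k] >= self.log_bar:
                    # Sticky flag: Ville's inequality bounds the chance of
                    # ever crossing the bar, so one crossing suffices.
                    self.certified[k] = True

    def max_certified_threshold(self):
        """Return the most permissive certified threshold, or None."""
        ok = [lam for lam, c in zip(self.thresholds, self.certified) if c]
        return max(ok) if ok else None


if __name__ == "__main__":
    rng = random.Random(0)
    grid = ThresholdEProcessGrid([0.05, 0.1, 0.15, 0.2], alpha=0.1, delta=0.05)
    for _ in range(2000):
        score = rng.random()                            # synthetic risk score
        loss = 1.0 if rng.random() < score else 0.0     # error rate equals score
        grid.update(score, loss)
    print("max certified threshold:", grid.max_certified_threshold())
```

The fixed bet keeps the sketch short; an adaptive but predictable bet (chosen from past rounds only) would preserve the supermartingale property and typically certify faster. Working in log-wealth avoids overflow on long streams.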
Problem

Research questions and friction points this paper is trying to address.

conformal prediction
selective risk
anytime validity
RLVR
safety certification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conformal Selective Acting
anytime-valid inference
selective risk control
e-process
RLVR-trained LLMs