Predictive Auditing of Hidden Tokens in LLM APIs via Reasoning Length Estimation

📅 2025-07-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Commercial large language model (LLM) APIs conceal their internal inference processes yet charge per total generated token—including intermediate reasoning tokens—leading to token inflation and overcharging risks. Existing auditing approaches are limited: cryptographic verification requires provider cooperation, while user-side prediction methods fail to robustly handle cross-domain and stylistically diverse token consumption patterns. This paper proposes PALACE, the first user-side, prediction-based, verifiable auditing framework that requires no internal API access. PALACE models implicit token consumption via supervised learning on prompt-answer pairs. It introduces a lightweight domain router and a GRPO-enhanced adaptive module to dynamically calibrate for task heterogeneity. Evaluated across mathematical reasoning, programming, medical QA, and general reasoning benchmarks, PALACE achieves low relative error and high prediction accuracy, enabling fine-grained cost verification and inflation detection. The framework advances transparency and standardization in LLM service billing.
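The billing check the summary describes can be sketched as a simple consistency test: given a predicted hidden-reasoning token count and the visible answer length, flag potential inflation when the provider's billed total exceeds the prediction by more than a tolerance. This is a hypothetical illustration of the auditing idea, not PALACE's actual code; the function name, fields, and tolerance are assumptions.

```python
def audit_billing(predicted_hidden, visible_tokens, billed_total, rel_tol=0.15):
    """Hypothetical sketch of prediction-based billing audit (not PALACE's code).

    predicted_hidden: estimated hidden reasoning tokens (e.g., from a
                      PALACE-style predictor trained on prompt-answer pairs)
    visible_tokens:   tokens in the visible answer, countable by the user
    billed_total:     total tokens the provider charged for
    rel_tol:          allowed relative error before flagging inflation
    """
    expected = predicted_hidden + visible_tokens
    rel_error = (billed_total - expected) / expected
    return {
        "expected": expected,
        "rel_error": rel_error,
        "inflation_suspected": rel_error > rel_tol,
    }
```

With, say, 800 predicted hidden tokens and a 200-token answer, a bill of 1,200 tokens (20% over the 1,000 expected) would be flagged, while 1,050 would fall inside the tolerance.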

📝 Abstract
Commercial LLM services often conceal internal reasoning traces while still charging users for every generated token, including those from hidden intermediate steps, raising concerns of token inflation and potential overbilling. This gap underscores the urgent need for reliable token auditing, yet achieving it is far from straightforward: cryptographic verification (e.g., hash-based signatures) offers little assurance when providers control the entire execution pipeline, while user-side prediction struggles with the inherent variance of reasoning LLMs, where token usage fluctuates across domains and prompt styles. To bridge this gap, we present PALACE (Predictive Auditing of LLM APIs via Reasoning Token Count Estimation), a user-side framework that estimates hidden reasoning token counts from prompt-answer pairs without access to internal traces. PALACE introduces a GRPO-augmented adaptation module with a lightweight domain router, enabling dynamic calibration across diverse reasoning tasks and mitigating variance in token usage patterns. Experiments on math, coding, medical, and general reasoning benchmarks show that PALACE achieves low relative error and strong prediction accuracy, supporting both fine-grained cost auditing and inflation detection. Taken together, PALACE represents an important first step toward standardized predictive auditing, offering a practical path to greater transparency, accountability, and user trust.
Problem

Research questions and friction points this paper is trying to address.

Auditing hidden token usage in LLM APIs
Predicting token counts without internal traces
Handling variance in token usage across domains and prompt styles
Innovation

Methods, ideas, or system contributions that make the work stand out.

Estimates hidden token counts from prompt-answer pairs
Uses GRPO-augmented module with lightweight domain router
Achieves low relative error across diverse reasoning benchmarks
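The pipeline these bullets outline — route a prompt to a domain, then apply a domain-calibrated estimator — might look roughly like the sketch below. The keyword router and linear length heuristic are illustrative stand-ins for the paper's learned router and GRPO-adapted predictor; all names and coefficients here are assumptions, not PALACE's implementation.

```python
# Hypothetical route-then-estimate sketch. The real PALACE components
# (a learned domain router and a GRPO-tuned model) are replaced by a
# keyword router and a per-domain linear heuristic for illustration.

DOMAIN_KEYWORDS = {
    "math": ("prove", "integral", "equation"),
    "coding": ("function", "compile", "bug"),
    "medical": ("patient", "diagnosis", "symptom"),
}

# Per-domain calibration: hidden_tokens ~= slope * answer_len + intercept
# (made-up coefficients; a real system would learn these from data).
CALIBRATION = {
    "math": (4.0, 120.0),
    "coding": (2.5, 80.0),
    "medical": (1.8, 60.0),
    "general": (2.0, 50.0),
}

def route(prompt: str) -> str:
    """Assign a prompt to a task domain via keyword matching."""
    text = prompt.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return domain
    return "general"

def estimate_hidden_tokens(prompt: str, answer: str) -> float:
    """Estimate hidden reasoning tokens from a prompt-answer pair."""
    slope, intercept = CALIBRATION[route(prompt)]
    answer_len = len(answer.split())  # crude whitespace token proxy
    return slope * answer_len + intercept
```

The design point the bullets capture is that a single global estimator underfits: math prompts tend to consume far more hidden reasoning tokens per answer token than, say, medical QA, so routing first and calibrating per domain reduces cross-domain variance.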