🤖 AI Summary
In large language model (LLM) cloud services, token-based billing incentivizes providers to overreport token counts for profit, undermining billing integrity.
Method: We propose the first third-party auditing framework with provable statistical guarantees, leveraging martingale theory to design a sequential hypothesis test. The auditor issues lightweight black-box queries and statistically validates the provider's reported outputs in real time, strictly bounding the false-positive rate for honest providers at ≤α (e.g., 0.05).
Contribution/Results: This work pioneers the application of martingale inequalities to LLM billing auditing, achieving theoretically grounded trade-offs between detection power and error control. Experiments show the framework reliably detects overreporting after observing fewer than ~70 reported outputs, with robust performance across several open-weight LLMs—from the Llama, Gemma, and Ministral families—and diverse real-world prompt sets. It delivers a deployable, statistically rigorous auditing infrastructure for trustworthy AI service billing.
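To make the martingale mechanism concrete, here is a minimal sketch of an anytime-valid sequential test of the kind the summary describes. The observation model and all parameters (`p0`, `lam`, `alpha`) are illustrative assumptions, not the paper's actual test statistic: we pretend the auditor reduces each reported output to a binary consistency check that has mean `p0` under an honest provider, and bets against the null with a product (betting) martingale. By Ville's inequality, a nonnegative martingale starting at 1 exceeds `1/alpha` with probability at most `alpha` under the null, which is what bounds the false-flag rate.

```python
def sequential_audit(observations, p0=0.5, lam=0.5, alpha=0.05):
    """Anytime-valid sequential test via a betting martingale (illustrative sketch).

    Under H0 (honest provider), each binary observation x has mean p0, so
    wealth_t = prod_{i<=t} (1 + lam * (x_i - p0)) is a nonnegative martingale
    with wealth_0 = 1.  Ville's inequality gives
        P(sup_t wealth_t >= 1/alpha | H0) <= alpha,
    so flagging the provider when wealth crosses 1/alpha keeps the
    probability of falsely flagging an honest provider below alpha,
    no matter when (or whether) the audit stops.
    """
    wealth = 1.0
    for t, x in enumerate(observations, start=1):
        # lam must keep the factor positive: here lam=0.5, p0=0.5 gives
        # factors 1.25 (x=1) and 0.75 (x=0), both valid bets.
        wealth *= 1.0 + lam * (x - p0)
        if wealth >= 1.0 / alpha:
            return t  # provider flagged as unfaithful at step t
    return None  # never flagged within the observed sequence
```

An overreporting provider shifts the observations above `p0`, so the wealth grows geometrically and crosses the `1/alpha` threshold quickly; for an honest provider the wealth is a martingale and, with probability at least 1 − α, never reaches it.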
📝 Abstract
Millions of users rely on a market of cloud-based services to obtain access to state-of-the-art large language models. However, it has recently been shown that the de facto pay-per-token pricing mechanism used by providers creates a financial incentive for them to strategize and misreport the (number of) tokens a model used to generate an output. In this paper, we develop an auditing framework based on martingale theory that enables a trusted third-party auditor who sequentially queries a provider to detect token misreporting. Crucially, we show that our framework is guaranteed to always detect token misreporting, regardless of the provider's (mis-)reporting policy, and, with high probability, to never falsely flag a faithful provider as unfaithful. To validate our auditing framework, we conduct experiments across a wide range of (mis-)reporting policies using several large language models from the $\texttt{Llama}$, $\texttt{Gemma}$ and $\texttt{Ministral}$ families, and input prompts from a popular crowdsourced benchmarking platform. The results show that our framework detects an unfaithful provider after observing fewer than $\sim 70$ reported outputs, while maintaining the probability of falsely flagging a faithful provider below $\alpha = 0.05$.