Minimizing Human Intervention in Online Classification

📅 2025-10-27

🤖 AI Summary

This paper addresses online query classification in question-answering systems with minimal human intervention: given d-dimensional query embeddings, an agent must sequentially classify queries, consulting a costly human expert only when necessary, so as to minimize regret against an oracle with free expert access. When the horizon T is at least exponential in d, the Conservative Hull-based Classifier (CHC) maintains convex hulls of expert-labeled queries and calls the expert whenever a query falls outside all known hulls; it achieves O(logᵈ T) regret and is minimax optimal in the univariate case (d = 1). For T ≤ eᵈ and queries drawn from a subgaussian mixture, the Center-based Classifier (CC) achieves regret proportional to N log N, where N is the number of labels. The Generalized Hull-based Classifier (GHC) bridges the two regimes by extending CHC with a tunable threshold that permits more aggressive guessing. The approach is validated experimentally, notably on real-world QA datasets with embeddings derived from large language models.

📝 Abstract

We introduce and study an online problem arising in question answering systems. In this problem, an agent must sequentially classify user-submitted queries represented by $d$-dimensional embeddings drawn i.i.d. from an unknown distribution. The agent may consult a costly human expert for the correct label, or guess on her own without receiving feedback. The goal is to minimize regret against an oracle with free expert access. When the time horizon $T$ is at least exponential in the embedding dimension $d$, one can learn the geometry of the class regions: in this regime, we propose the Conservative Hull-based Classifier (CHC), which maintains convex hulls of expert-labeled queries and calls the expert as soon as a query lands outside all known hulls. CHC attains $\mathcal{O}(\log^d T)$ regret in $T$ and is minimax optimal for $d=1$. Otherwise, the geometry cannot be reliably learned without additional distributional assumptions. We show that when the queries are drawn from a subgaussian mixture, for $T \le e^d$, a Center-based Classifier (CC) achieves regret proportional to $N \log N$, where $N$ is the number of labels. To bridge these regimes, we introduce the Generalized Hull-based Classifier (GHC), a practical extension of CHC that allows for more aggressive guessing via a tunable threshold parameter. Our approach is validated with experiments, notably on real-world question-answering datasets using embeddings derived from state-of-the-art large language models.
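
The hull-based decision rule described in the abstract can be illustrated with a minimal sketch for the univariate case ($d=1$), where the convex hull of a label's expert-labeled queries is simply an interval. This is not the authors' implementation: the class name `IntervalHullClassifier`, the function `toy_expert`, and the distance-based reading of the `threshold` parameter are assumptions made for illustration. With `threshold=0` the rule is conservative in the CHC sense (guess only inside a known hull); a positive `threshold` is one plausible way to guess more aggressively, in the spirit of GHC.

```python
from typing import Callable, Dict, Hashable, Tuple


class IntervalHullClassifier:
    """1-D sketch: each label's hull is the interval spanned by its expert-labeled queries."""

    def __init__(self, expert: Callable[[float], Hashable], threshold: float = 0.0):
        self.expert = expert          # costly oracle returning the true label
        self.threshold = threshold    # assumed knob: 0.0 means guess only inside a hull
        self.hulls: Dict[Hashable, Tuple[float, float]] = {}
        self.expert_calls = 0

    @staticmethod
    def _distance(x: float, hull: Tuple[float, float]) -> float:
        lo, hi = hull
        return max(lo - x, 0.0, x - hi)  # 0.0 when x lies inside [lo, hi]

    def classify(self, x: float) -> Hashable:
        # Find the nearest known hull, if any.
        best_label, best_dist = None, float("inf")
        for label, hull in self.hulls.items():
            dist = self._distance(x, hull)
            if dist < best_dist:
                best_label, best_dist = label, dist

        # Guess without feedback; hulls are built from expert labels only.
        if best_label is not None and best_dist <= self.threshold:
            return best_label

        # Otherwise pay for the expert and grow that label's hull.
        label = self.expert(x)
        self.expert_calls += 1
        lo, hi = self.hulls.get(label, (x, x))
        self.hulls[label] = (min(lo, x), max(hi, x))
        return label


def toy_expert(x: float) -> str:
    """Hypothetical ground truth: two classes separated at zero."""
    return "a" if x < 0 else "b"


clf = IntervalHullClassifier(toy_expert)
labels = [clf.classify(x) for x in [-2.0, -1.0, -1.5, 1.0, 3.0, 2.0]]
```

In this toy run the queries `-1.5` and `2.0` fall inside intervals already learned from expert labels, so they are guessed without an expert call. In higher dimensions the interval test would be replaced by a convex-hull membership test, which the sketch does not attempt.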

Problem

Research questions and friction points this paper is trying to address.

Minimizing human expert consultations in online classification systems

Sequentially classifying queries without receiving feedback on guesses

Achieving low regret against an oracle with free expert access

Innovation

Methods, ideas, or system contributions that make the work stand out.

Conservative Hull-based Classifier calls the expert only when a query falls outside all known convex hulls

Generalized Hull-based Classifier enables tunable threshold guessing

Center-based Classifier achieves regret proportional to N log N for subgaussian mixtures when T ≤ eᵈ
