🤖 AI Summary
Existing query clustering methods rely on semantic labels or embeddings, which often fail to capture the capabilities that large language models (LLMs) actually need to answer a query, so capability assessments diverge from observed model performance. This work proposes the Embedding-Calibrated Clustering (ECC) algorithm, which folds posterior model-comparison results into the semantic clustering process. ECC calibrates prior semantic embeddings, uses trainable mixture weights to model queries that require multiple capabilities, and characterizes each cluster with a capability profile parameterized by a Bradley–Terry model, enabling capability-aware query clustering and query-level capability inference. Experiments show that ECC improves LLM capability ranking quality by an average of 17.64 and 18.02 percentage points over human-labeled and embedding-based baselines, respectively, and that it is effective in downstream tasks such as query routing.
📝 Abstract
Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations demonstrate that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves effective in downstream tasks such as query routing.
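The combination of per-cluster Bradley-Terry profiles and mixture weights can be illustrated with a minimal sketch. The abstract does not give ECC's exact parameterization, so everything below is an assumption made for illustration: the softmax over mixture logits, the choice to mix per-cluster win probabilities (rather than scores), and the names `bt_win_prob`, `query_win_prob`, `model_x`, and `model_y` are all hypothetical.

```python
import math


def bt_win_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that a model with score_a beats one with score_b."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def query_win_prob(mix_logits, cluster_scores, model_a, model_b):
    """Assumed form: mixture over clusters of per-cluster Bradley-Terry win probabilities.

    mix_logits     -- one unnormalized weight per cluster for this query
    cluster_scores -- one dict per cluster mapping model name to its BT score
    """
    weights = softmax(mix_logits)
    return sum(
        w * bt_win_prob(scores[model_a], scores[model_b])
        for w, scores in zip(weights, cluster_scores)
    )


# Hypothetical toy capability profiles: two clusters, two models.
cluster_scores = [
    {"model_x": 1.0, "model_y": 0.0},  # cluster 0 favors model_x
    {"model_x": 0.0, "model_y": 1.0},  # cluster 1 favors model_y
]

# A query weighted almost entirely toward cluster 0 inherits that cluster's preference.
p_single = query_win_prob([50.0, -50.0], cluster_scores, "model_x", "model_y")

# A query split evenly between the two clusters sees the preferences cancel out.
p_mixed = query_win_prob([0.0, 0.0], cluster_scores, "model_x", "model_y")
```

In this toy setup, a query dominated by one cluster yields a confident preference for the model that cluster favors, while an evenly mixed query yields no preference. That is the kind of query-specific inference the mixture weights are meant to support, though the actual ECC training procedure (how embeddings are calibrated and how weights and scores are learned jointly) is not shown here.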