🤖 AI Summary
Existing LLM evaluation faces two key bottlenecks in answer verification: (1) the absence of systematic, comprehensive benchmarks, and (2) insufficient robustness and cross-domain generalization of current verifiers. To address these, we propose CompassVerifier, the first lightweight, efficient, and unified answer verifier supporting mathematical reasoning, factual knowledge, and diverse reasoning tasks. It handles multi-step subproblems, symbolic expressions, sequential outputs, and anomalous responses. We also introduce VerifierBench, the first systematic verification benchmark, constructed via meta-error pattern analysis to make the training data more diverse and realistic. CompassVerifier integrates two mechanisms, structured matching and semantic discrimination, leveraging multimodal inputs and human-annotated supervision. Extensive experiments demonstrate that CompassVerifier significantly outperforms rule-based methods and general-purpose LLM verifiers across domains, generalizing well and remaining robust to adversarial perturbations and output irregularities. Code and data are publicly released.
📝 Abstract
Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also for serving as the reward model that guides LLM optimization. Most evaluation frameworks rely on rule-based (regex) matching or employ general-purpose LLMs for answer verification, both of which demand extensive, repetitive customization of regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, and can process various answer types, including multi-subproblem, formula, and sequence answers, while effectively identifying abnormal or invalid responses. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns, to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate answer verification, evaluation protocols, and reinforcement learning research. Code and dataset are available at https://github.com/open-compass/CompassVerifier.
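To make the customization burden of rule-based matching concrete, the following is a minimal sketch of a regex verifier of the kind the abstract contrasts against. It is not CompassVerifier or any code from the repository; the patterns and the helper names (`extract_answer`, `normalize`, `rule_based_verify`) are hypothetical and chosen only for illustration.

```python
import re
from fractions import Fraction

# Hypothetical extraction rules (assumptions, not taken from the paper):
# look for \boxed{...} first, then for the token after "answer is".
ANSWER_PATTERNS = [
    re.compile(r"\\boxed\{([^{}]+)\}"),
    re.compile(r"answer is[:\s]*([^\s]+)", re.IGNORECASE),
]


def extract_answer(response):
    """Return the first answer-like span found in the response, or None."""
    for pattern in ANSWER_PATTERNS:
        match = pattern.search(response)
        if match:
            # Drop a trailing sentence period such as the one in "0.5."
            return match.group(1).strip().rstrip(".")
    return None


def normalize(text):
    """Parse numbers into exact fractions; otherwise fall back to lowercase text."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return cleaned.lower()


def rule_based_verify(response, gold):
    """Judge a response correct only if an extracted answer equals the gold answer."""
    candidate = extract_answer(response)
    if candidate is None:
        return False
    return normalize(candidate) == normalize(gold)
```

Under these rules, "The answer is 0.5." is accepted against the gold answer "1/2", but "x equals one half" and "The answer is \frac{1}{2}" are both rejected even though they are correct. Each such case needs another hand-written rule, which is the repetitive customization the abstract describes.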