AI Summary
High-quality human preference data are scarce and costly to obtain, and existing approaches struggle to simultaneously address the unreliability of self-reward signals and the inefficiency of active learning in utilizing unlabeled data. This work proposes CoAct, a novel framework that synergistically integrates self-reward and active learning for the first time. CoAct leverages large language models' self-consistency to filter reliable self-annotated samples, employs uncertainty-based sampling to identify critical instances requiring human verification, and introduces a capability-aware mechanism to guide the model in generating new instructions within its competence boundary. Evaluated on GSM8K, MATH, and WebInstruct benchmarks, CoAct achieves average performance improvements of 13.25%, 8.19%, and 13.16%, respectively, substantially outperforming current state-of-the-art baselines.
Abstract
Learning from preference-based feedback has become an effective approach for aligning LLMs across diverse tasks. However, high-quality human-annotated preference data remains expensive and scarce. Existing methods address this challenge through either self-rewarding, which scales by using purely AI-generated labels but risks unreliability, or active learning, which ensures quality through oracle annotation but cannot fully leverage unlabeled data. In this paper, we present CoAct, a novel framework that synergistically combines self-rewarding and active learning through strategic human-AI collaboration. CoAct leverages self-consistency to identify both reliable self-labeled data and samples that require oracle verification. Additionally, oracle feedback guides the model to generate new instructions that it is capable of solving. Evaluated on three reasoning benchmarks across two model families, CoAct achieves average improvements of +13.25% on GSM8K, +8.19% on MATH, and +13.16% on WebInstruct, consistently outperforming all baselines.
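The routing idea described above, using self-consistency to separate reliable self-labeled data from samples that need oracle verification, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the majority-vote agreement score, and the 0.7 threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def consistency(answers):
    """Return (majority_answer, fraction of samples that agree with it)."""
    counts = Counter(answers)
    top, n = counts.most_common(1)[0]
    return top, n / len(answers)


def route(samples, high=0.7):
    """Split instructions into a self-labeled pool and an oracle-bound pool.

    samples: dict mapping an instruction to the final answers sampled
             from the model for that instruction.
    high:    hypothetical agreement threshold (not a value from the paper).
    """
    self_labeled, needs_oracle = {}, []
    for instruction, answers in samples.items():
        top, agreement = consistency(answers)
        if agreement >= high:
            # High agreement: trust the majority answer as a self-label.
            self_labeled[instruction] = top
        else:
            # Low agreement: the model is uncertain, so ask the oracle.
            needs_oracle.append(instruction)
    return self_labeled, needs_oracle


# Toy data: five sampled answers per instruction.
samples = {
    "q1": ["42", "42", "42", "42", "41"],
    "q2": ["7", "9", "7", "3", "8"],
}
self_labeled, needs_oracle = route(samples)
```

In this toy run, `q1` is self-labeled because four of its five samples agree, while `q2` is sent for oracle verification because no answer reaches the threshold. A real system would also need a way to turn the routed samples into preference pairs, which this sketch leaves out.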