Automated Capability Discovery via Model Self-Exploration

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automated, systematic discovery of latent capabilities and risks in foundation models (e.g., GPT, Claude, Llama) remains challenging. Method: The Automated Capability Discovery (ACD) framework assigns a foundation model the role of “scientist”, which autonomously generates open-ended tasks and evaluates the capabilities and failure modes of a subject model (potentially itself), enabling fully automated capability mapping. ACD introduces a model-driven probing paradigm that couples a task-generation loop with an evaluation loop and calibrates scoring for consistency, validated against human evaluation to establish a reliable automated scoring system. Results: ACD automatically identifies thousands of implicit capabilities and failure patterns across multiple model families, and its automated scores show strong agreement with human judgments (Cohen’s κ > 0.85). The implementation, including full experimental logs, is publicly released.
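
For intuition, here is a minimal sketch of the scientist–subject loop described above. The `chat` callable stands in for any chat-completion client, and the prompts and PASS/FAIL rubric are illustrative assumptions, not the paper's actual prompts or scoring scheme.

```python
from typing import Callable

Chat = Callable[[str, str], str]  # (model_name, prompt) -> completion text

def discover_capabilities(chat: Chat, scientist: str, subject: str,
                          n_tasks: int = 5) -> list[dict]:
    """One ACD-style round: the scientist proposes tasks, the subject
    attempts them, and the scientist judges each attempt (toy rubric)."""
    findings: list[dict] = []
    seen: list[str] = []
    for _ in range(n_tasks):
        # 1. Scientist proposes a new open-ended task, steered away from prior ones.
        task = chat(scientist,
                    "Propose one new, self-contained task probing an ability "
                    "not covered by these earlier tasks:\n" + "\n".join(seen))
        seen.append(task)
        # 2. Subject model attempts the task.
        answer = chat(subject, task)
        # 3. Scientist judges the attempt with a simple PASS/FAIL verdict.
        verdict = chat(scientist,
                       f"Task: {task}\nResponse: {answer}\n"
                       "Did the response succeed? Answer PASS or FAIL with one reason.")
        findings.append({"task": task, "answer": answer, "verdict": verdict})
    return findings
```

In the paper's full framework, tasks are additionally filtered for novelty in the spirit of open-endedness methods; the linear loop above omits that step for brevity.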

📝 Abstract
Foundation models have become general-purpose assistants, exhibiting diverse capabilities across numerous domains through training on web-scale data. It remains challenging to precisely characterize even a fraction of the full spectrum of capabilities and potential risks in any new model. Existing evaluation approaches often require significant human effort, and it is taking increasing effort to design ever harder challenges for more capable models. We introduce Automated Capability Discovery (ACD), a framework that designates one foundation model as a scientist to systematically propose open-ended tasks probing the abilities of a subject model (potentially itself). By combining frontier models with ideas from the field of open-endedness, ACD automatically and systematically uncovers both surprising capabilities and failures in the subject model. We demonstrate ACD across a range of foundation models (including the GPT, Claude, and Llama series), showing that it automatically reveals thousands of capabilities that would be challenging for any single team to uncover. We further validate our method's automated scoring with extensive human surveys, observing high agreement between model-generated and human evaluations. By leveraging foundation models' ability to both create tasks and self-evaluate, ACD is a significant step toward scalable, automated evaluation of novel AI systems. All code and evaluation logs are open-sourced at https://github.com/conglu1997/ACD.
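
The agreement claim in the abstract is typically quantified with a chance-corrected statistic such as Cohen's κ. A small sketch, assuming model and human verdicts encoded as matched label lists, computes it with scikit-learn; the data here is made up, not the paper's survey results.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical matched verdicts on the same task attempts (1 = pass, 0 = fail);
# a real validation would use the human-survey labels, not this toy data.
model_scores = [1, 0, 1, 1, 0, 1, 0, 1]
human_scores = [1, 0, 1, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(model_scores, human_scores)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```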
Problem

Research questions and friction points this paper is trying to address.

Automated discovery of model capabilities
Systematic evaluation without human intervention
Scalable assessment of AI system performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated task creation and evaluation
Foundation model self-exploration
Scalable AI system assessment