AI Summary
Evaluating abductive reasoning capabilities of large language models (LLMs) remains challenging due to reliance on manual annotation, gold-standard answers, and opaque, task-specific metrics.
Method: We propose GEAR, a general, transparent, annotation-free, open-ended evaluation framework that quantifies hypothesis set quality along three dimensions: consistency, generalizability, and diversity. GEAR also introduces a momentum-based curriculum learning mechanism that dynamically adapts training objectives according to the model's learning speed.
Contribution/Results: Applied to nine mainstream LLMs across four benchmarks, GEAR generates over 50,000 hypotheses and uncovers fine-grained differences in model capability that conventional evaluation methods obscure. The momentum-based curriculum significantly improves abductive performance, with gains that transfer to existing benchmarks, establishing a scalable, reliable, and continuous assessment paradigm for LLM abductive reasoning.
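The three scoring dimensions can be illustrated with a minimal sketch. The function below is a simplified interpretation, not the paper's exact scoring procedure: `explains` and `predict` are assumed callbacks standing in for GEAR's consistency checks and prediction queries, and each metric is reduced to a simple ratio for clarity.

```python
def gear_score(hypotheses, observations, unseen_inputs, explains, predict):
    """Illustrative GEAR-style scores for a hypothesis set.

    explains(h, obs) -> bool : does hypothesis h account for obs?
    predict(h, x) -> value   : h's prediction on an unseen input x.
    """
    # Consistency: fraction of hypotheses that explain every observation.
    consistent = [h for h in hypotheses
                  if all(explains(h, o) for o in observations)]
    consistency = len(consistent) / len(hypotheses) if hypotheses else 0.0

    # Generalizability: fraction of consistent hypotheses that make a
    # well-defined prediction on every unseen input.
    general = [h for h in consistent
               if all(predict(h, x) is not None for x in unseen_inputs)]
    generalizability = len(general) / len(consistent) if consistent else 0.0

    # Diversity: fraction of distinct prediction patterns in the set,
    # so duplicated hypotheses do not inflate the score.
    patterns = {tuple(predict(h, x) for x in unseen_inputs) for h in general}
    diversity = len(patterns) / len(general) if general else 0.0

    return {"consistency": consistency,
            "generalizability": generalizability,
            "diversity": diversity}
```

For example, with observations `[(1, 2), (2, 3)]`, the hypotheses `x + 1`, `x + 1` (a duplicate), and `2 * x` yield consistency 2/3, generalizability 1.0, and diversity 0.5, since the two consistent hypotheses collapse to one prediction pattern.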
Abstract
Since the advent of large language models (LLMs), research has focused on instruction following and deductive reasoning. A central question remains: can these models discover new knowledge, and how can we evaluate this ability? We address this by studying abductive reasoning, the generation of plausible hypotheses to explain observations, and introduce GEAR (General Evaluation for Abductive Reasoning), a general-purpose, fully automated, transparent, and label-free evaluation paradigm. GEAR scores hypothesis sets by three metrics: consistency (each hypothesis explains the observations), generalizability (consistent hypotheses make meaningful predictions on unseen inputs), and diversity (the set covers distinct predictions and patterns). Built this way, GEAR is scalable (no human gold answers), reliable (deterministic scoring aligned with classical abduction), and open-ended (scores improve only when models produce new plausible hypotheses, unlike static benchmarks that saturate once accuracy is high). Using GEAR, we conduct a fine-grained study of nine LLMs on four abduction benchmarks with 1,500 problems, generating over 50,000 candidate hypotheses and revealing model differences obscured by gold-answer or purely human evaluations. We further propose a momentum-based curriculum that adjusts GEAR-derived training data by learning velocity: it starts with what the model learns quickly and shifts toward harder objectives, such as generating diverse hypotheses, once the model is confident on the foundational ones. Without gold-label supervision, this strategy improves all GEAR objectives, and these gains transfer to established abductive reasoning benchmarks. Taken together, GEAR provides a principled framework that evaluates abduction and supplies label-free, scalable training signals that help LLMs produce more diverse and reliable hypotheses.
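The velocity-driven shift in training emphasis can be sketched as follows. This is a hypothetical illustration of the idea, not the paper's implementation: the objective names, the exponential-moving-average update, and the weighting formula are all assumptions made for the example.

```python
class MomentumCurriculum:
    """Toy momentum-based curriculum over named training objectives.

    Tracks a smoothed score and a smoothed improvement rate ("velocity")
    per objective, then weights data sampling toward objectives the model
    is actively improving on; as those saturate, weight drifts toward
    harder, not-yet-mastered objectives.
    """

    def __init__(self, objectives, momentum=0.9):
        self.momentum = momentum
        self.score = {o: 0.0 for o in objectives}     # smoothed score in [0, 1]
        self.velocity = {o: 0.0 for o in objectives}  # smoothed score delta

    def update(self, new_scores):
        """Fold one round of per-objective eval scores into the averages."""
        for o, s in new_scores.items():
            delta = s - self.score[o]
            self.velocity[o] = (self.momentum * self.velocity[o]
                                + (1 - self.momentum) * delta)
            self.score[o] = (self.momentum * self.score[o]
                             + (1 - self.momentum) * s)

    def weights(self):
        """Normalized sampling weights: favor fast-improving objectives,
        with a small pull toward objectives whose score is still low."""
        raw = {o: max(self.velocity[o], 0.0) + 0.1 * (1.0 - self.score[o])
               for o in self.score}
        total = sum(raw.values()) or 1.0
        return {o: w / total for o, w in raw.items()}
```

Under this scheme, an objective the model picks up quickly (high velocity) dominates the mix early; once its score plateaus, the residual `(1 - score)` term hands weight to the objectives that remain hard, matching the easy-first, hard-later progression described above.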