🤖 AI Summary
Existing reasoning benchmarks suffer from data contamination, making it difficult to disentangle LLMs’ genuine reasoning capabilities from mere memorization.
Method: We propose KUMO, the first generative evaluation framework that integrates LLM-based task generation with a symbolic reasoning engine to dynamically construct multi-step, partially observable reasoning tasks across 100 open domains; tasks are difficulty-controllable and inherently immune to training data contamination.
Contribution/Results: KUMO enables the first cross-scale quantitative alignment of LLM reasoning performance with human undergraduate-level reasoning ability. Evaluating 23 state-of-the-art models on 5,000 novel tasks, we find that reasoning-enhanced models match average undergraduate performance on complex tasks. Moreover, KUMO scores correlate strongly (r > 0.89) with real-world reasoning benchmarks, establishing a new standard for trustworthy reasoning evaluation.
📝 Abstract
With powerful large language models (LLMs) demonstrating superhuman reasoning capabilities, a critical question arises: Do LLMs genuinely reason, or do they merely recall answers from their extensive, web-scraped training datasets? Publicly released benchmarks inevitably become contaminated once incorporated into subsequent LLM training sets, undermining their reliability as faithful assessments. To address this, we introduce KUMO, a generative evaluation framework designed specifically for assessing reasoning in LLMs. KUMO synergistically combines LLMs with symbolic engines to dynamically produce diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty. Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization. We evaluated 23 state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO, benchmarking their reasoning abilities against university students. Our findings reveal that many LLMs surpass university-level performance on easy reasoning tasks, and that reasoning-scaled LLMs reach university-level performance on complex reasoning challenges. Moreover, LLM performance on KUMO tasks correlates strongly with results on newly released real-world reasoning benchmarks, underscoring KUMO's value as a robust, enduring assessment tool for genuine LLM reasoning capabilities.
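The abstract describes tasks that are multi-turn, partially observable, and difficulty-adjustable: a solver starts with incomplete information and must choose informative actions to identify a hidden answer. The following is a toy sketch of that task shape only, not the authors' implementation; all names (`generate_task`, `run_test`, the binary-search solver) are hypothetical, and difficulty is modeled crudely as the size of the candidate set.

```python
import random


def generate_task(num_candidates=8, seed=None):
    """Symbolic-engine stand-in (hypothetical): build a partially
    observable deduction task. Difficulty scales with candidate count."""
    rng = random.Random(seed)
    candidates = [f"candidate_{i}" for i in range(num_candidates)]
    target = rng.choice(candidates)  # hidden ground truth
    return candidates, target


def run_test(target, subset):
    """One 'action': reveal only whether the hidden target is in `subset`."""
    return target in subset


def reference_solver(candidates, oracle):
    """Reference multi-turn solver: halve the candidate pool each turn."""
    pool = list(candidates)
    turns = 0
    while len(pool) > 1:
        half = pool[: len(pool) // 2]
        turns += 1
        pool = half if oracle(half) else pool[len(pool) // 2:]
    return pool[0], turns


candidates, target = generate_task(num_candidates=8, seed=0)
answer, turns = reference_solver(candidates, lambda s: run_test(target, s))
assert answer == target
print(answer, turns)  # 8 candidates resolve in 3 turns
```

In the framework described above, an LLM would replace the scripted solver and richer symbolic rules would replace the membership oracle; the toy preserves only the interaction loop and the difficulty knob.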