General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a gap in how general-purpose reasoning is evaluated in large language models (LLMs): models excel in specialized domains such as mathematics and physics, yet struggle on problems that demand hard reasoning but little expert knowledge. To separate reasoning proficiency from domain-specific expertise, the authors introduce General365, a benchmark that restricts background knowledge to the K-12 level. It comprises 365 seed problems and 1,095 variant problems across eight categories, supported by a human-designed question framework, a variant generation strategy, and a standardized evaluation protocol. Experiments on 26 mainstream models show that even the best-performing model reaches only 62.8% accuracy, in contrast to the near-perfect scores LLMs achieve on math and physics benchmarks. The results point to persistent limitations in handling complex constraints, nested logical branches, and semantic interference.

📝 Abstract

Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts--often termed general reasoning--remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations across 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performances of LLMs in math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios. Code, Dataset, and Leaderboard: https://general365.github.io
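The benchmark sizes quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the abstract gives the counts 365 and 1,095 but does not state how variants are distributed, so the "variants per seed" figure is an inference from the ratio. The `accuracy` helper is a hypothetical plain fraction-correct score, not the paper's evaluation protocol.

```python
# Counts taken from the abstract.
SEED_PROBLEMS = 365
VARIANT_PROBLEMS = 1095

# Assumption: variants are spread evenly over seeds. The abstract does not
# say this; it is only suggested by the counts dividing exactly.
variants_per_seed, remainder = divmod(VARIANT_PROBLEMS, SEED_PROBLEMS)
total_problems = SEED_PROBLEMS + VARIANT_PROBLEMS


def accuracy(results):
    """Return the fraction of True entries in a list of per-problem outcomes.

    Hypothetical helper: a plain fraction-correct score, which may differ
    from the benchmark's standardized evaluation protocol.
    """
    if not results:
        raise ValueError("results must be non-empty")
    return sum(bool(r) for r in results) / len(results)


if __name__ == "__main__":
    print(f"variants per seed (if even): {variants_per_seed}, remainder {remainder}")
    print(f"total problems: {total_problems}")
    print(f"toy accuracy: {accuracy([True, False, True, True]):.2f}")
```

Because the counts divide exactly, the data is consistent with three variants per seed and 1,460 problems in total. The abstract does not say which subset the 62.8% figure is computed over.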

Problem

Research questions and friction points this paper is trying to address.

general reasoning

large language models

benchmarking

domain-general reasoning

reasoning evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

general reasoning

large language models

reasoning benchmark