🤖 AI Summary
Existing mathematical evaluation benchmarks suffer from ceiling effects, limiting their ability to discriminate the fine-grained reasoning capabilities of state-of-the-art large language models (LLMs). To address this, we propose SKYLENAGE, a dual-benchmark framework comprising SKYLENAGE-ReasoningMATH (emphasizing multi-step deductive reasoning) and SKYLENAGE-MATH (spanning high-school to doctoral-level difficulty). To our knowledge, this is the first structure-aware, metadata-rich, contest-style diagnostic evaluation suite covering seven mathematical subjects and four difficulty stages. Leveraging fine-grained difficulty calibration, subject categorization, and metadata-driven analysis, we evaluate 15 leading LLMs under a uniform setup. Results show that even the strongest model achieves only 44% accuracy on competition-level problems versus 81% on structured reasoning tasks, highlighting a critical bottleneck in higher-order mathematical reasoning. Furthermore, top models retain roughly 79% of their high-school accuracy on doctoral-level problems, yielding a retention metric that quantifies how mathematical reasoning capability transfers across difficulty levels.
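For concreteness, the retention figure can be read as a ratio of stage-level accuracies; a minimal formulation (the notation below is ours, not taken from the paper) is

\[
R_{\mathrm{doc}\to\mathrm{hs}}(m) \;=\; \frac{\mathrm{Acc}_{\mathrm{doctoral}}(m)}{\mathrm{Acc}_{\mathrm{high\ school}}(m)},
\]

so a retention near 0.79 for model m means it keeps roughly 79% of its high-school accuracy on doctoral-level items.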
📝 Abstract
Large language models (LLMs) now perform strongly on many public math suites, yet separation among frontier models on mathematics increasingly suffers from ceiling effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a 100-item, structure-aware diagnostic set with per-item metadata on length, numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item contest-style suite spanning four stages, from high school to doctoral, under a seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a single setup and analyze subject × model and grade × model performance. On the contest suite, the strongest model reaches 44% while the runner-up reaches 37%; accuracy declines from the high-school to the doctoral stage, and top systems exhibit a doctoral-to-high-school retention near 79%. On the reasoning set, the best model attains 81% overall, and results on the hardest slice reveal clear robustness gaps between the leaders and the mid-tier. In summary, we release SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH; together, SKYLENAGE provides a hard, reasoning-centered, broad-coverage math benchmark with calibrated difficulty and rich metadata, serving as a reference for future evaluations of mathematical reasoning.
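As an illustration of the subject × model and grade × model analysis described above, the following sketch assumes a per-item results table with hypothetical columns model, subject, grade, and correct; these names and the toy rows are assumptions for illustration, not the paper's actual schema or data.

import pandas as pd

# Hypothetical per-item results: one row per (model, item) with a 0/1 correctness flag.
results = pd.DataFrame([
    {"model": "model_a", "subject": "algebra",  "grade": "high_school", "correct": 1},
    {"model": "model_a", "subject": "algebra",  "grade": "doctoral",    "correct": 0},
    {"model": "model_b", "subject": "geometry", "grade": "high_school", "correct": 1},
    {"model": "model_b", "subject": "geometry", "grade": "doctoral",    "correct": 1},
])

# Subject x model and grade x model accuracy matrices (mean of the 0/1 flags).
subject_by_model = results.pivot_table(index="subject", columns="model",
                                       values="correct", aggfunc="mean")
grade_by_model = results.pivot_table(index="grade", columns="model",
                                     values="correct", aggfunc="mean")

# Doctoral-to-high-school retention: ratio of stage-level accuracies per model.
retention = grade_by_model.loc["doctoral"] / grade_by_model.loc["high_school"]
print(grade_by_model)
print(retention)

With real per-item outputs in this shape, the same two pivot tables and the final ratio would reproduce the kind of subject-level, grade-level, and retention summaries the abstract reports.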