SKYLENAGE Technical Report: Mathematical Reasoning and Contest-Innovation Benchmarks for Multi-Level Math Evaluation

📅 2025-09-23
🤖 AI Summary
Existing mathematical evaluation benchmarks suffer from ceiling effects, limiting their ability to discriminate the fine-grained reasoning capabilities of state-of-the-art large language models (LLMs). To address this, we propose SKYLENAGE, a dual-benchmark framework comprising SKYLENAGE-ReasoningMATH (emphasizing multi-step deductive reasoning) and SKYLENAGE-MATH (spanning high-school to doctoral-level difficulty). This constitutes the first structure-aware, metadata-rich, contest-style diagnostic evaluation suite across seven mathematical subjects and four difficulty stages. Leveraging fine-grained difficulty calibration, subject categorization, and metadata-driven analysis, we evaluate 15 leading LLMs under a uniform setup. Results reveal that even the strongest model achieves only 44% accuracy on competition-level problems versus 81% on structured reasoning tasks, highlighting a critical bottleneck in higher-order mathematical reasoning. Furthermore, top systems retain roughly 79% of their high-school accuracy on doctoral-level problems, and this doctoral-to-high-school retention serves as a novel metric for how mathematical reasoning capability transfers across difficulty levels.
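
One plausible formalization of the retention figure (our notation; the summary does not spell out the formula): with Acc_g denoting a model's accuracy on grade band g,

\[
R_{\mathrm{doc}\to\mathrm{hs}} \;=\; \frac{\mathrm{Acc}_{\mathrm{doctoral}}}{\mathrm{Acc}_{\mathrm{high\ school}}} \;\approx\; 0.79
\]

so, under this reading, a model answering 60% of high-school items correctly would be expected to answer about 47% of doctoral items.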

📝 Abstract
Large language models (LLMs) now perform strongly on many public math suites, yet frontier separation within mathematics increasingly suffers from ceiling effects. We present two complementary benchmarks: SKYLENAGE-ReasoningMATH, a 100-item, structure-aware diagnostic set with per-item metadata on length, numeric density, and symbolic complexity; and SKYLENAGE-MATH, a 150-item contest-style suite spanning four stages, from high school to doctoral, under a seven-subject taxonomy. We evaluate fifteen contemporary LLM variants under a single setup and analyze subject × model and grade × model performance. On the contest suite, the strongest model reaches 44% while the runner-up reaches 37%; accuracy declines from high school to doctoral, and top systems exhibit a doctoral-to-high-school retention near 79%. On the reasoning set, the best model attains 81% overall, and results on the hardest slice reveal clear robustness gaps between the leaders and the mid-tier. In summary, we release SKYLENAGE-ReasoningMATH and report aggregate results for SKYLENAGE-MATH; together, SKYLENAGE provides a hard, reasoning-centered, broad-coverage math benchmark with calibrated difficulty and rich metadata, serving as a reference for future evaluations of mathematical reasoning.
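
For concreteness, a minimal Python sketch of how the grade × model accuracy table and the doctoral-to-high-school retention could be computed from per-item results; the column names (model, grade, correct) are our assumptions about the layout, not the released evaluation format.

# Minimal sketch: grade-by-model accuracy and doctoral-to-high-school
# retention from per-item outcomes. Column names are assumptions.
import pandas as pd

results = pd.DataFrame([
    # one row per (model, item): correct = 1 if the model solved the item
    {"model": "model_a", "grade": "high_school", "correct": 1},
    {"model": "model_a", "grade": "high_school", "correct": 1},
    {"model": "model_a", "grade": "doctoral",    "correct": 1},
    {"model": "model_a", "grade": "doctoral",    "correct": 0},
    {"model": "model_b", "grade": "high_school", "correct": 1},
    {"model": "model_b", "grade": "doctoral",    "correct": 0},
])

# Grade x model accuracy table (mean of 0/1 outcomes per cell).
acc = results.groupby(["model", "grade"])["correct"].mean().unstack("grade")

# Retention: doctoral accuracy as a fraction of high-school accuracy.
acc["retention"] = acc["doctoral"] / acc["high_school"]
print(acc)

The same groupby pattern extends to the subject × model analysis by swapping "grade" for a "subject" column.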
Problem

Research questions and friction points this paper is trying to address.

Addressing ceiling effects in mathematical reasoning evaluation
Providing multi-level math benchmarks with calibrated difficulty
Evaluating LLM performance across educational stages and subjects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structure-aware diagnostic benchmark with metadata
Contest-style suite spanning four academic levels
Rich metadata and calibrated difficulty for evaluation (a metadata sketch follows this list)
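
As referenced in the last bullet, a hedged sketch of what a per-item metadata record could look like, given the abstract's mention of length, numeric density, and symbolic complexity; the field names and the toy scoring below are illustrative assumptions, not the released schema.

# Hypothetical per-item metadata record for a structure-aware benchmark.
# Field names and the toy scoring heuristics are illustrative assumptions.
import re
from dataclasses import dataclass

@dataclass
class ItemMetadata:
    item_id: str
    subject: str              # one of the seven-subject taxonomy labels
    length: int               # problem statement length in characters
    numeric_density: float    # fraction of tokens containing a numeral
    symbolic_complexity: int  # count of math operators/relations

def build_metadata(item_id: str, subject: str, text: str) -> ItemMetadata:
    tokens = text.split()
    numerals = [t for t in tokens if re.search(r"\d", t)]
    symbols = re.findall(r"[=+\-*/^<>]", text)
    return ItemMetadata(
        item_id=item_id,
        subject=subject,
        length=len(text),
        numeric_density=len(numerals) / max(len(tokens), 1),
        symbolic_complexity=len(symbols),
    )

print(build_metadata("rm-001", "algebra", "Solve x^2 - 5x + 6 = 0."))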
Authors

Hu Wei (Alibaba Group)
Ze Xu (Alibaba Group)
Boyu Yang (Alibaba Group)
Linlin Miao (Alibaba Group)
Weiqi Zhai (Alibaba Group)
Yihan Li (Alibaba Group)
Zixuan Li (ICT, UCAS)
Zhijun Wang (Institute of Physics, Chinese Academy of Sciences)
Boya Wang (California Institute of Technology)
Jianwei Yu (Tencent AI Lab)
Jialing Yuan (Alibaba Group)
Xiaoyue Zhang (Alibaba Group)
Cheng He (Alibaba Group)
Minglei Chen (Alibaba Group)
Zifan Zhang (NC State)
Qianhui Li (Alibaba Group)
Wei Wang (Alibaba Group)
Xiang Xu (Alibaba Group)