U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

📅 2024-12-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess large language models' (LLMs) university-level mathematical competence: they are limited in scale, skew toward K-12 difficulty, cover few disciplines, and lack multimodal items. Method: The authors introduce U-MATH, presented as the first comprehensive university-mathematics evaluation benchmark, comprising 1,100 original open-ended problems spanning six core disciplines, 20% of which are image-text multimodal; alongside it they construct μ-MATH, a meta-evaluation benchmark for the discriminative task of judging solution correctness. The approach combines advanced undergraduate mathematics content with multimodal reasoning and proposes an LLM-based automatic solution evaluation paradigm. Results: State-of-the-art LLMs achieve only 63% accuracy on textual problems, dropping sharply to 45% on visual ones; the best judge model attains merely 80% F1 on μ-MATH, exposing clear limits in LLMs' higher-order mathematical reasoning capabilities.

📝 Abstract
The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, primarily focus on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored. To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, with 20% of the problems being multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release μ-MATH, a dataset for evaluating LLMs' ability to judge solutions. The evaluation of general-domain, math-specific, and multimodal LLMs highlights the challenges presented by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks, and an even lower 45% on visual problems. Solution assessment also proves challenging for LLMs, with the best LLM judge reaching an F1-score of 80% on μ-MATH.
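The grading setup the abstract describes is an LLM-as-judge pipeline: the judge model sees the problem, a reference answer, and a candidate solution, and emits a binary verdict. A minimal sketch of that pattern follows; the prompt wording, function names, and the mocked judge call are illustrative assumptions, not taken from the paper.

```python
import re


def build_judge_prompt(problem: str, reference: str, candidate: str) -> str:
    """Assemble a grading prompt that asks for an explicit final verdict."""
    return (
        "You are grading a university-level math solution.\n"
        f"Problem: {problem}\n"
        f"Reference answer: {reference}\n"
        f"Candidate solution: {candidate}\n"
        "Reply with a final line 'Verdict: CORRECT' or 'Verdict: INCORRECT'."
    )


def parse_verdict(judge_output: str) -> bool:
    """Extract the binary verdict from free-form judge output."""
    match = re.search(r"Verdict:\s*(CORRECT|INCORRECT)", judge_output, re.IGNORECASE)
    if match is None:
        raise ValueError("no verdict found in judge output")
    return match.group(1).upper() == "CORRECT"


def mock_judge(prompt: str) -> str:
    """Stand-in for a real LLM call; a real pipeline queries a judge model here."""
    return "The candidate matches the reference answer.\nVerdict: CORRECT"


prompt = build_judge_prompt("Compute the derivative of x^2.", "2x", "d/dx x^2 = 2x")
print(parse_verdict(mock_judge(prompt)))  # True
```

Requiring a fixed `Verdict:` line makes the judge's free-form reasoning parseable, which is what lets μ-MATH score judges with standard metrics such as F1.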
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
University-Level Mathematics
Assessment Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

U-MATH
μ-MATH
multimodal math problems
Konstantin Chernyshev
Toloka AI
Vitaliy Polshkov
Toloka AI
Ekaterina Artemova
Toloka.AI, ex-HSE, ex-LMU
natural language processing, benchmarking, large language models
Sergei Tilga
Toloka AI
Alex Myasnikov
Gradarius
Vlad Stepanov
Gradarius
Alexei G. Myasnikov
Gradarius, Stevens Institute of Technology