Evaluating Large Language Models on Computer Science University Exams in Data Structures

📅 2026-04-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study evaluates how well large language models solve university-level data structures exam questions. The authors construct a new benchmark dataset of closed and multiple-choice questions drawn from the Data Structures course at Tel Aviv University and assess four models on it: GPT-4o, Claude 3.5, Mathstral 7B, and LLaMA 3 8B. The results show both the current performance and the limitations of these models on core computer science problems, adding a domain-specific benchmark for a foundational computing course. The work also offers empirical evidence relevant to integrating large language models into higher education.

📝 Abstract

We present a comprehensive evaluation of Large Language Models (LLMs) on Computer Science (CS) Data Structures examination questions. Our work introduces a new benchmark dataset comprising exam questions from Tel Aviv University (TAU), curated to assess LLMs' abilities in handling closed and multiple-choice questions. We evaluated two popular LLMs, OpenAI's GPT-4o and Anthropic's Claude 3.5, alongside two smaller LLMs, Mathstral 7B and LLaMA 3 8B, on the TAU exams benchmark. Our findings provide insight into the current capabilities of LLMs in CS education.
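
As a rough illustration of how a multiple-choice benchmark of this kind might be scored, the sketch below extracts an answer letter from each model reply and computes accuracy against an answer key. It is not the authors' evaluation code: the function names, the four-option A to D format, and the assumption that models are prompted to finish with a line such as "Answer: B" are all hypothetical.

```python
import re
from typing import Optional


def extract_choice(reply: str, options: str = "ABCD") -> Optional[str]:
    """Return the last 'Answer: X' letter in a model reply, or None.

    Assumption (not from the paper): the prompt asks the model to end
    with 'Answer: <letter>', so free-form replies yield None.
    """
    pattern = rf"answer\s*:\s*\(?([{options}])\b"
    matches = re.findall(pattern, reply, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else None


def accuracy(replies: list[str], answer_key: list[str]) -> float:
    """Fraction of replies whose extracted letter matches the key."""
    if len(replies) != len(answer_key):
        raise ValueError("replies and answer_key must have the same length")
    if not answer_key:
        return 0.0
    # An unparseable reply (None) never equals a key letter, so it counts as wrong.
    correct = sum(extract_choice(r) == k for r, k in zip(replies, answer_key))
    return correct / len(answer_key)
```

Counting unparseable replies as wrong is one possible design choice; an evaluation could instead report them separately so that formatting failures are not confused with incorrect answers.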

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Computer Science Education

Data Structures

Exam Evaluation

Benchmark Dataset

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark Dataset

Large Language Models

Computer Science Education