LLMs taking shortcuts in test generation: A study with SAP HANA and LevelDB

2026-04-15

AI Summary
This study investigates whether large language models (LLMs) rely on shallow heuristics or memorization rather than genuine reasoning when generating software tests, particularly for complex systems absent from their training data. It compares LLM-generated tests for the open-source LevelDB with tests for the proprietary SAP HANA database, and evaluates them with mutation scores, iterative compiler-feedback repair loops, and Mitchell's mechanism-focused assessment methodology, applying mechanism-oriented evaluation methods from cognitive science to software testing. The findings show that LLMs perform well on familiar systems but degrade significantly on unseen ones, where they often prioritize syntactic compilability over semantic correctness. This points to a lack of robust reasoning in current LLMs and motivates evaluation frameworks that penalize trivial shortcuts and reward generalization.
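For readers unfamiliar with mutation score, the following is a minimal sketch of the idea, not the paper's tooling. A test suite is run against deliberately faulty variants ("mutants") of the code, and the score is the fraction of mutants the suite detects. The function `add`, the two mutants, and both tests are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence

BinaryOp = Callable[[int, int], int]


def add(a: int, b: int) -> int:
    """Hypothetical function under test."""
    return a + b


# Hypothetical mutants: each replaces the '+' operator with another one.
def mutant_sub(a: int, b: int) -> int:
    return a - b


def mutant_mul(a: int, b: int) -> int:
    return a * b


MUTANTS = [mutant_sub, mutant_mul]


def weak_test(fn: BinaryOp) -> bool:
    """Runs fine, but 0 + 0, 0 - 0 and 0 * 0 are all 0, so no mutant is detected."""
    return fn(0, 0) == 0


def strong_test(fn: BinaryOp) -> bool:
    """Uses inputs on which the mutants give a different result."""
    return fn(2, 3) == 5


def mutation_score(test: Callable[[BinaryOp], bool],
                   original: BinaryOp,
                   mutants: Sequence[BinaryOp]) -> float:
    """Fraction of mutants on which the test fails (i.e. mutants 'killed')."""
    if not test(original):
        raise ValueError("the test must pass on the original code")
    killed = sum(1 for mutant in mutants if not test(mutant))
    return killed / len(mutants)


print(mutation_score(weak_test, add, MUTANTS))    # 0.0
print(mutation_score(strong_test, add, MUTANTS))  # 1.0
```

Both tests pass on the original code, so a pass/fail check alone cannot separate them; only the mutation score exposes that the weak test verifies nothing useful.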

Abstract
Large Language Models (LLMs) have achieved impressive results on public benchmarks, often leading to claims of advanced reasoning and understanding. However, recent research in cognitive science reveals that these models sometimes rely on shallow heuristics and memorization, taking shortcuts rather than demonstrating genuine cognitive abilities. This paper investigates LLM behavior in automated test generation for software, contrasting performance on an open-source system (LevelDB) with SAP HANA, one of the most widely deployed commercial database systems worldwide, whose proprietary codebase is guaranteed to be absent from training data. We combine cognitive evaluation principles, drawing on Mitchell's mechanism-focused assessment methodology, with empirical software testing, employing mutation score and iterative compiler-feedback repair loops to assess both accuracy and underlying reasoning strategies. Results show that LLMs excel on familiar, open-source benchmarks but struggle with unseen, complex domains, often prioritizing compilability over semantic effectiveness. These findings provide independent software engineering evidence for the broader claim that current LLMs lack robust reasoning, and highlight the need for evaluation frameworks that penalize trivial shortcuts and reward true generalization.
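An iterative compiler-feedback repair loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: Python's built-in `compile()` stands in for the real compiler, and `fake_model` is a hypothetical stub in place of an LLM call.

```python
from typing import Callable, Optional, Tuple

# A generator takes the previous source and the last compiler error
# (both None on the first round) and returns new source code.
Generator = Callable[[Optional[str], Optional[str]], str]


def try_compile(source: str) -> Optional[str]:
    """Return None if the source compiles, otherwise a short error message."""
    try:
        compile(source, "<generated_test>", "exec")
        return None
    except SyntaxError as err:
        return f"line {err.lineno}: {err.msg}"


def repair_loop(generate: Generator, max_rounds: int = 3) -> Tuple[Optional[str], int]:
    """Regenerate until the code compiles or the round budget is exhausted."""
    source: Optional[str] = None
    error: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        source = generate(source, error)
        error = try_compile(source)
        if error is None:
            return source, round_no
    return None, max_rounds


def fake_model(previous: Optional[str], error: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: broken first, 'repaired' after feedback."""
    if error is None:
        return "def test_put_get(:\n    assert True\n"  # syntax error
    return "def test_put_get():\n    assert True\n"     # compiles, checks nothing


source, rounds = repair_loop(fake_model)
print(rounds)
print(source)
```

The loop stops as soon as the code compiles, so the "repaired" test here is accepted even though `assert True` verifies nothing. That is the kind of compilability-over-semantics shortcut the abstract describes, and it is why a semantic measure such as mutation score is needed alongside the compile check.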
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
test generation
reasoning
shortcuts
software testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM reasoning
test generation
cognitive evaluation
mutation score
generalization