AI Summary
This study investigates whether large language models (LLMs) rely on shallow heuristics or memorization rather than genuine reasoning when generating software tests, particularly for complex systems absent from their training data. It compares LLM-generated tests for the open-source LevelDB with those for the proprietary SAP HANA database, combining mutation scores, iterative compiler-feedback repair loops, and Mitchell's mechanism-focused evaluation methodology, and in doing so applies mechanism-oriented methods from cognitive science to software testing. The findings show that LLMs perform well on familiar systems but degrade significantly on unseen ones, often prioritizing syntactic compilability over semantic correctness. This points to a lack of robust reasoning in current LLMs and offers an approach for evaluating LLM reasoning in software engineering contexts.
Abstract
Large Language Models (LLMs) have achieved impressive results on public benchmarks, often leading to claims of advanced reasoning and understanding. However, recent research in cognitive science reveals that these models sometimes rely on shallow heuristics and memorization, taking shortcuts rather than demonstrating genuine cognitive abilities. This paper investigates LLM behavior in automated test generation for software, contrasting performance on an open-source system (LevelDB) with SAP HANA, one of the most widely deployed commercial database systems worldwide, whose proprietary codebase is guaranteed to be absent from training data. We combine cognitive evaluation principles, drawing on Mitchell's mechanism-focused assessment methodology, with empirical software testing, employing mutation score and iterative compiler-feedback repair loops to assess both accuracy and underlying reasoning strategies. Results show that LLMs excel on familiar, open-source benchmarks but struggle with unseen, complex domains, often prioritizing compilability over semantic effectiveness. These findings provide independent software engineering evidence for the broader claim that current LLMs lack robust reasoning, and highlight the need for evaluation frameworks that penalize trivial shortcuts and reward true generalization.
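To make the gap between compilability and semantic effectiveness concrete, the following is a minimal sketch of how a mutation score is conventionally computed: the fraction of non-equivalent mutants that a test suite kills. The `MutantResult` type, its field names, and the sample data are hypothetical and chosen for illustration only; they are not taken from the paper's tooling or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutantResult:
    """Outcome of running a test suite against one mutant (hypothetical type)."""
    mutant_id: str
    killed: bool              # True if at least one test failed on this mutant
    equivalent: bool = False  # True if the mutant behaves identically to the original


def mutation_score(results):
    """Return killed / non-equivalent mutants, the conventional mutation score."""
    scored = [r for r in results if not r.equivalent]
    if not scored:
        raise ValueError("no non-equivalent mutants to score")
    killed = sum(1 for r in scored if r.killed)
    return killed / len(scored)


# A suite that compiles but asserts nothing kills no mutants.
compiles_only = [
    MutantResult("m1", killed=False),
    MutantResult("m2", killed=False),
]

# A suite with meaningful assertions kills most mutants;
# the equivalent mutant m4 is excluded from the denominator.
meaningful = [
    MutantResult("m1", killed=True),
    MutantResult("m2", killed=True),
    MutantResult("m3", killed=False),
    MutantResult("m4", killed=False, equivalent=True),
    MutantResult("m5", killed=True),
]

print(mutation_score(compiles_only))  # 0.0
print(mutation_score(meaningful))     # 0.75
```

Under this metric a test suite that merely compiles and runs scores zero, which is why mutation score can separate syntactic success from tests that actually detect faults.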