UCRBench: Benchmarking LLMs on Use Case Recovery

📅 2025-12-15
🤖 AI Summary
Current large language models (LLMs) lack behaviorally grounded benchmarks for recovering user use cases from source code, leaving their evaluation on this task insufficiently rigorous. This paper introduces UCRBench, the first code-aligned use case benchmark, comprising nine real-world software projects and enabling systematic assessment of both high-level user-goal use cases and fine-grained subfunction use cases. The authors construct a manually validated code–use-case alignment dataset and propose a hierarchical evaluation protocol measuring actor correctness, name accuracy, path fidelity, and behavioral coverage. Leveraging functional semantic analysis and structural consistency modeling, they quantitatively characterize the boundaries of LLMs in cross-project, multi-module, and domain-specific use case reverse-engineering tasks, revealing critical limitations including high omission rates and inconsistent abstraction.

📝 Abstract
Use cases are widely employed to specify functional requirements, yet existing benchmarks are scarce and risk being misaligned with actual system behavior, thereby limiting the rigorous evaluation of large language models (LLMs) in generating use cases from source code. We address this gap by introducing code-aligned use case benchmarks, constructed through manual validation of both user-goal and subfunction use cases across nine real-world software projects. Using this benchmark, we conduct the first systematic study of LLMs on this task and propose a hierarchical evaluation protocol that assesses actor correctness, name accuracy, path fidelity, and behavioral coverage. The results show that while LLMs can partially reconstruct system functionality, their performance varies significantly across projects, with particularly noticeable shortcomings in domain-specific and multi-module systems. The models also exhibit high omission rates and struggle to maintain consistent abstraction when aggregating subfunctions into user-goal use cases, highlighting both the potential and the current limitations of LLM-based use case reverse engineering.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLMs for use case recovery from source code
Assessing LLMs' accuracy in generating user-goal and subfunction use cases
Evaluating LLMs' limitations in domain-specific and multi-module systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Code-aligned use case benchmarks from real projects
Hierarchical evaluation protocol for LLM performance
Systematic study of LLM-based use case reverse engineering
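To make the hierarchical evaluation idea concrete, the sketch below scores LLM-recovered use cases against code-aligned ground truth along the four dimensions named in the abstract. The metric definitions (Jaccard token overlap for name accuracy, a greedy name-based matching with an illustrative 0.3 threshold, step overlap for path fidelity) are assumptions for illustration, not the paper's exact formulas.

```python
# Illustrative sketch of a hierarchical use case evaluation protocol.
# Each use case is a dict with keys: 'actor', 'name', 'steps' (main path).
# Metric definitions here are hypothetical stand-ins, not UCRBench's own.

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a simple proxy for name accuracy."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def evaluate(recovered: list, gold: list) -> dict:
    """Greedily match each gold use case to its best-named recovered one,
    then average per-dimension scores over the matched pairs."""
    matched, used = [], set()
    for g in gold:
        best, best_score = None, 0.0
        for i, r in enumerate(recovered):
            if i in used:
                continue
            s = token_overlap(r["name"], g["name"])
            if s > best_score:
                best, best_score = i, s
        if best is not None and best_score >= 0.3:  # illustrative threshold
            used.add(best)
            matched.append((recovered[best], g))
    n = len(gold)
    if not matched:
        return {"actor_correctness": 0.0, "name_accuracy": 0.0,
                "path_fidelity": 0.0, "behavioral_coverage": 0.0,
                "omission_rate": 1.0}
    actor = sum(r["actor"] == g["actor"] for r, g in matched) / len(matched)
    name = sum(token_overlap(r["name"], g["name"]) for r, g in matched) / len(matched)
    # Path fidelity: fraction of gold main-path steps present in the recovered path.
    path = sum(len(set(r["steps"]) & set(g["steps"])) / len(g["steps"])
               for r, g in matched) / len(matched)
    return {"actor_correctness": actor,
            "name_accuracy": name,
            "path_fidelity": path,
            "behavioral_coverage": len(matched) / n,  # gold cases recovered at all
            "omission_rate": 1 - len(matched) / n}    # gold cases the model missed
```

An omitted gold use case lowers behavioral coverage and raises the omission rate, while a matched but incomplete main path lowers path fidelity, which mirrors the two failure modes the study highlights.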
Authors

Shuyuan Xiao (East China Normal University, Shanghai, China)
Yiran Zhang (Nanyang Technological University, Singapore)
Weisong Sun (Nanyang Technological University, Trustworthy Intelligent SE (Software Engineering))
Xiaohong Chen (East China Normal University, Shanghai, China)
Yang Liu (Nanyang Technological University, Singapore)
Zhi Jin (Sun Yat-Sen University, Associate Professor)