🤖 AI Summary
This study investigates the feasibility of end-to-end, function-level component generation by large language models (LLMs) in serverless (FaaS) architectures. Addressing the lack of systematic evaluation and insufficient contextual modeling in prior work, we propose a context-enhanced generation framework tailored to serverless environments: it masks real open-source functions and injects multi-level system context to guide LLMs (e.g., CodeLlama, GPT-4) toward semantically consistent function completions. We introduce the first joint evaluation framework integrating software engineering (SE) and natural language processing (NLP) metrics, assessing functional correctness, CodeBLEU (≥0.61), test pass rate, and maintainability. Evaluated across multiple real-world serverless repositories, our approach achieves up to a 38.2% end-to-end test pass rate, and the generated code approaches human-developed quality standards. This validates a novel architectural generation paradigm: from design decisions directly to deployable artifacts.
📝 Abstract
Recently, the exponential growth in the capability and pervasiveness of Large Language Models (LLMs) has led to significant work in the field of code generation. However, this generation has largely been limited to code snippets. Going one step further, our desideratum is to automatically generate architectural components. This would not only shorten development time, but could eventually allow the development phase to be skipped entirely, moving directly from design decisions to deployment. To this end, we conduct an exploratory study on the capability of LLMs to generate architectural components for Functions as a Service (FaaS), commonly known as serverless functions. The small size of their architectural components makes this architectural style more amenable to generation with current LLMs than other styles such as monoliths and microservices. We perform the study by systematically selecting open-source serverless repositories, masking a serverless function, and using state-of-the-art LLMs, provided with varying levels of context about the overall system, to generate the masked function. We evaluate correctness through the existing tests present in the repositories, and use metrics from the Software Engineering (SE) and Natural Language Processing (NLP) domains to evaluate code quality and the degree of similarity between human-written and LLM-generated code, respectively. Along with our findings, we also present a discussion on the path forward for using GenAI in architectural component generation.
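The mask-and-generate setup described above can be sketched in a few lines. This is a minimal, illustrative mock of the pipeline, not the paper's actual implementation: the mask token, the three context levels, and all function names here are assumptions, and a real study would locate functions via an AST or language server rather than the naive regex used below.

```python
# Hypothetical sketch of the study's mask-and-generate setup:
# remove one serverless function from a repository file, then build a
# prompt for the LLM at a chosen level of system context.
import re
from dataclasses import dataclass

MASK_TOKEN = "<MASKED_FUNCTION>"  # assumed placeholder, not from the paper


@dataclass
class Prompt:
    level: str  # "function", "file", or "system" (illustrative levels)
    text: str


def mask_function(source: str, func_name: str) -> tuple[str, str]:
    """Replace a top-level Python def with a mask token; return
    (masked source, original function) so the original can later be
    compared against the LLM's completion (e.g., via CodeBLEU)."""
    # Naive regex for a top-level def; real repositories would need an AST pass.
    pattern = re.compile(rf"(def {func_name}\(.*?\n(?:[ \t]+.*\n?)*)")
    match = pattern.search(source)
    if match is None:
        raise ValueError(f"function {func_name!r} not found")
    original = match.group(1)
    masked = source.replace(original, MASK_TOKEN + "\n")
    return masked, original


def build_prompt(masked_source: str, level: str,
                 readme: str = "", other_handlers: str = "") -> Prompt:
    """Assemble a generation prompt at one of three assumed context
    levels: the masked file alone, plus README, plus sibling handlers."""
    parts = [f"Complete the serverless function at {MASK_TOKEN}."]
    parts.append("File contents:\n" + masked_source)
    if level == "system":
        parts.append("Project README:\n" + readme)
        parts.append("Other handlers in the system:\n" + other_handlers)
    return Prompt(level=level, text="\n\n".join(parts))


# Toy serverless handler standing in for a real repository file.
demo_source = '''import json

def handler(event, context):
    body = json.loads(event["body"])
    return {"statusCode": 200, "body": json.dumps(body)}
'''

masked, gold = mask_function(demo_source, "handler")
prompt = build_prompt(masked, level="system", readme="Echo API service")
```

The prompt text would then be sent to the LLM, and the completion substituted for the mask token and run against the repository's existing test suite.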