🤖 AI Summary
This study evaluates the moral reasoning capabilities of large language models (LLMs) on 30 real-world software engineering (SE) ethics scenarios in a zero-shot setting, focusing on explanation stability and theoretical consistency. Method: We propose the first automated evaluation framework for SE ethics reasoning, introducing two metrics—Theory Consistency Rate (TCR) and Binary Agreement Rate (BAR)—and a three-stage zero-shot prompting paradigm comprising ethical-theory identification, moral-acceptability judgment, and explanation generation. Contribution/Results: Experiments across 16 mainstream LLMs, benchmarked against expert annotations, show an average theory consistency of 73.3%, moral-judgment agreement of 86.7%, and substantial conceptual convergence in free-text explanations. These findings support the potential of LLMs as lightweight, interpretable ethical reasoning engines suitable for integration into SE toolchains.
📝 Abstract
Large Language Models (LLMs) are increasingly integrated into software engineering (SE) tools for tasks that extend beyond code synthesis, including judgment under uncertainty and reasoning in ethically significant contexts. We present a fully automated framework for assessing ethical reasoning capabilities across 16 LLMs in a zero-shot setting, using 30 real-world ethically charged scenarios. Each model is prompted to identify the ethical theory most applicable to a given action, assess the action's moral acceptability, and explain the reasoning behind its choices. Responses are compared against expert ethicists' choices using inter-model agreement metrics. Our results show that LLMs achieve an average Theory Consistency Rate (TCR) of 73.3% and a Binary Agreement Rate (BAR) on moral acceptability of 86.7%, with interpretable divergences concentrated in ethically ambiguous cases. A qualitative analysis of free-text explanations reveals strong conceptual convergence across models despite surface-level lexical diversity. These findings support the potential viability of LLMs as ethical inference engines within SE pipelines, enabling scalable, auditable, and adaptive integration of user-aligned ethical reasoning. Our focus is the Ethical Interpreter component of a broader profiling pipeline: we evaluate whether current LLMs exhibit sufficient interpretive stability and theory-consistent reasoning to support automated profiling.
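Both metrics are, in essence, simple proportions of agreement with expert annotations over the 30 scenarios. The sketch below illustrates that reading; the function names, data layout, and labels are assumptions for illustration, not the paper's actual code or data.

```python
# Hypothetical sketch of the two agreement metrics described in the abstract.
# Assumed data layout: one record per scenario holding the model's output
# and the expert ethicists' annotation.

def theory_consistency_rate(records):
    """TCR: fraction of scenarios where the model's chosen ethical theory
    matches the expert-annotated theory."""
    hits = sum(r["model_theory"] == r["expert_theory"] for r in records)
    return hits / len(records)

def binary_agreement_rate(records):
    """BAR: fraction of scenarios where the model's acceptable /
    not-acceptable verdict matches the expert verdict."""
    hits = sum(r["model_acceptable"] == r["expert_acceptable"] for r in records)
    return hits / len(records)

# Tiny invented example: theories agree on 1 of 2 scenarios,
# acceptability verdicts agree on both.
records = [
    {"model_theory": "deontology", "expert_theory": "deontology",
     "model_acceptable": False, "expert_acceptable": False},
    {"model_theory": "utilitarianism", "expert_theory": "virtue ethics",
     "model_acceptable": True, "expert_acceptable": True},
]
print(theory_consistency_rate(records))  # 0.5
print(binary_agreement_rate(records))    # 1.0
```

On the reported figures, a TCR of 73.3% and BAR of 86.7% over 30 scenarios correspond to agreement on 22 and 26 scenarios, respectively.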