Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Humans exhibit cognitive uncertainty in moral dilemmas, whereas large language models (LLMs) often display overconfidence—undermining their trustworthiness in ethical decision-making. Method: We propose a novel inference-time uncertainty modulation paradigm: integrating dropout during autoregressive generation and quantifying uncertainty via a binary entropy framework—specifically, a linear combination of total entropy, conditional entropy, and mutual information. Contribution/Results: Experiments on classic moral judgment tasks (e.g., the trolley problem) demonstrate substantial improvement in LLM alignment with human moral preferences. Crucially, changes in mutual information strongly correlate with alignment gains, confirming that mitigating overconfidence enhances ethical reasoning quality. To our knowledge, this is the first work to integrate controllable uncertainty modeling with human–machine moral alignment, offering a principled pathway toward reliable, interpretable ethical AI systems.
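
For readers unfamiliar with the entropy terms named above, the sketch below spells out the standard decomposition used in the MC-dropout literature (total predictive entropy = conditional entropy + mutual information) for a binary yes/no judgment. This is an illustrative reading of the summary, not the paper's exact formulation; the authors describe a linear combination of these terms, whose weighting may differ.

```latex
% Sketch: standard predictive-uncertainty decomposition for a binary decision y,
% with theta the stochastic (dropout) model parameters and x the moral prompt.
\underbrace{\mathcal{H}\!\left[\mathbb{E}_{\theta}\, p(y \mid x, \theta)\right]}_{\text{total entropy}}
\;=\;
\underbrace{\mathbb{E}_{\theta}\, \mathcal{H}\!\left[p(y \mid x, \theta)\right]}_{\text{conditional entropy}}
\;+\;
\underbrace{\mathcal{I}\!\left(y; \theta \mid x\right)}_{\text{mutual information}},
\qquad
\mathcal{H}[p] = -p \log_2 p - (1 - p)\log_2 (1 - p).
```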

📝 Abstract
Humans display significant uncertainty when confronted with moral dilemmas, yet the extent of such uncertainty in machines and AI agents remains underexplored. Recent studies have confirmed the overly confident tendencies of machine-generated responses, particularly in large language models (LLMs). As these systems are increasingly embedded in ethical decision-making scenarios, understanding their moral reasoning and its inherent uncertainty is important for building reliable AI systems. This work examines how uncertainty influences moral decisions in the classical trolley problem, analyzing responses from 32 open-source models across 9 distinct moral dimensions. We first find that variance in model confidence is greater across models than across moral dimensions, suggesting that moral uncertainty is predominantly shaped by model architecture and training method. To quantify uncertainty, we measure binary entropy as a linear combination of total entropy, conditional entropy, and mutual information. To examine its effects, we introduce stochasticity into models via "dropout" at inference time. Our findings show that our mechanism increases total entropy, mainly through a rise in mutual information, while conditional entropy remains largely unchanged. Moreover, this mechanism significantly improves human-LLM moral alignment, with shifts in mutual information correlating with shifts in alignment scores. Our results highlight the potential to better align model-generated decisions with human preferences by deliberately modulating uncertainty and reducing LLMs' confidence in morally complex scenarios.
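
The paper's dropout mechanism is not reproduced here, but a minimal sketch of the general idea is shown below, assuming a Hugging Face causal LM: dropout layers are kept stochastic at inference time, the yes/no probability for a trolley-style prompt is sampled over several forward passes, and the resulting binary entropy is decomposed into total entropy, conditional entropy, and mutual information. The model name, prompt wording, answer tokens, and sample count are illustrative placeholders, not the authors' settings.

```python
# Minimal sketch (not the authors' implementation): MC-dropout-style
# inference-time stochasticity plus a binary entropy decomposition.
import math
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper evaluates 32 open-source models

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()


def enable_inference_dropout(m: nn.Module) -> None:
    """Keep dropout layers stochastic while the rest of the model stays in eval mode."""
    for module in m.modules():
        if isinstance(module, nn.Dropout):
            module.train()


def yes_probability(prompt: str) -> float:
    """Probability of answering 'Yes', renormalized over the 'Yes'/'No' next tokens."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    # Use the first sub-token of each answer word (single tokens for GPT-2-style vocabularies).
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    p = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return p[0].item()


def binary_entropy(p: float, eps: float = 1e-12) -> float:
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


prompt = ("A runaway trolley will hit five people unless you pull a lever, "
          "diverting it onto a track with one person. Should you pull the lever? Answer Yes or No:")

enable_inference_dropout(model)                          # dropout stays active at inference time
samples = [yes_probability(prompt) for _ in range(20)]   # stochastic forward passes
mean_p = sum(samples) / len(samples)

total_entropy = binary_entropy(mean_p)                                          # H of the mean prediction
conditional_entropy = sum(binary_entropy(p) for p in samples) / len(samples)    # expected per-pass H
mutual_information = total_entropy - conditional_entropy                        # epistemic disagreement

print(f"total={total_entropy:.3f}  conditional={conditional_entropy:.3f}  MI={mutual_information:.3f}")
```

Under this decomposition, stronger inference-time dropout should mainly raise mutual information (disagreement across stochastic passes) while leaving per-pass conditional entropy largely unchanged, which mirrors the trend reported in the abstract.
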
Problem

Research questions and friction points this paper is trying to address.

Examining moral uncertainty in AI systems during ethical decision-making scenarios
Quantifying uncertainty in LLM responses to moral dilemmas like trolley problems
Improving human-LLM moral alignment by modulating model confidence levels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing dropout at inference time to inject controlled stochasticity into model responses
Quantifying uncertainty via binary entropy, expressed as a combination of total entropy, conditional entropy, and mutual information
Modulating model confidence to improve human-LLM moral alignment