Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Humans exhibit cognitive uncertainty in moral dilemmas, whereas large language models (LLMs) often display overconfidence—undermining their trustworthiness in ethical decision-making. Method: We propose a novel inference-time uncertainty modulation paradigm: integrating dropout during autoregressive generation and quantifying uncertainty via a binary entropy framework—specifically, a linear combination of total entropy, conditional entropy, and mutual information. Contribution/Results: Experiments on classic moral judgment tasks (e.g., the trolley problem) demonstrate substantial improvement in LLM alignment with human moral preferences. Crucially, changes in mutual information strongly correlate with alignment gains, confirming that mitigating overconfidence enhances ethical reasoning quality. To our knowledge, this is the first work to integrate controllable uncertainty modeling with human–machine moral alignment, offering a principled pathway toward reliable, interpretable ethical AI systems.
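
For readers unfamiliar with the entropy terms named above, the sketch below spells out the standard decomposition used in the MC-dropout literature (total predictive entropy = conditional entropy + mutual information) for a binary yes/no judgment. This is an illustrative reading of the summary, not the paper's exact formulation; the authors describe a linear combination of these terms, whose weighting may differ.

```latex
% Sketch: standard predictive-uncertainty decomposition for a binary decision y,
% with theta the stochastic (dropout) model parameters and x the moral prompt.
\underbrace{\mathcal{H}\!\left[\mathbb{E}_{\theta}\, p(y \mid x, \theta)\right]}_{\text{total entropy}}
\;=\;
\underbrace{\mathbb{E}_{\theta}\, \mathcal{H}\!\left[p(y \mid x, \theta)\right]}_{\text{conditional entropy}}
\;+\;
\underbrace{\mathcal{I}\!\left(y; \theta \mid x\right)}_{\text{mutual information}},
\qquad
\mathcal{H}[p] = -p \log_2 p - (1 - p)\log_2 (1 - p).
```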

📝 Abstract
Humans display significant uncertainty when confronted with moral dilemmas, yet the extent of such uncertainty in machines and AI agents remains underexplored. Recent studies have confirmed the overly confident tendencies of machine-generated responses, particularly in large language models (LLMs). As these systems are increasingly embedded in ethical decision-making scenarios, understanding their moral reasoning and its inherent uncertainty is important for building reliable AI systems. This work examines how uncertainty influences moral decisions in the classical trolley problem, analyzing responses from 32 open-source models across 9 distinct moral dimensions. We first find that variance in model confidence is greater across models than across moral dimensions, suggesting that moral uncertainty is predominantly shaped by model architecture and training method. To quantify uncertainty, we measure binary entropy as a linear combination of total entropy, conditional entropy, and mutual information. To examine its effects, we introduce stochasticity into models via "dropout" at inference time. Our findings show that our mechanism increases total entropy, mainly through a rise in mutual information, while conditional entropy remains largely unchanged. Moreover, this mechanism significantly improves human-LLM moral alignment, with shifts in mutual information correlating with shifts in alignment scores. Our results highlight the potential to better align model-generated decisions with human preferences by deliberately modulating uncertainty and reducing LLMs' confidence in morally complex scenarios.
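
The paper's dropout mechanism is not reproduced here, but a minimal sketch of the general idea is shown below, assuming a Hugging Face causal LM: dropout layers are kept stochastic at inference time, the yes/no probability for a trolley-style prompt is sampled over several forward passes, and the resulting binary entropy is decomposed into total entropy, conditional entropy, and mutual information. The model name, prompt wording, answer tokens, and sample count are illustrative placeholders, not the authors' settings.

```python
# Minimal sketch (not the authors' implementation): MC-dropout-style
# inference-time stochasticity plus a binary entropy decomposition.
import math
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper evaluates 32 open-source models

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()


def enable_inference_dropout(m: nn.Module) -> None:
    """Keep dropout layers stochastic while the rest of the model stays in eval mode."""
    for module in m.modules():
        if isinstance(module, nn.Dropout):
            module.train()


def yes_probability(prompt: str) -> float:
    """Probability of answering 'Yes', renormalized over the 'Yes'/'No' next tokens."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    # Use the first sub-token of each answer word (single tokens for GPT-2-style vocabularies).
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    p = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return p[0].item()


def binary_entropy(p: float, eps: float = 1e-12) -> float:
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


prompt = ("A runaway trolley will hit five people unless you pull a lever, "
          "diverting it onto a track with one person. Should you pull the lever? Answer Yes or No:")

enable_inference_dropout(model)                          # dropout stays active at inference time
samples = [yes_probability(prompt) for _ in range(20)]   # stochastic forward passes
mean_p = sum(samples) / len(samples)

total_entropy = binary_entropy(mean_p)                                          # H of the mean prediction
conditional_entropy = sum(binary_entropy(p) for p in samples) / len(samples)    # expected per-pass H
mutual_information = total_entropy - conditional_entropy                        # epistemic disagreement

print(f"total={total_entropy:.3f}  conditional={conditional_entropy:.3f}  MI={mutual_information:.3f}")
```

Under this decomposition, stronger inference-time dropout should mainly raise mutual information (disagreement across stochastic passes) while leaving per-pass conditional entropy largely unchanged, which mirrors the trend reported in the abstract.
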
Problem

Research questions and friction points this paper is trying to address.

Examining moral uncertainty in AI systems during ethical decision-making scenarios
Quantifying uncertainty in LLM responses to moral dilemmas like trolley problems
Improving human-LLM moral alignment by modulating model confidence levels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing dropout at inference time to inject controlled stochasticity into model responses
Quantifying uncertainty via binary entropy, expressed as a combination of total entropy, conditional entropy, and mutual information
Modulating model confidence to improve human-LLM moral alignment