🤖 AI Summary
This study identifies systematic deficiencies in large language models (LLMs) for moral foundation detection: high false-negative rates and severe under-detection of moral content, which limit their practical utility in moral reasoning. We conduct the first systematic comparison between LLMs and task-specific fine-tuned Transformer models trained on Twitter and Reddit data, using multi-dimensional evaluation via ROC, precision-recall (PR), and detection error tradeoff (DET) curves. Results demonstrate that prompt engineering fails to overcome these inherent limitations. Our key contributions are: (1) empirical evidence that fine-tuned models significantly outperform all tested LLMs across all metrics, confirming that task-specific fine-tuning remains indispensable; and (2) a principled argument that moral detection requires domain-adapted modeling beyond generic capabilities, underscoring the need for specialized architectural and training strategies in morally grounded NLP tasks.
📝 Abstract
Moral foundation detection is crucial for analyzing social discourse and developing ethically aligned AI systems. While large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear.
This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets using ROC, PR, and DET curve analysis.
Results reveal substantial performance gaps: LLMs exhibit high false-negative rates and systematic under-detection of moral content despite prompt engineering efforts. These findings demonstrate that task-specific fine-tuning remains superior to prompting for moral reasoning applications.
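The three evaluation views mentioned above (ROC, PR, and DET curves) can be sketched with scikit-learn. This is an illustrative example only: the labels and scores below are synthetic stand-ins, not data from the study, and assume binary per-foundation labels with continuous classifier scores.

```python
# Illustrative sketch of the ROC / PR / DET evaluation; data is synthetic.
import numpy as np
from sklearn.metrics import (
    roc_curve,
    precision_recall_curve,
    det_curve,
    roc_auc_score,
)

# Hypothetical gold labels (1 = moral foundation present) and model scores.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9, 0.6, 0.5])

# ROC: true-positive rate vs. false-positive rate across thresholds.
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)

# PR: precision vs. recall, more informative under class imbalance.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)

# DET: false-negative rate vs. false-positive rate, which makes the
# high-FNR regime attributed to LLMs directly visible.
det_fpr, fnr, det_thresholds = det_curve(y_true, y_score)

print(f"AUC = {roc_auc_score(y_true, y_score):.2f}")
```

Plotting each curve (e.g. with `sklearn.metrics.RocCurveDisplay` and its PR/DET counterparts) then gives the side-by-side comparison described in the abstract.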