🤖 AI Summary
This study identifies systematic deficiencies in large language models (LLMs) for moral foundation detection: high false-negative rates and severe under-detection of moral content, which limit their practical utility in moral reasoning. We conduct the first systematic comparison between LLMs and task-specific fine-tuned Transformer models trained on Twitter and Reddit data, using multi-dimensional evaluation via ROC, precision-recall (PR), and detection error tradeoff (DET) curves. Results demonstrate that prompt engineering fails to overcome these inherent limitations. Our key contributions are: (1) empirical evidence that fine-tuned models significantly outperform all tested LLMs across all metrics, confirming that task-specific fine-tuning remains indispensable; and (2) a principled argument that moral detection requires domain-adapted modeling beyond generic capabilities, underscoring the need for specialized architectural and training strategies in morally grounded NLP tasks.
📝 Abstract
Moral foundation detection is crucial for analyzing social discourse and developing ethically aligned AI systems. While large language models excel across diverse tasks, their performance on specialized moral reasoning remains unclear.
This study provides the first comprehensive comparison between state-of-the-art LLMs and fine-tuned transformers across Twitter and Reddit datasets using ROC, PR, and DET curve analysis.
Results reveal substantial performance gaps: LLMs exhibit high false-negative rates and systematic under-detection of moral content despite prompt engineering efforts. These findings demonstrate that task-specific fine-tuning remains superior to prompting for moral reasoning applications.
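The three evaluation views mentioned above (ROC, PR, and DET curves) can be sketched with scikit-learn. This is an illustrative example only: the labels and scores below are synthetic stand-ins, not data from the study, and assume binary per-foundation labels with continuous classifier scores.

```python
# Illustrative sketch of the ROC / PR / DET evaluation; data is synthetic.
import numpy as np
from sklearn.metrics import (
    roc_curve,
    precision_recall_curve,
    det_curve,
    roc_auc_score,
)

# Hypothetical gold labels (1 = moral foundation present) and model scores.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9, 0.6, 0.5])

# ROC: true-positive rate vs. false-positive rate across thresholds.
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)

# PR: precision vs. recall, more informative under class imbalance.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)

# DET: false-negative rate vs. false-positive rate, which makes the
# high-FNR regime attributed to LLMs directly visible.
det_fpr, fnr, det_thresholds = det_curve(y_true, y_score)

print(f"AUC = {roc_auc_score(y_true, y_score):.2f}")
```

Plotting each curve (e.g. with `sklearn.metrics.RocCurveDisplay` and its PR/DET counterparts) then gives the side-by-side comparison described in the abstract.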