🤖 AI Summary
This study addresses fine-grained identification of the 19 human values from Schwartz's refined motivational continuum in single sentences from news and political manifestos, where moral cues are sparse and class distributions are highly imbalanced. The authors first train a binary moral-presence classifier to decide whether a sentence expresses any value at all, then compare direct 19-way multi-label detectors with presence-gated hierarchies, combining a compact DeBERTa-base model with lightweight lexical signals (LIWC-22, eMFD, MJD) and topic features. On a single 8 GB GPU, presence gating does not improve over direct multi-label prediction; the presence classifier reaches an F1 score of 0.74, and the best supervised system, a soft-voting ensemble of signal-enriched DeBERTa models, attains a macro F1 of 0.332, surpassing the best previous English-only baseline on this corpus. The work also benchmarks 7–9B instruction-tuned LLMs under zero-shot, few-shot, and QLoRA settings, finding that they lag behind the supervised ensemble under the same hardware constraint.
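For intuition, a minimal sketch of such a presence-gated two-stage pipeline is shown below; the checkpoint paths, gate threshold, and the assumption that label index 1 means "value present" are illustrative, not the paper's released artifacts.

```python
# Minimal sketch of a presence-gated two-stage pipeline (illustrative only).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

GATE_CKPT = "path/to/deberta-base-moral-presence"   # binary presence gate (assumed path)
VALUE_CKPT = "path/to/deberta-base-19-values"       # 19-way multi-label head (assumed path)
GATE_THRESHOLD = 0.5                                 # would be calibrated on dev data

gate_tok = AutoTokenizer.from_pretrained(GATE_CKPT)
gate_model = AutoModelForSequenceClassification.from_pretrained(GATE_CKPT).eval()
value_tok = AutoTokenizer.from_pretrained(VALUE_CKPT)
value_model = AutoModelForSequenceClassification.from_pretrained(
    VALUE_CKPT, problem_type="multi_label_classification"
).eval()

@torch.no_grad()
def predict_values(sentence: str, label_thresholds: torch.Tensor) -> list[int]:
    # Stage 1: does the sentence express any value at all?
    gate_inputs = gate_tok(sentence, return_tensors="pt", truncation=True)
    p_present = torch.softmax(gate_model(**gate_inputs).logits, dim=-1)[0, 1]
    if p_present < GATE_THRESHOLD:
        return []                      # gated out: predict no value labels
    # Stage 2: 19-way multi-label prediction with per-label calibrated thresholds.
    val_inputs = value_tok(sentence, return_tensors="pt", truncation=True)
    probs = torch.sigmoid(value_model(**val_inputs).logits)[0]
    return (probs >= label_thresholds).nonzero(as_tuple=True)[0].tolist()
```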
📝 Abstract
We study sentence-level detection of the 19 human values in the refined Schwartz continuum in about 74k English sentences from news and political manifestos (ValueEval'24 corpus). Each sentence is annotated with value presence, yielding a binary moral-presence label and a 19-way multi-label task under severe class imbalance. First, we show that moral presence is learnable from single sentences: a DeBERTa-base classifier attains positive-class F1 = 0.74 with calibrated thresholds. Second, we compare direct multi-label value detectors with presence-gated hierarchies under a single 8 GB GPU budget. Under matched compute, presence gating does not improve over direct prediction, indicating that gate recall becomes a bottleneck. Third, we investigate lightweight auxiliary signals (short-range context, LIWC-22 and moral lexica, and topic features) and small ensembles. Our best supervised configuration, a soft-voting ensemble of DeBERTa-based models enriched with such signals, reaches macro-F1 = 0.332 on the 19 values, improving over the best previous English-only baseline on this corpus (macro-F1 ≈ 0.28). We additionally benchmark 7-9B instruction-tuned LLMs (Gemma 2 9B, Llama 3.1 8B, Mistral 8B, Qwen 2.5 7B) in zero-/few-shot and QLoRA setups, and find that they lag behind the supervised ensemble under the same hardware constraint. Overall, our results provide empirical guidance for building compute-efficient, value-aware NLP models under realistic GPU budgets.
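The soft-voting and per-label threshold-calibration steps can be summarized in a short sketch; the helper names, array shapes, and search grid are assumptions for illustration, not the authors' exact procedure.

```python
# Sketch of soft voting over K multi-label models plus per-label threshold
# calibration on a dev split. Shapes and the search grid are illustrative.
import numpy as np
from sklearn.metrics import f1_score

def soft_vote(prob_list: list[np.ndarray]) -> np.ndarray:
    """Average sigmoid probabilities from K models, each of shape (N, 19)."""
    return np.mean(np.stack(prob_list, axis=0), axis=0)

def calibrate_thresholds(dev_probs: np.ndarray, dev_gold: np.ndarray,
                         grid: np.ndarray = np.arange(0.05, 0.95, 0.05)) -> np.ndarray:
    """Pick, for each of the 19 labels, the decision threshold maximizing dev F1."""
    thresholds = np.full(dev_probs.shape[1], 0.5)
    for j in range(dev_probs.shape[1]):
        f1s = [f1_score(dev_gold[:, j], dev_probs[:, j] >= t, zero_division=0)
               for t in grid]
        thresholds[j] = grid[int(np.argmax(f1s))]
    return thresholds

# Usage (hypothetical arrays): average dev-set probabilities from the ensemble
# members, tune thresholds against dev gold labels, then apply them at test time:
#   ens_dev = soft_vote([probs_a_dev, probs_b_dev])
#   th = calibrate_thresholds(ens_dev, gold_dev)
#   test_preds = (soft_vote([probs_a_test, probs_b_test]) >= th).astype(int)
```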