🤖 AI Summary
This study investigates whether Schwartz’s higher-order value structure facilitates sentence-level human value detection and evaluates the efficacy of hard-gated mechanisms under stringent computational constraints. It compares directly supervised models against hard-gated hierarchical pipelines and cascaded architectures, augmented with low-cost enhancements such as lexicons, short-context features, and topic cues. Results indicate that higher-order value categories can be effectively learned from single sentences, achieving a Macro-F₁ of approximately 0.58 for the easiest bipolar pair. However, hard gating often degrades performance due to error propagation and suppressed recall. In contrast, label-wise threshold tuning (up to +0.05 Macro-F₁) and lightweight transformer ensembling (up to +0.02 Macro-F₁) yield consistent improvements. The findings suggest that while Schwartz’s higher-order structure provides a useful descriptive framework, it should not be imposed as a rigid architectural constraint.
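The hard-gating failure mode described above (error propagation and suppressed recall) can be sketched in a few lines: if the higher-order (HO) classifier predicts a parent category absent, all of its child values are forced to zero, so a single HO false negative wipes out every correct child prediction beneath it. The parent→child index map below is purely illustrative and not the paper's exact Schwartz hierarchy:

```python
import numpy as np

# Illustrative parent -> child index map (hypothetical subset,
# not the paper's actual Schwartz HO-to-value grouping).
HO_CHILDREN = {0: [0, 1],   # HO category 0 gates value labels 0 and 1
               1: [2, 3]}   # HO category 1 gates value labels 2 and 3

def hard_gate(value_probs, ho_probs, t=0.5):
    """Hard hierarchical gating: if a higher-order category is predicted
    absent (prob < t), zero out all of its child-value probabilities.

    value_probs: (n_samples, n_values) child-value probabilities
    ho_probs:    (n_samples, n_ho) higher-order probabilities
    """
    gated = value_probs.copy()
    for ho_idx, children in HO_CHILDREN.items():
        absent = ho_probs[:, ho_idx] < t          # rows where the gate closes
        # A misfired gate here suppresses recall for every child label at once.
        gated[np.ix_(absent, children)] = 0.0
    return gated
```

Because the gate is binary, there is no way for a confident child-level prediction to recover from a marginal HO miss, which is the compounding effect the study observes.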
📝 Abstract
Sentence-level human value detection is typically framed as multi-label classification over Schwartz values, but it remains unclear whether Schwartz higher-order (HO) categories provide usable structure. We study this under a strict compute-frugal budget (single 8 GB GPU) on ValueEval'24 / ValuesML (74K English sentences). We compare (i) direct supervised transformers, (ii) HO$\rightarrow$values pipelines that enforce the hierarchy with hard masks, and (iii) Presence$\rightarrow$HO$\rightarrow$values cascades, alongside low-cost add-ons (lexica, short context, topics), label-wise threshold tuning, small instruction-tuned LLM baselines ($\le$10B), QLoRA, and simple ensembles. HO categories are learnable from single sentences (e.g., the easiest bipolar pair reaches Macro-$F_1\approx0.58$), but hard hierarchical gating is not a reliable win: it often reduces end-task Macro-$F_1$ via error compounding and recall suppression. In contrast, label-wise threshold tuning is a high-leverage knob (up to $+0.05$ Macro-$F_1$), and small transformer ensembles provide the most consistent additional gains (up to $+0.02$ Macro-$F_1$). Small LLMs lag behind supervised encoders as stand-alone systems, yet can contribute complementary errors in cross-family ensembles. Overall, HO structure is useful descriptively, but enforcing it with hard gates hurts sentence-level value detection; robust improvements come from calibration and lightweight ensembling.
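Label-wise threshold tuning, identified above as the highest-leverage knob, amounts to a per-label grid search over decision thresholds on validation probabilities, keeping whichever threshold maximizes that label's F₁. The sketch below is a minimal illustration under our own assumptions (function name, grid, and tie-breaking are not from the paper):

```python
import numpy as np

def tune_thresholds(probs, labels, grid=None):
    """Per-label threshold search maximizing F1 on validation data.

    probs:  (n_samples, n_labels) predicted probabilities
    labels: (n_samples, n_labels) binary ground truth
    Returns one decision threshold per label.
    """
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # coarse grid; step 0.05
    n_labels = probs.shape[1]
    thresholds = np.full(n_labels, 0.5)
    for j in range(n_labels):
        best_f1 = -1.0
        for t in grid:
            pred = probs[:, j] >= t
            tp = np.sum(pred & (labels[:, j] == 1))
            fp = np.sum(pred & (labels[:, j] == 0))
            fn = np.sum(~pred & (labels[:, j] == 1))
            denom = 2 * tp + fp + fn
            f1 = 2 * tp / denom if denom > 0 else 0.0
            if f1 > best_f1:  # keep the first (lowest) threshold among ties
                best_f1, thresholds[j] = f1, t
    return thresholds
```

Rare labels in a skewed multi-label distribution typically end up with thresholds well below 0.5, which is why this calibration step recovers recall that a fixed 0.5 cutoff discards.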