Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking

📅 2026-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of reliable, reproducible benchmarks for subjective behavioral traits of large language models (LLMs), such as empathy and restraint. Two common validation routes fall short: human raters agree only weakly on these traits, and an LLM judge that shares the target model's training cohort makes the validation circular. The authors propose a "replication-first" evaluation paradigm that certifies the instrument through four orthogonal properties: reliability across repeated runs, replication across architecturally distinct judges, historical calibration using judges from earlier training cohorts, and pre-registered predictions. They apply it to emotional accompaniment, where a rubric that evolves from the data stabilizes at nine dimensions. Experiments across 49 models reveal regressions that aggregate scores hide; for example, GPT-5 scores 1.87 points lower than GPT-4.1 on advice-restraint. The finding replicates across a five-family judge stack and a 17-month cohort gap, and persists on 74 held-out real ESConv conversations (rho ∈ [0.749, 0.850]); the instrument reaches ordinal Krippendorff's α = 0.91.
📝 Abstract
Subjective evaluation of LLM behavior -- empathy, restraint, calibrated emotional tone -- is hard. Human inter-rater agreement on such qualities saturates near rho ~ 0.45, and an LLM-as-judge proxy alone risks circularity: a judge sharing the target's training cohort cannot independently verify it. Anchoring validity to a single human-rater consensus does not extend to capabilities where humans themselves disagree. We propose a replication-first paradigm: instead of anchoring on one rater group, we certify the instrument via four orthogonal properties -- reliability across K runs, cross-instrument replication across architecturally distinct judges, historical-footprint calibration via judges from earlier training cohorts, and pre-registered prediction. We test it on emotional accompaniment by letting the rubric evolve in a data-driven way across iterations: the dimensions are not stipulated in advance, and the procedure stabilizes at a 9-dimension set. Pre-registration applies to 10 falsifiable hypotheses and 11 forward predictions, committed before any test data was collected. Applied to 49 models across 8 families, the paradigm surfaces what aggregate scores hide. On advice-restraint -- whether a model refrains from giving unsolicited solutions in empathic contexts -- gpt-5 falls 1.87 points from gpt-4.1 and Opus-4.7 falls 0.629 from Opus-4.6, while aggregate scores stay flat. The regression survives three user-proxy swaps (95% of magnitude), replicates across a 5-family judge stack and a 17-month cohort gap, and persists on 74 held-out real ESConv conversations (rho in [0.749, 0.850]); the instrument reaches ordinal Krippendorff alpha = 0.91. As a by-product, the paradigm acts as a saturation-source diagnostic, separating instrumental ceilings (breakable by rubric refinement) from structural ceilings (needing scenario or roster intervention).
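The cross-instrument replication idea in the abstract can be illustrated with a rank-agreement check between judges. The following is a minimal sketch, not the authors' implementation: it assumes the reported rho is a Spearman rank correlation, and the function names (`average_ranks`, `spearman_rho`, `pairwise_judge_agreement`), the judge labels, and all scores are hypothetical values chosen for illustration only.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one judge gives constant scores")
    return cov / (var_x * var_y) ** 0.5


def pairwise_judge_agreement(scores_by_judge):
    """Map each pair of judge names to the rho between their score lists."""
    return {
        (a, b): spearman_rho(scores_by_judge[a], scores_by_judge[b])
        for a, b in combinations(sorted(scores_by_judge), 2)
    }


if __name__ == "__main__":
    # Hypothetical per-model scores on one rubric dimension (not paper data).
    scores = {
        "judge_a": [7.1, 5.4, 8.0, 6.2, 4.9],
        "judge_b": [6.8, 5.9, 7.7, 5.5, 5.0],
        "judge_c": [7.5, 5.0, 8.2, 6.0, 5.2],
    }
    for pair, rho in pairwise_judge_agreement(scores).items():
        print(pair, round(rho, 3))
```

In this sketch, each list holds one judge's scores for the same models in the same order; high pairwise rho across judges from different families is the kind of evidence the replication-first framing asks for, whereas agreement within a single family would say little about circularity.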
Problem

Research questions and friction points this paper is trying to address.

LLM behavioral benchmarking
subjective evaluation
replication
empathy
restraint
Innovation

Methods, ideas, or system contributions that make the work stand out.

replication-first paradigm
LLM behavioral benchmarking
cross-instrument replication
pre-registered prediction
emotional accompaniment