🤖 AI Summary
This study addresses the difficulty of separating individual leadership traits, such as entrepreneurial spirit, from generic corporate rhetoric in speeches by leaders of centrally administered Chinese state-owned enterprises, a problem compounded by the lack of systematic validity diagnostics for existing text-based measures. Rather than a new extraction model, the authors propose a label-light diagnostic that treats naturally occurring speech pairs from the same firm (24 different-speaker pairs and 5 same-speaker pairs, drawn from 80 speeches) as a natural experiment. They test whether dictionary scorers, topic models, embedding-based similarity, and a zero-shot large language model (Qwen3.5-9B) produce per-document scores that vary with leader identity when the firm is held constant. LDA fails the test (Cohen's d = 0.20), a dictionary scorer reaches d = 0.81, a Chinese sentence encoder d = 0.65, and the LLM d = 1.09. Document-level style residualisation cuts the LLM's d from 1.09 to 0.43, so roughly half of the LLM's effect is consistent with leader idiolect rather than the target construct; a separate confidence-weighted calibration step is also evaluated. The project releases a 2,190-segment scored corpus, a 170-paragraph pilot, an automatically mined slogan lexicon, and the evaluation harness.
📝 Abstract
Dictionary methods, topic models, and embedding-similarity scorers are widely used in computational social science (CSS) and management research to measure constructs such as "entrepreneurial spirit" in corporate speeches. We contribute a label-light measurement diagnostic for such instruments rather than a new extraction model. On a corpus of 80 speeches by leaders of centrally administered Chinese state-owned enterprises, we exploit a natural experiment of 24 same-company different-speaker pairs and 5 same-company same-speaker pairs to test whether a method's per-document indices vary with leader identity while holding the firm constant. LDA fails (Cohen's d=0.20, 95% CI [-0.72, 1.20]); a dictionary scorer reaches d=0.81 and a Chinese sentence encoder d=0.65 on doc-vector distances of order 10^-3. A zero-shot 9B open-weight LLM (Qwen3.5:9b) raises paired-contrast d to 1.09 (exact permutation p1=0.034). We downgrade three claims accordingly: gold F1 measures consistency with the LLM's own prompt rule rather than external construct recovery; doc-level style residualisation cuts the LLM's d to 0.43 (p1=0.22), so roughly half of the effect is consistent with leader idiolect; and a confidence-weighted calibration trades Delta for variance, with an auto-mined slogan lexicon near-inert in ablation. We release the 2,190-segment scored corpus, the 170-paragraph pilot, the slogan lexicon, two-family LLM scores, and the evaluation harness.
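To make the paired contrast concrete, the following is a minimal sketch and not the released evaluation harness. It assumes each pair is summarised by one number (for example, the absolute difference between the two speeches' scores), that Cohen's d uses the pooled standard deviation, and that the exact permutation test is one-sided and enumerates every reassignment of pairs to the different-speaker group. The function names and the toy values are hypothetical; the abstract does not specify the exact estimator.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def cohens_d(diff_speaker, same_speaker):
    """Cohen's d with a pooled standard deviation (assumed estimator)."""
    na, nb = len(diff_speaker), len(same_speaker)
    pooled_var = (
        (na - 1) * variance(diff_speaker) + (nb - 1) * variance(same_speaker)
    ) / (na + nb - 2)
    return (mean(diff_speaker) - mean(same_speaker)) / sqrt(pooled_var)


def exact_permutation_p(diff_speaker, same_speaker):
    """One-sided exact permutation p-value for the difference in group means.

    Enumerates every way of choosing which pairs are labelled
    'different speaker'; with 24 + 5 pairs that is C(29, 24) = 118,755 splits.
    """
    pooled = list(diff_speaker) + list(same_speaker)
    n, k = len(pooled), len(diff_speaker)
    total = sum(pooled)
    observed = mean(diff_speaker) - mean(same_speaker)
    hits = count = 0
    for idx in combinations(range(n), k):
        group_sum = sum(pooled[i] for i in idx)
        diff = group_sum / k - (total - group_sum) / (n - k)
        # Small tolerance so ties with the observed split count as hits.
        hits += diff >= observed - 1e-12
        count += 1
    return hits / count


if __name__ == "__main__":
    # Toy, made-up per-pair score gaps; NOT values from the corpus.
    different = [0.42, 0.35, 0.51, 0.29, 0.47, 0.38]
    same = [0.21, 0.30, 0.18]
    print("d =", round(cohens_d(different, same), 2))
    print("p =", round(exact_permutation_p(different, same), 3))
```

Under this reading, a method that tracks leader identity should show larger gaps for different-speaker pairs than for same-speaker pairs, giving a positive d. With only 5 same-speaker pairs the estimate is noisy, which fits the wide confidence interval the abstract reports for LDA.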