🤖 AI Summary
Prior pragmatics and LLM research predominantly examines cooperative communication, neglecting systematic evaluation of strategic language understanding in high-stakes, non-cooperative contexts such as courtroom cross-examinations. Method: We propose CoBRA (Cooperation-Breach Response Assessment), the first framework to quantitatively assess LLMs' pragmatic competence in adversarial discourse, introducing three interpretable metrics -- Benefit at Turn (BaT), Penalty at Turn (PaT), and Normalized Relative Benefit at Turn (NRBaT) -- and curating CHARM, the first annotated dataset of real-world courtroom cross-examinations. Contribution/Results: We find that mainstream LLMs exhibit pervasive deficits in strategic pragmatic understanding; counterintuitively, reasoning-augmented models show significantly degraded performance, while scaling model size yields only marginal gains. These results uncover a potential tension between pragmatic and logical reasoning capabilities in LLMs, providing both theoretical insight and an empirical benchmark for developing trustworthy language models in non-cooperative settings.
📝 Abstract
Language is often used strategically, particularly in high-stakes, adversarial settings, yet most work on pragmatics and LLMs centers on cooperativity. This leaves a gap in the systematic understanding of non-cooperative discourse. To address this, we introduce CoBRA (Cooperation-Breach Response Assessment), along with three interpretable metrics -- Benefit at Turn (BaT), Penalty at Turn (PaT), and Normalized Relative Benefit at Turn (NRBaT) -- to quantify the perceived strategic effects of discourse moves. We also present CHARM, an annotated dataset of real courtroom cross-examinations, to demonstrate the framework's effectiveness. Using these tools, we evaluate a range of LLMs and show that they generally exhibit limited pragmatic understanding of strategic language. While larger model size is associated with better performance on our metrics, reasoning ability does not help and largely hurts, introducing overcomplication and internal confusion.
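The abstract names the three metrics but does not give their formulas. The sketch below is purely illustrative and rests on assumptions not confirmed by the paper: that each turn receives judged benefit and penalty scores, that BaT and PaT are per-turn means of those scores, and that NRBaT is a normalized difference of the two. The `TurnJudgment`, `bat`, `pat`, and `nrbat` names are hypothetical, introduced only for this example.

```python
# Hypothetical sketch of turn-level metric aggregation for CoBRA-style scores.
# The formulas below are assumptions for illustration, not the paper's method.
from dataclasses import dataclass


@dataclass
class TurnJudgment:
    benefit: float  # assumed judged strategic benefit of a discourse move, in [0, 1]
    penalty: float  # assumed judged strategic penalty of a discourse move, in [0, 1]


def bat(turns: list[TurnJudgment]) -> float:
    """Assumed Benefit at Turn: mean judged benefit across turns."""
    return sum(t.benefit for t in turns) / len(turns)


def pat(turns: list[TurnJudgment]) -> float:
    """Assumed Penalty at Turn: mean judged penalty across turns."""
    return sum(t.penalty for t in turns) / len(turns)


def nrbat(turns: list[TurnJudgment]) -> float:
    """Assumed Normalized Relative Benefit at Turn: (BaT - PaT) / (BaT + PaT), in [-1, 1]."""
    b, p = bat(turns), pat(turns)
    return 0.0 if b + p == 0 else (b - p) / (b + p)


# Example: two cross-examination turns, each with judged benefit/penalty scores.
turns = [TurnJudgment(benefit=0.8, penalty=0.1), TurnJudgment(benefit=0.4, penalty=0.5)]
print(f"BaT={bat(turns):.2f}  PaT={pat(turns):.2f}  NRBaT={nrbat(turns):.2f}")
```

Under these assumptions, a positive NRBaT indicates that a model's discourse moves are judged net-beneficial on balance, while a negative value indicates net-harmful moves; the normalization makes scores comparable across dialogues of different lengths.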