AI Summary
This work addresses the common practice in natural language processing of discarding contradictions as errors, thereby overlooking humans' ability to reconcile conflicting statements through explanatory reasoning. To bridge this gap, we propose the task of reconciliatory explanation generation, which requires models to produce plausible explanations that render contradictory statements compatible. We provide the first systematic formulation and evaluation of large language models (LLMs) on this task, repurposing existing natural language inference (NLI) datasets to construct a new benchmark and designing scalable automatic metrics for assessment. Experiments across 18 LLMs reveal limited overall performance, and the gains from extended test-time "thinking" plateau as model size increases.
Abstract
Existing NLP work commonly treats contradictions as errors to be resolved by choosing which statements to accept or discard. Yet a key aspect of human reasoning in social interactions and professional domains is the ability to hypothesize explanations that reconcile contradictions. For example, "Cassie hates coffee" and "She buys coffee every day" may appear contradictory, yet both are compatible if Cassie has the unenviable daily chore of buying coffee for all her coworkers. Despite the growing reasoning capabilities of large language models (LLMs), their ability to hypothesize such reconciliatory explanations remains largely unexplored. To address this gap, we introduce the task of reconciliatory explanation generation, where models must generate explanations that render contradictory statements compatible. We propose a novel method for repurposing existing natural language inference (NLI) datasets, and introduce quality metrics that enable scalable automatic evaluation. Experiments with 18 LLMs show that most models achieve limited success on this task, and that the benefit of extending test-time compute by "thinking" plateaus as model size increases. Our results highlight an under-explored dimension of LLM reasoning and the need to address this limitation to enhance LLMs' downstream applications such as chatbots and scientific aids.
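As a rough, hypothetical illustration of the task setup (not the paper's released code or its proposed metrics), the sketch below takes an NLI pair labeled as a contradiction, frames it as a reconciliatory-explanation query, and uses an off-the-shelf NLI classifier as a crude compatibility check. The helper names (`build_prompt`, `generate_explanation`) and the check itself are assumptions made for illustration.

```python
# Hypothetical sketch: repurposing an NLI "contradiction" pair for
# reconciliatory explanation generation, with a rough automatic check.
# This is an illustrative stand-in, not the paper's benchmark or metrics.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_MODEL = "roberta-large-mnli"  # off-the-shelf NLI model, used only for the check
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL)

def nli_label(premise: str, hypothesis: str) -> str:
    """Return the NLI label (CONTRADICTION / NEUTRAL / ENTAILMENT) for a pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    return nli_model.config.id2label[int(logits.argmax(dim=-1))]

def build_prompt(statement_a: str, statement_b: str) -> str:
    """Frame a contradictory NLI pair as a reconciliatory-explanation query."""
    return (
        f'Statement A: "{statement_a}"\n'
        f'Statement B: "{statement_b}"\n'
        "These statements appear to contradict each other. "
        "Write a brief, plausible explanation under which both are true."
    )

def generate_explanation(prompt: str) -> str:
    """Placeholder for an LLM call; swap in whatever model or API you use."""
    raise NotImplementedError

# Example pair taken from the abstract.
a = "Cassie hates coffee."
b = "She buys coffee every day."
prompt = build_prompt(a, b)
# explanation = generate_explanation(prompt)
explanation = "Cassie buys coffee every day only because she fetches it for her coworkers."

# Crude compatibility check: with the explanation prepended as context,
# the pair should no longer be judged a contradiction.
print(nli_label(f"{explanation} {a}", b))  # ideally NEUTRAL or ENTAILMENT
```

In this toy check, an explanation "succeeds" if the NLI classifier no longer labels the contextualized pair as a contradiction; the paper's actual evaluation metrics are more involved and should be consulted directly.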