🤖 AI Summary
This work proposes the first multi-label commonsense reasoning benchmark grounded in logical operators—specifically AND, OR, and NEITHER/NOR—addressing a limitation of existing benchmarks: single-label evaluation fails to capture logical relationships among atomic statements. The task is reformulated as judging the validity of logical combinations over pairs of statements. Using zero-shot, few-shot, and chain-of-thought prompting, the study systematically evaluates a range of language models across these reasoning settings. Experimental results show that models perform relatively well on conjunctive (AND) reasoning, moderately on disjunctive (OR) reasoning, and significantly worse when negation is involved, exposing a critical weakness in handling negation within complex logical structures.
📝 Abstract
Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that reframes commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive reasoning and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.
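To make the task format concrete, here is a minimal sketch of how composing plausibility labels over a pair of atomic statements might work. The instance schema, field names, and example statements below are illustrative assumptions, not the benchmark's actual data format.

```python
# Hypothetical instance: two atomic statements with gold plausibility
# judgments, plus a logical operator to evaluate over the pair.
instance = {
    "statement_a": "People usually wear coats in winter.",
    "statement_b": "People usually wear swimsuits in winter.",
    "plausible_a": True,
    "plausible_b": False,
}

def compose(operator: str, plausible_a: bool, plausible_b: bool) -> bool:
    """Return whether the logical combination of two plausibility
    judgments holds, per the three operators named in the abstract."""
    if operator == "AND":          # both statements must be plausible
        return plausible_a and plausible_b
    if operator == "OR":           # at least one statement is plausible
        return plausible_a or plausible_b
    if operator == "NEITHER/NOR":  # both statements must be implausible
        return (not plausible_a) and (not plausible_b)
    raise ValueError(f"unknown operator: {operator}")

# For the instance above: AND fails, OR holds, NEITHER/NOR fails.
for op in ("AND", "OR", "NEITHER/NOR"):
    print(op, compose(op, instance["plausible_a"], instance["plausible_b"]))
```

Under this reading, a model is scored on whether it judges the composed label correctly, which is where the abstract reports the sharp degradation on the negation-based NEITHER/NOR case.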