🤖 AI Summary
This work systematically evaluates the abstract commonsense reasoning capabilities of large language models (LLMs), benchmarking them against human-level understanding. To this end, the authors construct a standardized evaluation benchmark grounded in the ConceptNet knowledge graph and propose two prompting strategies: instruction-based prompting, which leverages semantic definitions of relations to predict relational links, and few-shot prompting, which uses exemplars to guide relation identification. Experimental results reveal a critical bottleneck: LLM performance degrades significantly when models must predict a single relation. Incorporating selective relation constraints and retrieval-augmented prompting mitigates model bias and improves accuracy. Validation on GPT-4o-mini shows marked accuracy gains under five-choice settings with few-shot prompting, though persistent relation-preference biases remain. The authors position this as the first study to expose structural limitations of LLMs in abstract commonsense reasoning, and they introduce a paradigm for commonsense-aware prompt engineering.
📝 Abstract
Large language models (LLMs) have achieved remarkable performance in generating human-like text and solving reasoning tasks of moderate complexity, such as question answering and mathematical problem solving. However, their capabilities in tasks requiring deeper cognitive skills, such as common-sense understanding and abstract reasoning, remain under-explored. In this paper, we systematically evaluate abstract common-sense reasoning in LLMs using the ConceptNet knowledge graph. We propose two prompting approaches: instruct prompting, where models predict plausible semantic relationships based on provided definitions, and few-shot prompting, where models identify relations using examples as guidance. Our experiments with the gpt-4o-mini model show that instruct prompting yields consistent performance when the model ranks multiple relations, but accuracy declines substantially when the model is restricted to predicting a single relation. In few-shot prompting, accuracy improves significantly when the model selects from five relations rather than the full set, although with notable bias toward certain relations. These results indicate that even commercially deployed LLMs still fall well short of human-level abstract common-sense reasoning. At the same time, the findings highlight the promise of careful prompt engineering, based on selective retrieval, for obtaining better performance.
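The two prompting approaches described above can be sketched as simple prompt builders. This is a minimal illustration, not the paper's actual templates: the function names, the exact wording, and the five-relation shortlist are assumptions, though the relation labels themselves (`IsA`, `UsedFor`, etc.) are real ConceptNet relations.

```python
# Illustrative sketch of the two prompting styles over ConceptNet triples.
# The prompt wording and the five-relation shortlist are hypothetical,
# not the paper's verbatim setup.

# A small shortlist of real ConceptNet relations (the "five-choice" setting).
CONCEPTNET_RELATIONS = ["IsA", "UsedFor", "CapableOf", "AtLocation", "PartOf"]

def instruct_prompt(head, tail, relations=CONCEPTNET_RELATIONS):
    """Instruct prompting: ask the model to rank plausible relations
    linking a head concept to a tail concept."""
    options = "\n".join(f"- {r}" for r in relations)
    return (
        f"Rank the ConceptNet relations below by how plausibly "
        f"they link '{head}' to '{tail}':\n{options}"
    )

def few_shot_prompt(head, tail, examples, relations=CONCEPTNET_RELATIONS):
    """Few-shot prompting: show solved (head, tail, relation) triples
    as guidance, then pose the query pair."""
    shots = "\n".join(f"{h} -> {t}: {r}" for h, t, r in examples)
    options = ", ".join(relations)
    return (
        f"Choose one relation from [{options}] for each pair.\n"
        f"{shots}\n{head} -> {tail}:"
    )
```

Restricting `relations` to a shortlist of five, rather than the full ConceptNet inventory, corresponds to the constrained setting in which the paper reports the largest accuracy gains.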