🤖 AI Summary
This work investigates whether large language models (LLMs) can genuinely perform analogical reasoning, focusing on one central question: are self-generated relevant examples more effective than irrelevant ones? Systematic evaluation across a diverse set of reasoning tasks, including GSM8K, shows that the **accuracy** of self-generated examples, not their relevance, is the key factor in reasoning performance: on certain tasks random examples match or beat relevant ones, e.g., a 4% boost on GSM8K with random biological examples. Building on this finding, we propose two novel methods: (1) an LLM-based context-aware mechanism for selecting accurate in-context examples, and (2) a lightweight reasoning framework integrating ablation analysis with inference cost optimization. Experiments demonstrate that our approach maintains or improves task performance while substantially reducing inference costs. These findings challenge conventional assumptions about analogical reasoning in LLMs and provide both mechanistic insight and practical tools for efficient, accuracy-driven reasoning.
📝 Abstract
Analogical reasoning is a unique ability of humans to address unfamiliar challenges by transferring strategies from relevant past experiences. One key finding in psychology is that, compared with irrelevant past experiences, recalling relevant ones can help humans better handle new tasks. Coincidentally, the NLP community has also recently found that self-generating relevant examples in the context can help large language models (LLMs) solve a given problem better than hand-crafted prompts. However, it is not yet clear whether relevance is the key factor eliciting such capability, i.e., can LLMs benefit more from self-generated relevant examples than from irrelevant ones? In this work, we systematically explore whether LLMs can truly perform analogical reasoning on a diverse set of reasoning tasks. With extensive experiments and analysis, we show that self-generated random examples can surprisingly achieve comparable or even better performance on certain tasks, e.g., a 4% performance boost on GSM8K with random biological examples. We find that the accuracy of self-generated examples is the key factor and subsequently design two novel methods with improved performance and significantly reduced inference costs. Overall, we aim to advance a deeper understanding of LLM analogical reasoning and hope this work stimulates further research in the design of self-generated contexts.
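The comparison between relevant and random self-generated examples can be pictured as two prompt variants that differ only in what the model is asked to recall before solving the target problem. The sketch below is illustrative only: the function name `build_prompt`, its parameters, and the prompt wording are assumptions made for this example, not the prompts used in the paper.

```python
def build_prompt(problem: str, mode: str = "relevant",
                 n_examples: int = 3, random_topic: str = "biology") -> str:
    """Build a self-generated-example prompt (hypothetical wording)."""
    if mode == "relevant":
        # Analogical setting: the model recalls problems related to the target.
        recall = (f"Recall {n_examples} problems that are relevant to the "
                  "problem below. State each one and solve it step by step.")
    elif mode == "random":
        # Control setting: the model writes examples unrelated to the target.
        recall = (f"Write {n_examples} random {random_topic} problems that are "
                  "unrelated to the problem below. State each one and solve it "
                  "step by step.")
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return (f"{recall}\n\n"
            "Then solve the following problem step by step.\n\n"
            f"Problem: {problem}")


if __name__ == "__main__":
    question = "A farmer has 12 eggs and sells 5. How many are left?"
    print(build_prompt(question, mode="relevant"))
    print("-" * 40)
    print(build_prompt(question, mode="random"))
```

Holding everything else fixed and changing only the recall instruction is what lets a difference in downstream accuracy be attributed to the relevance of the generated examples rather than to other prompt changes.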