AI Summary
Large language models (LLMs) still fall short of humans at generating high-quality instructional analogies. This work proposes a modular analogy generation pipeline grounded in Structure Mapping Theory that uses sub-concepts as anchoring points and decomposes the task into four stages: source finding, sub-concept generation, explanation generation, and evaluation. Experiments cover 12 LLMs from six model families and seven embedding models on the SCAR and ParallelPARC datasets. Sub-concepts substantially improve explanation quality and closed-setting retrieval precision, but provide limited benefit in open-ended source generation. An LLM-as-a-judge evaluation, validated against annotations from seven human annotators, shows that Claude Sonnet 4.6 aligns more reliably with human rankings than with fine-grained absolute scores.
Abstract
Analogies help learners understand unfamiliar concepts by relating them to known concepts. Despite recent advances, large language models (LLMs) continue to struggle to generate analogies comparable in quality to those produced by humans. We present a modular pipeline for educational analogy generation, decomposing the task into four stages: source finding, sub-concept generation, explanation generation, and evaluation. Grounded in Structure Mapping Theory, the pipeline enables systematic, stage-by-stage analysis of how model choice and input configuration affect analogy quality. We evaluate 12 state-of-the-art LLMs across six model families on two datasets with structured sub-concept annotations (SCAR and ParallelPARC), alongside seven embedding models for closed-setting retrieval. Our results show that sub-concepts substantially improve explanation quality and closed-setting retrieval precision but provide limited benefit in open-ended source generation. We further introduce an LLM-as-a-judge evaluation methodology and validate its scoring against human annotations from seven annotators, finding that Claude Sonnet 4.6 aligns more reliably with human rankings than with fine-grained absolute scores. Taken together, our findings reveal cross-stage interactions that isolated studies cannot capture, and highlight sub-concept grounding as a key driver of analogy quality.
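The distinction between agreeing on rankings and agreeing on absolute scores can be illustrated with a small sketch. The scores below are made-up numbers, not results from the paper, and the helper names (`ranks`, `spearman`, `mean_abs_error`) are hypothetical. Spearman rank correlation and mean absolute error are used here only as plausible stand-ins, since the abstract does not name the agreement metrics the authors used.

```python
def ranks(values):
    """Return 1-based ranks (1 = smallest). Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def mean_abs_error(x, y):
    """Average absolute gap between paired scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical 1-5 quality scores for five analogies (illustrative only).
human = [4.5, 2.0, 3.5, 1.0, 5.0]
judge = [3.0, 1.5, 2.5, 1.2, 3.2]  # same ordering, compressed scale

print("rank agreement:", spearman(human, judge))
print("absolute score gap:", mean_abs_error(human, judge))
```

In this toy case the judge orders the five analogies exactly as the humans do, so the rank correlation is perfect, yet its scores sit about one point away from the human scores on average. A judge of this kind would be more trustworthy for comparing candidate analogies than for assigning calibrated scores.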