What Would You Ask When You First Saw a²+b²=c²? Evaluating LLM on Curiosity-Driven Questioning

📅 2024-09-19
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of quantitatively evaluating large language models' (LLMs) potential for autonomous knowledge acquisition, specifically their curiosity-driven ability to formulate probing questions when confronted with novel scientific statements, including erroneous ones. Method: the authors introduce the first curiosity-driven questioning framework for assessing LLM knowledge acquisition. It uses a 1,968-statement interdisciplinary synthetic dataset, prompt engineering with controlled ablation studies, and multi-dimensional automated scoring (depth, feasibility, novelty) validated against human evaluation (weighted Cohen's kappa ≈ 0.7). Contribution/Results: contrary to the assumption that bigger is better, the compact Phi-2 model matches or surpasses GPT-4 in question quality, indicating that parameter count alone does not determine knowledge acquisition potential. The framework offers a quantifiable, optimization-friendly paradigm for assessing autonomous learning capacity, with its reliability and validity empirically confirmed, providing both theoretical grounding and empirical evidence for developing intrinsically motivated AI systems.

📝 Abstract
Large language models (LLMs) can store a massive amount of knowledge, yet their potential to acquire new knowledge remains unknown. We propose a novel framework that evaluates this capability. This framework prompts LLMs to generate questions about a statement introducing scientific knowledge, simulating a curious person facing the statement for the first time. We score the qualities of the generated questions, thereby evaluating the knowledge acquisition potential of the LLM. We apply controlled ablation studies to validate our scoring procedures. Additionally, we created a synthetic dataset consisting of 1101 statements in physics, chemistry, and maths with distinct levels of difficulty, 300 general knowledge statements, and 567 incorrect statements. Human evaluations were conducted to validate our model assessments, achieving an approximate weighted Cohen's kappa of 0.7 on all three metrics considered. We find that while large models like GPT-4 and Mistral 8x7b are adept at generating coherent and relevant questions, the smaller Phi-2 model is equally or more effective. This indicates that size does not solely determine a model's knowledge acquisition potential. The proposed framework quantifies a critical model capability that was commonly overlooked and opens up research opportunities for developing more knowledgeable AI systems.
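The pipeline the abstract describes (prompt a model with a statement, collect its questions, score each question on depth, feasibility, and novelty, then aggregate) can be sketched as below. This is a minimal illustration, not the paper's implementation: the prompt wording, the function names, and the 1–5 scale are assumptions, and the generator and scorer are placeholders standing in for actual LLM calls.

```python
# Hypothetical sketch of the curiosity-questioning evaluation loop.
# `generate_questions` and `score_question` are stand-ins for LLM calls;
# the prompt text and 1-5 scoring scale are assumptions, not the paper's.
from dataclasses import dataclass
from statistics import mean

PROMPT_TEMPLATE = (
    "You are seeing the following statement for the first time:\n"
    "{statement}\n"
    "Ask the questions a curious person would ask about it."
)

@dataclass
class QuestionScores:
    depth: float        # how far beyond the statement the question probes
    feasibility: float  # whether the question can realistically be answered
    novelty: float      # whether it goes beyond restating the statement

def generate_questions(statement: str) -> list[str]:
    """Placeholder for an LLM call using PROMPT_TEMPLATE.format(statement=...)."""
    return [f"Why does '{statement}' hold?", f"When does '{statement}' fail?"]

def score_question(question: str) -> QuestionScores:
    """Placeholder for automated (e.g. LLM-judge) scoring on a 1-5 scale."""
    return QuestionScores(depth=3.0, feasibility=4.0, novelty=3.0)

def evaluate_statement(statement: str) -> dict[str, float]:
    """Average each scoring dimension over all questions asked about a statement."""
    scores = [score_question(q) for q in generate_questions(statement)]
    return {
        "depth": mean(s.depth for s in scores),
        "feasibility": mean(s.feasibility for s in scores),
        "novelty": mean(s.novelty for s in scores),
    }
```

Per-statement dictionaries like this can then be averaged over the whole dataset to compare models.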
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to generate curiosity-driven questions
Assessing knowledge acquisition potential via question quality scoring
Comparing the impact of model size on knowledge acquisition effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel framework for evaluating LLM knowledge acquisition potential
Synthetic dataset of 1,968 statements (1,101 scientific, 300 general knowledge, 567 incorrect)
Human validation with weighted Cohen's kappa ≈ 0.7
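The human-validation figure above is a weighted Cohen's kappa of roughly 0.7. As a minimal illustration of that agreement statistic, here is a from-scratch quadratic-weighted kappa for two raters over ordinal categories; the paper does not state its weighting scheme, so quadratic weights are an assumption:

```python
# Quadratic-weighted Cohen's kappa, computed from first principles.
# kappa_w = 1 - sum(w_ij * O_ij) / sum(w_ij * E_ij), with
# w_ij = (i - j)^2 / (k - 1)^2 penalizing larger ordinal disagreements more.
from collections import Counter

def weighted_kappa(a, b, categories):
    """Agreement between rating lists `a` and `b` over ordered `categories`."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(a)
    # Observed co-occurrence counts of (rating_a, rating_b) pairs.
    O = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        O[idx[x]][idx[y]] += 1
    # Expected counts under independence, from each rater's marginals.
    pa, pb = Counter(a), Counter(b)
    E = [[pa[categories[i]] * pb[categories[j]] / n for j in range(k)]
         for i in range(k)]
    # Quadratic disagreement weights.
    w = [[(i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]
    num = sum(w[i][j] * O[i][j] for i in range(k) for j in range(k))
    den = sum(w[i][j] * E[i][j] for i in range(k) for j in range(k))
    return 1.0 - num / den
```

Perfect agreement yields 1.0, chance-level agreement yields about 0.0, and values near 0.7 (as reported) indicate substantial agreement between the automated scores and human raters.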