🤖 AI Summary
Existing benchmarks inadequately assess large language models' (LLMs) capabilities on long-tail knowledge and domain-specific reasoning. To address this, we introduce LPFQA, the first long-tail knowledge evaluation benchmark grounded in authentic professional forums, spanning 20 academic and industrial domains and 502 real-world tasks. Our methodology features four innovations: (1) a fine-grained, multi-dimensional evaluation framework; (2) hierarchical difficulty design; (3) realistic scenario modeling; and (4) cross-disciplinary knowledge integration, all supported by expert annotation and stratified data construction that ensure semantic clarity and answer uniqueness. Evaluating 12 state-of-the-art LLMs reveals substantial performance divergence on professional reasoning tasks, exposing systemic deficits in deep domain expertise and complex cross-disciplinary reasoning. LPFQA establishes a reproducible, highly discriminative evaluation paradigm and provides concrete, actionable directions for advancing LLM capabilities in specialized domains.
📝 Abstract
Large Language Models (LLMs) have made rapid progress in reasoning, question answering, and professional applications; however, their true capabilities remain difficult to evaluate using existing benchmarks. Current datasets often focus on simplified tasks or artificial scenarios, overlooking long-tail knowledge and the complexities of real-world applications. To bridge this gap, we propose LPFQA, a long-tail knowledge-based benchmark derived from authentic professional forums across 20 academic and industrial fields, covering 502 tasks grounded in practical expertise. LPFQA introduces four key innovations: fine-grained evaluation dimensions that target knowledge depth, reasoning, terminology comprehension, and contextual analysis; a hierarchical difficulty structure that ensures semantic clarity and unique answers; authentic professional scenario modeling with realistic user personas; and interdisciplinary knowledge integration across diverse domains. We evaluated 12 mainstream LLMs on LPFQA and observed significant performance disparities, especially in specialized reasoning tasks. LPFQA provides a robust, authentic, and discriminative benchmark for advancing LLM evaluation and guiding future model development.
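The abstract describes a fine-grained, multi-dimensional evaluation over tasks organized by domain and difficulty, but it does not specify a data schema or scoring pipeline. The sketch below is a hypothetical illustration of how per-dimension aggregation over LPFQA-style results could be organized; the `Task` fields, the [0, 1] score scale, and the aggregation rule are assumptions, not the authors' released code. Only the four dimension names (knowledge depth, reasoning, terminology comprehension, contextual analysis) are taken from the abstract.

```python
# Hypothetical sketch, not LPFQA's actual evaluation code.
from dataclasses import dataclass, field
from statistics import mean

# Dimension names follow the abstract; everything else here is an assumption.
DIMENSIONS = ("knowledge_depth", "reasoning", "terminology", "contextual_analysis")


@dataclass
class Task:
    domain: str           # one of the 20 academic/industrial fields (placeholder here)
    difficulty: int       # hierarchical difficulty level (assumed: 1 = easiest)
    question: str
    reference_answer: str


@dataclass
class Judgment:
    # Per-dimension scores in [0, 1], e.g. assigned by expert or LLM-as-judge grading.
    scores: dict = field(default_factory=dict)


def aggregate(results: list[tuple[Task, Judgment]]) -> dict:
    """Average each evaluation dimension overall, plus a mean score per difficulty level."""
    report = {dim: mean(j.scores.get(dim, 0.0) for _, j in results) for dim in DIMENSIONS}
    by_level: dict[int, list[float]] = {}
    for task, judgment in results:
        by_level.setdefault(task.difficulty, []).append(mean(judgment.scores.values()))
    report["per_difficulty"] = {lvl: mean(vals) for lvl, vals in sorted(by_level.items())}
    return report


if __name__ == "__main__":
    # Two made-up results, only to show the aggregation shape.
    demo = [
        (Task("example_domain_1", 2, "…", "…"),
         Judgment({"knowledge_depth": 0.6, "reasoning": 0.4,
                   "terminology": 0.8, "contextual_analysis": 0.5})),
        (Task("example_domain_2", 3, "…", "…"),
         Judgment({"knowledge_depth": 0.3, "reasoning": 0.2,
                   "terminology": 0.5, "contextual_analysis": 0.4})),
    ]
    print(aggregate(demo))
```

Reporting per-dimension and per-difficulty averages, rather than a single accuracy number, is one way a benchmark like this could surface the performance disparities the abstract highlights, particularly on specialized reasoning tasks.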