The Rarity Blind Spot: A Framework for Evaluating Statistical Reasoning in LLMs

📅 2025-08-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM benchmarks predominantly emphasize information retrieval and summarization, overlooking models' ability to perform statistical reasoning and detect globally rare features (e.g., those occurring in fewer than 10% of documents) within document collections. Method: We propose the Distinctive Feature Mining (DFM) task, the first systematic evaluation of LLMs' capability to identify statistically salient features in small-to-medium document sets (10-40 documents), and introduce DiFBench, a configurable benchmark enabling multidimensional control over document scale, salience thresholds, and feature rarity. Contribution/Results: Extensive evaluation of 10 state-of-the-art models reveals that reasoning-enhanced models outperform general-purpose ones, yet all exhibit substantial performance degradation as document count or task complexity increases. Critically, models consistently misclassify high-frequency features as salient. This work uncovers a fundamental blind spot in LLMs' statistical reasoning and establishes a new paradigm for evaluating models on differential analysis tasks.

📝 Abstract
Effective decision-making often relies on identifying what makes each candidate distinctive. While existing benchmarks for LLMs emphasize retrieving or summarizing information relevant to a given query, they do not evaluate a model's ability to identify globally distinctive features across a set of documents. We introduce Distinctive Feature Mining (DFM), a new task that challenges models to analyze a small-to-medium collection (10-40 documents) and surface features that are rare in the global context (e.g., appearing in less than 10% of documents). This setting mirrors real-world scenarios such as candidate selection or product differentiation, where statistical reasoning, not retrieval, is key. To enable systematic evaluation of this capability, we present DiFBench, a configurable benchmark creation framework with controllable parameters such as document set size and distinctiveness thresholds. Using DiFBench, we perform a large-scale assessment of distinctive feature mining across ten state-of-the-art LLMs. Our findings reveal a significant performance gap between general-purpose and reasoning-enhanced models. The performance of all models, however, degrades substantially as task complexity and document count increase. We also find that a common failure mode is misidentifying frequent features as distinctive. These insights reveal core limitations in contemporary LLMs' abilities to perform fine-grained statistical reasoning and rarity detection.
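To make the task concrete, the rarity criterion the abstract describes can be sketched as a simple document-frequency computation. This is a minimal illustration, not the paper's evaluation pipeline: it assumes each document has already been reduced to a set of feature strings (a step the benchmark controls, not shown here), and flags features whose global frequency falls below the distinctiveness threshold.

```python
from collections import Counter

def mine_distinctive_features(doc_features, threshold=0.10):
    """Return features whose document frequency is below `threshold`.

    doc_features: list of sets, one set of feature strings per document.
    Returns a dict mapping each rare feature to its document frequency.
    """
    n_docs = len(doc_features)
    # Count in how many documents each feature appears.
    counts = Counter(f for feats in doc_features for f in feats)
    return {f: c / n_docs for f, c in counts.items() if c / n_docs < threshold}

# Toy collection of 20 documents: "rare" appears in only 1 (5% < 10%),
# while "a" and "b" are frequent and must not be surfaced.
docs = [{"a", "b"}] * 19 + [{"a", "rare"}]
print(mine_distinctive_features(docs))  # → {'rare': 0.05}
```

The reported failure mode, surfacing frequent features as distinctive, corresponds to ignoring the threshold comparison in the last line: the ground truth is purely a counting exercise, which is what makes it a clean probe of statistical rather than retrieval ability.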
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to identify globally distinctive features
Assessing statistical reasoning for rarity detection across documents
Testing models on fine-grained distinctive feature mining tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distinctive Feature Mining task for rarity detection
DiFBench benchmark with configurable parameters
Evaluates statistical reasoning across document sets