🤖 AI Summary
Existing LLM benchmarks predominantly emphasize information retrieval and summarization, overlooking models' ability to perform statistical reasoning and detect globally rare features—e.g., those occurring in <10% of documents—within document collections. Method: We propose the Distinctive Feature Mining (DFM) task, the first systematic evaluation of LLMs' capability to identify statistically salient features in small-to-medium document sets (10–40 documents), and introduce DiFBench—a configurable benchmark enabling multidimensional control over document scale, salience thresholds, and feature rarity. Contribution/Results: Extensive evaluation across 10 state-of-the-art models reveals that reasoning-enhanced models outperform general-purpose ones, yet all exhibit substantial performance degradation as document count or task complexity increases. Critically, models consistently misclassify high-frequency features as salient. This work uncovers a fundamental blind spot in LLMs' statistical reasoning and establishes a novel paradigm for evaluating models on differential analysis tasks.
📝 Abstract
Effective decision-making often relies on identifying what makes each candidate distinctive. While existing benchmarks for LLMs emphasize retrieving or summarizing information relevant to a given query, they do not evaluate a model's ability to identify globally distinctive features across a set of documents. We introduce Distinctive Feature Mining (DFM), a new task that challenges models to analyze a small-to-medium collection (10–40 documents) and surface features that are rare in the global context (e.g., appearing in fewer than 10% of documents). This setting mirrors real-world scenarios such as candidate selection or product differentiation, where statistical reasoning, not retrieval, is key. To enable systematic evaluation of this capability, we present DiFBench, a configurable benchmark creation framework with controllable parameters such as document set size and distinctiveness thresholds. Using DiFBench, we perform a large-scale assessment of distinctive feature mining across ten state-of-the-art LLMs. Our findings reveal a significant performance gap between general-purpose and reasoning-enhanced models. The performance of all models, however, degrades substantially as task complexity and document count increase. We also find that a common failure mode is misidentifying frequent features as distinctive. These insights reveal core limitations in contemporary LLMs' abilities to perform fine-grained statistical reasoning and rarity detection.
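To make the statistical notion of "distinctive" concrete, the sketch below shows what an oracle solution to the task looks like in plain Python: count each feature's document frequency across the collection and flag those below the rarity threshold (e.g., 10%). This is a hypothetical illustration of the task definition, not code from the DiFBench framework; the function and variable names are our own.

```python
from collections import Counter

def mine_distinctive_features(docs_features, rarity_threshold=0.10):
    """Return features appearing in fewer than `rarity_threshold`
    of the documents. `docs_features` is a list of per-document
    feature lists (hypothetical representation)."""
    n_docs = len(docs_features)
    # Document frequency: in how many documents each feature occurs.
    # Deduplicate per document so repeats within one doc count once.
    df = Counter()
    for features in docs_features:
        df.update(set(features))
    # A feature is distinctive if its document frequency is strictly
    # below the rarity threshold.
    return {f for f, count in df.items() if count / n_docs < rarity_threshold}

# Toy collection of 20 candidate profiles: "rust" and "haskell"
# each appear in 1/20 = 5% of documents, so both are distinctive;
# "python" (100%) and "sql" (95%) are frequent, not distinctive.
docs = [["python", "sql"]] * 18 + [["python", "rust"], ["python", "sql", "haskell"]]
print(mine_distinctive_features(docs))  # → {'rust', 'haskell'}
```

The paper's reported failure mode corresponds to a model behaving as if the strict `<` comparison were inverted, surfacing high-frequency features like "sql" instead of the genuinely rare ones.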