🤖 AI Summary
Existing LLM benchmarks predominantly emphasize information retrieval and summarization, overlooking models' ability to perform statistical reasoning and detect globally rare features—e.g., those occurring in <10% of documents—within document collections. Method: We propose the Distinctive Feature Mining (DFM) task, the first systematic evaluation of LLMs' capability to identify statistically salient features in small-to-medium document sets (10–40 documents), and introduce DiFBench—a configurable benchmark enabling multidimensional control over document scale, salience thresholds, and feature rarity. Contribution/Results: Extensive evaluation across 10 state-of-the-art models reveals that reasoning-enhanced models outperform general-purpose ones, yet all exhibit substantial performance degradation as document count or task complexity increases. Critically, models consistently misclassify high-frequency features as salient. This work uncovers a fundamental blind spot in LLMs' statistical reasoning and establishes a novel paradigm for evaluating models on differential analysis tasks.
📝 Abstract
Effective decision-making often relies on identifying what makes each candidate distinctive. While existing benchmarks for LLMs emphasize retrieving or summarizing information relevant to a given query, they do not evaluate a model's ability to identify globally distinctive features across a set of documents. We introduce Distinctive Feature Mining (DFM), a new task that challenges models to analyze a small-to-medium collection (10–40 documents) and surface features that are rare in the global context (e.g., appearing in fewer than 10% of documents). This setting mirrors real-world scenarios such as candidate selection or product differentiation, where statistical reasoning, not retrieval, is key. To enable systematic evaluation of this capability, we present DiFBench, a configurable benchmark creation framework with controllable parameters such as document set size and distinctiveness thresholds. Using DiFBench, we perform a large-scale assessment of distinctive feature mining across ten state-of-the-art LLMs. Our findings reveal a significant performance gap between general-purpose and reasoning-enhanced models. The performance of all models, however, degrades substantially as task complexity and document count increase. We also find that a common failure mode is misidentifying frequent features as distinctive. These insights reveal core limitations in contemporary LLMs' abilities to perform fine-grained statistical reasoning and rarity detection.
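To make the statistical notion of "distinctive" concrete, the sketch below shows what an oracle solution to the task looks like in plain Python: count each feature's document frequency across the collection and flag those below the rarity threshold (e.g., 10%). This is a hypothetical illustration of the task definition, not code from the DiFBench framework; the function and variable names are our own.

```python
from collections import Counter

def mine_distinctive_features(docs_features, rarity_threshold=0.10):
    """Return features appearing in fewer than `rarity_threshold`
    of the documents. `docs_features` is a list of per-document
    feature lists (hypothetical representation)."""
    n_docs = len(docs_features)
    # Document frequency: in how many documents each feature occurs.
    # Deduplicate per document so repeats within one doc count once.
    df = Counter()
    for features in docs_features:
        df.update(set(features))
    # A feature is distinctive if its document frequency is strictly
    # below the rarity threshold.
    return {f for f, count in df.items() if count / n_docs < rarity_threshold}

# Toy collection of 20 candidate profiles: "rust" and "haskell"
# each appear in 1/20 = 5% of documents, so both are distinctive;
# "python" (100%) and "sql" (95%) are frequent, not distinctive.
docs = [["python", "sql"]] * 18 + [["python", "rust"], ["python", "sql", "haskell"]]
print(mine_distinctive_features(docs))  # → {'rust', 'haskell'}
```

The paper's reported failure mode corresponds to a model behaving as if the strict `<` comparison were inverted, surfacing high-frequency features like "sql" instead of the genuinely rare ones.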