🤖 AI Summary
It remains unclear whether the strong performance of large language models (LLMs) on clinical multiple-choice diagnosis benchmarks (e.g., MedQA) stems from genuine implicit probabilistic reasoning or from shallower statistical regularities in their training data.
Method: We propose FBPR (Frequency-Based Probabilistic Ranker), a lightweight, corpus-driven baseline: a smoothed Naive Bayes model over co-occurrence statistics between medical concepts and diagnostic terms, extracted directly from LLM pretraining corpora (e.g., those of OLMo and Llama).
Contribution/Results: FBPR achieves MedQA performance comparable to that of the LLMs pretrained on the same corpora, yet exhibits highly complementary error patterns: the overlap in correctly answered questions is only marginally above chance, suggesting the two methods rely on different mechanisms. This work provides systematic evidence that simple probabilistic co-occurrence statistics constitute a surprisingly strong, interpretable diagnostic baseline and a useful reference point for efficient, transparent clinical AI modeling.
📝 Abstract
Large language models (LLMs) excel on multiple-choice clinical diagnosis benchmarks, yet it is unclear how much of this performance reflects underlying probabilistic reasoning. We study this question on MedQA, where the task is to select the most likely diagnosis. We introduce the Frequency-Based Probabilistic Ranker (FBPR), a lightweight method that scores answer options with a smoothed Naive Bayes model over concept-diagnosis co-occurrence statistics from a large corpus. When these statistics are sourced from the pretraining corpora of OLMo and Llama, FBPR achieves performance comparable to the corresponding LLM pretrained on the same corpus. Direct LLM inference and FBPR largely answer different questions correctly, with an overlap only slightly above random chance, indicating complementary strengths. These findings highlight the continued value of explicit probabilistic baselines: they provide a meaningful performance reference point and a complementary signal for potential hybridization. While LLM performance appears to be driven by a mechanism other than simple frequency aggregation, we show that an approach reminiscent of historically grounded, low-complexity expert systems still accounts for a substantial portion of benchmark performance.
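The scoring rule the abstract describes, a smoothed Naive Bayes over concept-diagnosis co-occurrence counts, can be sketched as follows. All counts, concept names, and function names here are illustrative assumptions for exposition; the paper's actual statistics come from LLM pretraining corpora.

```python
import math

# Illustrative (made-up) co-occurrence counts: cooc[d][c] = number of corpus
# documents in which concept c and diagnosis d co-occur. FBPR-style methods
# would extract such counts from a pretraining corpus.
cooc = {
    "pneumonia": {"fever": 900, "cough": 800, "rash": 30},
    "measles":   {"fever": 400, "cough": 100, "rash": 500},
}
diag_count = {"pneumonia": 1000, "measles": 600}  # marginal diagnosis counts

def fbpr_score(concepts, diagnosis, alpha=1.0, vocab_size=3):
    """Smoothed Naive Bayes log-score: log P(d) + sum_c log P(c | d)."""
    total = sum(diag_count.values())
    score = math.log(diag_count[diagnosis] / total)  # log-prior log P(d)
    for c in concepts:
        n_cd = cooc[diagnosis].get(c, 0)
        # Add-alpha (Laplace) smoothing over the concept vocabulary
        score += math.log((n_cd + alpha) /
                          (diag_count[diagnosis] + alpha * vocab_size))
    return score

def rank(concepts, options):
    """Pick the highest-scoring diagnosis among the multiple-choice options."""
    return max(options, key=lambda d: fbpr_score(concepts, d))

print(rank(["fever", "rash"], ["pneumonia", "measles"]))   # measles
print(rank(["fever", "cough"], ["pneumonia", "measles"]))  # pneumonia
```

Under this sketch, ranking answer options requires only aggregated counts and no neural inference, which is what makes the baseline interpretable and cheap to run.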