Whose Name Comes Up? Auditing LLM-Based Scholar Recommendations

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work lacks a multidimensional auditing framework for evaluating large language models (LLMs) in expert recommendation within physics. Method: We construct a novel, real-world academic benchmark grounded in APS and OpenAlex data, covering disciplinary, temporal, and seniority dimensions, and propose the first multidimensional fairness auditing framework for scholar recommendation—quantifying biases along gender, ethnicity, seniority, geography, and the “rich-get-richer” effect. Our methodology integrates factual verification benchmarks, consistency and format robustness analysis, bias metrics (e.g., representation ratio, citation preference), and cross-model stability comparison. Contribution/Results: All six open-source LLMs exhibit significant biases: systematic overranking of senior scholars, underrepresentation of Asian scientists, reinforcement of male-dominated patterns, and severe geographic homogeneity. Mixtral-8x7B demonstrates the highest stability, whereas Llama3.1-70B exhibits the greatest variability. This study establishes foundational infrastructure and metrics for equitable, domain-informed scholarly recommendation.
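Among the bias metrics named above is a representation ratio. The paper's exact formulation isn't reproduced on this page, but a minimal sketch under the usual definition (a group's share among recommended scholars divided by its share in the ground-truth population) could look like this; the group labels and counts below are hypothetical:

```python
from collections import Counter

def representation_ratio(recommended_groups, baseline_groups):
    """Ratio of each group's share in the recommendations to its share
    in a ground-truth baseline population.

    A ratio > 1 indicates over-representation; < 1 under-representation.
    """
    rec = Counter(recommended_groups)
    base = Counter(baseline_groups)
    n_rec, n_base = len(recommended_groups), len(baseline_groups)
    return {
        group: (rec.get(group, 0) / n_rec) / (count / n_base)
        for group, count in base.items()
    }

# Hypothetical example: gender shares in a top-10 list vs. a 70/30 field baseline
recs = ["M"] * 8 + ["F"] * 2
baseline = ["M"] * 70 + ["F"] * 30
print(representation_ratio(recs, baseline))
# → {'M': ~1.14, 'F': ~0.67}, i.e. men over-represented relative to the baseline
```

The same function applies unchanged to ethnicity, seniority, or geography labels, which is what makes a single ratio-style metric convenient for a multidimensional audit.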

📝 Abstract
This paper evaluates the performance of six open-weight LLMs (llama3-8b, llama3.1-8b, gemma2-9b, mixtral-8x7b, llama3-70b, llama3.1-70b) in recommending experts in physics across five tasks: top-k experts by field, influential scientists by discipline, epoch, seniority, and scholar counterparts. The evaluation examines consistency, factuality, and biases related to gender, ethnicity, academic popularity, and scholar similarity. Using ground-truth data from the American Physical Society and OpenAlex, we establish scholarly benchmarks by comparing model outputs to real-world academic records. Our analysis reveals inconsistencies and biases across all models. mixtral-8x7b produces the most stable outputs, while llama3.1-70b shows the highest variability. Many models exhibit duplication, and some, particularly gemma2-9b and llama3.1-8b, struggle with formatting errors. LLMs generally recommend real scientists, but accuracy drops in field-, epoch-, and seniority-specific queries, consistently favoring senior scholars. Representation biases persist, replicating gender imbalances (reflecting male predominance), under-representing Asian scientists, and over-representing White scholars. Despite some diversity in institutional and collaboration networks, models favor highly cited and productive scholars, reinforcing the rich-get-richer effect while offering limited geographical representation. These findings highlight the need to improve LLMs for more reliable and equitable scholarly recommendations.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM performance in expert recommendations across physics tasks
Identifying biases in gender, ethnicity, and academic popularity in LLM outputs
Assessing accuracy and consistency of LLM-based scholar recommendations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates six LLMs for expert recommendations
Uses APS and OpenAlex as ground-truth data
Analyzes biases in gender, ethnicity, citations
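The consistency analysis above compares repeated model outputs for the same prompt. The paper's exact stability metric isn't specified on this page; a minimal sketch of one plausible measure, the mean pairwise Jaccard overlap between recommendation lists across runs, with hypothetical scholar names:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two recommendation lists (as sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def mean_pairwise_jaccard(runs):
    """Average Jaccard overlap over all pairs of runs: 1.0 means the
    model returns identical lists every time; lower values mean churn."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Three hypothetical top-3 lists from repeated identical prompts
runs = [
    ["Einstein", "Feynman", "Curie"],
    ["Einstein", "Feynman", "Dirac"],
    ["Einstein", "Curie", "Dirac"],
]
print(mean_pairwise_jaccard(runs))
# → 0.5 (each pair shares 2 of 4 distinct names)
```

Comparing this score across models is one way to operationalize the finding that mixtral-8x7b is the most stable and llama3.1-70b the most variable.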
Daniele Barolo
Complexity Science Hub
Chiara Valentin
Graz University of Technology
Fariba Karimi
Graz University of Technology (TU Graz) / Complexity Science Hub (CSH)
ERC Network Fairness · Complex systems · Computational Social Science
Luis Galárraga
INRIA/IRISA Research Center
Gonzalo G. Méndez
Polytechnic University of Valencia
Lisette Espín-Noboa
Complexity Science Hub, Graz University of Technology, Central European University