Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of automatic evaluation for medical open-domain question answering in low-resource languages such as French, where expert annotations are scarce. The authors propose a generator-aware evaluation framework to systematically compare the effectiveness of general-purpose and biomedical-domain-adapted large language models (LLMs) as judges of semantic equivalence. They further investigate lightweight supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) strategies to enhance evaluation robustness. Results demonstrate that domain-adapted LLMs and larger general-purpose models achieve the highest agreement with human experts. Moreover, smaller models refined via SFT and GRPO exhibit significantly reduced sensitivity to the choice of answer generator, enabling efficient and reliable automatic evaluation in low-resource settings.

📝 Abstract
Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings.
Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-Judge
medical open-ended QA
semantic equivalence
automatic evaluation
low-resource
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-a-Judge
medical open-ended QA
generator-aware evaluation
supervised fine-tuning
Group Relative Policy Optimization
Ikram Belmadani
Aix-Marseille Univ., CNRS, LIS UMR 7020, 13000 Marseille, France
Oumaima El Khettari
Nantes Univ., École Centrale Nantes, CNRS, LS2N, UMR 6004, 44000 Nantes, France
Pacôme Constant dit Beaufils
Nantes Université, CHU Nantes, PHU 11: Santé Publique, Clinique des données, INSERM, CIC 1413, 44000 Nantes, France
Richard Dufour
LS2N - TALN/NLP research group - Nantes University
Natural language processing · Biomedical domain · Language modeling · Spontaneous speech
Benoit Favre
Professeur CNU 27, LIS UMR 7020, Aix-Marseille University
Natural Language Processing · Spoken Language Understanding · Parsing · Machine Learning