🤖 AI Summary
This work addresses the challenge of automatic evaluation for medical open-ended question answering (OEQA) in low-resource languages such as French, where expert annotations are scarce. The authors propose a generator-aware evaluation framework to systematically compare general-purpose and biomedical-domain-adapted large language models (LLMs) as judges of semantic equivalence. They further investigate lightweight supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) strategies to improve evaluation robustness. Results show that domain-adapted LLMs and larger general-purpose models achieve the highest agreement with human experts. Moreover, smaller models refined via SFT and GRPO exhibit substantially reduced sensitivity to the choice of answer generator, enabling efficient and reliable automatic evaluation in low-resource settings.
📝 Abstract
Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings.