Filling in the Clinical Gaps in Benchmark: Case for HealthBench for the Japanese medical system

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Japan lacks a localized benchmark for evaluating large language models (LLMs) in medical applications. Method: This study systematically assesses the applicability of HealthBench to Japanese clinical settings through machine translation, LLM-as-a-Judge automated classification, and empirical evaluation across its 5,000 scenarios. Contribution/Results: The study identifies critical issues arising from direct translation, including mismatches with Japanese clinical guidelines, healthcare system discrepancies, and cultural norm conflicts, and argues for J-HealthBench, a context-aware, Japan-specific medical evaluation benchmark emphasizing clinical integrity and cultural adaptation. Experiments show a modest performance drop for GPT-4.1 caused by non-localized rubric criteria, a severe failure of the Japanese-native model owing to gaps in clinical completeness, and that over 60% of HealthBench's rubric items require structural redefinition to align with Japanese clinical practice, even though most scenarios themselves remain applicable. This work establishes a methodological paradigm and a reusable localization framework for cross-lingual medical AI evaluation.
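
To make the evaluation recipe concrete, here is a minimal sketch, not the authors' code, of a HealthBench-style scoring loop: an LLM judge checks a model's Japanese answer against each machine-translated rubric criterion, and the score is the share of achievable points earned. The `call_llm` helper and all prompt wording are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class RubricCriterion:
    text_ja: str   # machine-translated criterion text (Japanese)
    points: float  # positive for desired behavior, negative for penalized behavior


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call (e.g., to a judge model)."""
    raise NotImplementedError


def judge_criterion(answer_ja: str, criterion: RubricCriterion) -> bool:
    """LLM-as-a-Judge: ask whether the answer satisfies one rubric criterion."""
    prompt = (
        "You are grading a Japanese medical answer.\n"
        f"Answer:\n{answer_ja}\n\n"
        f"Criterion:\n{criterion.text_ja}\n\n"
        "Reply with exactly YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")


def score_answer(answer_ja: str, rubric: list[RubricCriterion]) -> float:
    """HealthBench-style score: points earned over achievable positive points,
    clipped to the [0, 1] range."""
    earned = sum(c.points for c in rubric if judge_criterion(answer_ja, c))
    achievable = sum(c.points for c in rubric if c.points > 0)
    return min(1.0, max(0.0, earned / achievable)) if achievable else 0.0
```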

📝 Abstract
This study investigates the applicability of HealthBench, a large-scale, rubric-based medical benchmark, to the Japanese context. While robust evaluation frameworks are crucial for the safe development of medical LLMs, resources in Japanese remain limited, often relying on translated multiple-choice questions. Our research addresses this gap by first establishing a performance baseline, applying a machine-translated version of HealthBench's 5,000 scenarios to evaluate both a high-performing multilingual model (GPT-4.1) and a Japanese-native open-source model (LLM-jp-3.1). Second, we employ an LLM-as-a-Judge approach to systematically classify the benchmark's scenarios and rubric criteria, identifying "contextual gaps" where content is misaligned with Japan's clinical guidelines, healthcare systems, or cultural norms. Our findings reveal a modest performance drop in GPT-4.1 due to rubric mismatches and a significant failure in the Japanese-native model, which lacked the required clinical completeness. Furthermore, our classification indicates that while the majority of scenarios are applicable, a substantial portion of the rubric criteria requires localization. This work underscores the limitations of direct benchmark translation and highlights the urgent need for a context-aware, localized adaptation, J-HealthBench, to ensure the reliable and safe evaluation of medical LLMs in Japan.
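
The gap analysis itself can be pictured as a second judging pass. Below is a minimal sketch, again not the authors' implementation, in which a judge model assigns each translated scenario or rubric criterion to one of the mismatch categories named in the abstract; the category labels and prompt wording are assumptions for illustration.

```python
# Hypothetical LLM-as-a-Judge pass that flags "contextual gaps" in a
# translated benchmark item; labels mirror the paper's three mismatch types.
GAP_CATEGORIES = (
    "applicable_as_is",
    "clinical_guideline_mismatch",
    "healthcare_system_mismatch",
    "cultural_norm_mismatch",
)


def classify_gap(item_text_ja: str, call_llm) -> str:
    """Return the gap category a judge model assigns to one benchmark item."""
    prompt = (
        "The following HealthBench item was machine-translated for use in Japan.\n"
        f"Item:\n{item_text_ja}\n\n"
        "Classify it with exactly one label from: " + ", ".join(GAP_CATEGORIES)
    )
    label = call_llm(prompt).strip()
    return label if label in GAP_CATEGORIES else "unparsed"
```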
Problem

Research questions and friction points this paper is trying to address.

Japan lacks localized benchmarks appropriate for evaluating medical LLMs
Direct translation of medical benchmarks creates contextual mismatches with Japanese clinical guidelines, healthcare systems, and cultural norms
Existing Japanese medical evaluation resources rely heavily on translated multiple-choice questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applied machine-translated HealthBench scenarios to both a multilingual model (GPT-4.1) and a Japanese-native model (LLM-jp-3.1)
Used an LLM-as-a-Judge to classify contextual gaps in the benchmark's scenarios and rubric criteria (see the sketch after this list)
Identified that a substantial portion of the rubric criteria requires localized adaptation
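
As referenced in the list above, once every scenario and rubric criterion carries a gap label, quantifying how much of the benchmark needs localization reduces to a simple tally. A small illustrative helper, reusing the hypothetical category labels from the earlier sketch:

```python
from collections import Counter


def gap_breakdown(labels: list[str]) -> dict[str, float]:
    """Share of items per contextual-gap category, e.g. to report what
    fraction of rubric criteria needs Japan-specific rewriting."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


# Toy usage with made-up labels:
print(gap_breakdown([
    "applicable_as_is", "clinical_guideline_mismatch",
    "healthcare_system_mismatch", "applicable_as_is",
]))
# {'applicable_as_is': 0.5, 'clinical_guideline_mismatch': 0.25,
#  'healthcare_system_mismatch': 0.25}
```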
Shohei Hisada
Nara Institute of Science and Technology, Nara, Japan
Endo Sunao
Nara Institute of Science and Technology, Nara, Japan
Himi Yamato
Nara Institute of Science and Technology, Nara, Japan
Shoko Wakamiya
Nara Institute of Science and Technology, Nara, Japan
Eiji Aramaki
Nara Institute of Science and Technology, Nara, Japan