CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation

📅 2026-04-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluation methods for CT report generation rely on coarse-grained matching, which fails to capture the fine-grained diagnostic accuracy required in clinical practice. To address this limitation, this work proposes CT-FineBench, a fine-grained evaluation benchmark tailored to clinically relevant diagnostic attributes. By extracting key attributes such as location, size, and margin from reference reports and structuring them into question-answer pairs, CT-FineBench quantifies the factual consistency of generated reports at the level of individual attributes. The benchmark is built from the CT-RATE and Merlin datasets and uses an automated scoring mechanism that, according to the authors' experiments, correlates better with expert evaluations and is more sensitive to fine-grained errors than existing metrics, yielding more interpretable assessment outcomes.

📝 Abstract

The evaluation of generated reports remains a critical challenge in Computed Tomography (CT) report generation, due to the large volume of text, the diversity and complexity of findings, and the presence of fine-grained, disease-oriented attributes. Conventional evaluation metrics offer only coarse measures of lexical overlap or entity matching and fail to reflect the granular diagnostic accuracy required for clinical use. To address this gap, we propose CT-FineBench, a benchmark built from CT-RATE and Merlin to evaluate the fine-grained factual consistency of CT reports. The benchmark is constructed through a Question-Answering (QA) based process. First, we identify and structure key, finding-specific clinical attributes (such as location, size, and margin). Second, we systematically transform these attributes into a QA dataset, where questions probe for specific clinical details grounded in gold-standard reports. The evaluation protocol for CT-FineBench uses this QA dataset to query a machine-generated report and scores the correctness of the answers. This allows for a comprehensive, interpretable, and clinically relevant assessment, moving beyond superficial lexical overlap to pinpoint specific clinical errors. Experiments show that CT-FineBench correlates better with expert clinical assessment and is substantially more sensitive to fine-grained factual errors than prior metrics.
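
The QA-based protocol described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `QAItem` structure, the `keyword_answerer` function, and the example report text are all hypothetical, and the toy answerer (a verbatim substring check) merely stands in for whatever QA model the benchmark actually uses. The sketch only shows the general shape of the idea: attribute-level questions grounded in a reference report are posed against a generated report, and the fraction of correct answers is reported overall and per attribute.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class QAItem:
    """One attribute-level question derived from a reference report (hypothetical schema)."""
    finding: str      # e.g. "nodule"
    attribute: str    # e.g. "location", "size", "margin"
    question: str
    gold_answer: str  # answer grounded in the reference report


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def keyword_answerer(report: str, item: QAItem) -> Optional[str]:
    """Toy stand-in for a QA model (assumption, not the paper's method).

    Returns the gold answer only if it appears verbatim in the report.
    Substring matching is crude: "8 mm" would also match "18 mm".
    """
    if normalize(item.gold_answer) in normalize(report):
        return item.gold_answer
    return None


def score_report(
    report: str,
    items: List[QAItem],
    answerer: Callable[[str, QAItem], Optional[str]],
) -> Dict[str, object]:
    """Query the generated report with every QA item and compute accuracy."""
    counts: Dict[str, List[int]] = {}  # attribute -> [correct, total]
    for item in items:
        predicted = answerer(report, item)
        correct = (
            predicted is not None
            and normalize(predicted) == normalize(item.gold_answer)
        )
        bucket = counts.setdefault(item.attribute, [0, 0])
        bucket[0] += int(correct)
        bucket[1] += 1
    total_correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return {
        "overall": total_correct / total if total else 0.0,
        "by_attribute": {a: c / t for a, (c, t) in counts.items()},
    }


# Illustrative QA items, as if derived from a reference report (made-up data).
items = [
    QAItem("nodule", "location", "Where is the nodule located?", "right upper lobe"),
    QAItem("nodule", "size", "What is the size of the nodule?", "8 mm"),
    QAItem("nodule", "margin", "What is the margin of the nodule?", "spiculated"),
]

# A generated report that gets location and size right but the margin wrong.
generated = "An 8 mm nodule with smooth margins in the right upper lobe."

result = score_report(generated, items, keyword_answerer)
print(result)
```

Reporting accuracy per attribute, rather than only a single overall number, is what makes this kind of score interpretable: in the example the margin error is isolated even though the report overlaps heavily with the reference wording.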

Problem

Research questions and friction points this paper is trying to address.

CT report generation

evaluation benchmark

factual consistency

fine-grained evaluation

diagnostic fidelity

Innovation

Methods, ideas, or system contributions that make the work stand out.

CT report generation

fine-grained evaluation

factual consistency