🤖 AI Summary
Current evaluation of pulmonary CT models relies on static retrospective data, which makes it difficult to disentangle how factors such as lesion size and location affect model performance. This work proposes iTRIALSPACE, a programmable virtual-lesion benchmarking framework that enables controlled model assessment across 13 trial modes. The framework uses a four-stage pipeline: multi-dataset nodule profiling, explicit trial specification, anatomy-aware mask insertion, and ControlNet-conditioned CT synthesis. Drawing on a 54-attribute nodule-profile library derived from multi-source CT data, the method generates synthetic images whose Fréchet Inception Distance (FID) stays within the real-to-real baseline. Across 55,469 virtual lesion samples, model performance rankings on synthetic data agreed strongly with rankings on real clinical data (ρ = 0.93), and the controlled trials uncovered biases that conventional benchmarks fail to detect, such as a collapse in size prediction under lobe-equalized sampling.
📝 Abstract
We introduce iTRIALSPACE, a programmable evaluation framework for controlled assessment of lung CT models. Standard benchmarks are static retrospective collections that entangle lesion size, lobe prevalence, anatomy, and acquisition context, making it difficult to determine what structurally drives model accuracy. iTRIALSPACE addresses this limitation by composing real clinical CTs and lesion profiles into controlled virtual lesion trials through a four-stage pipeline: multi-dataset nodule profiling, explicit trial specification, anatomy-aware mask insertion, and ControlNet-conditioned CT synthesis. The framework is built on a unified 54-attribute nodule-profile dataset spanning 13,140 annotated nodules from seven public CT sources and instantiated as 13 trial modes. We evaluate iTRIALSPACE in a 55,469-sample Virtual Lesion Study spanning three medical VLMs, four spatial-guidance conditions, and three clinical tasks. Across all 13 modes, the synthetic substrate remains within the real-to-real FID baseline, and synthetic performance rankings transfer strongly to real clinical data ($\rho = 0.93$, $p < 10^{-15}$). Controlled trial modes expose findings unavailable to fixed-distribution benchmarks, including shortcut-driven size prediction collapse under lobe-equalized sampling and host-to-donor variance ratios of 8.9× and 3.3× in twin-cross analysis. These results position iTRIALSPACE as an auditable evaluation infrastructure for controlled, falsifiable testing beyond static retrospective benchmarks.
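As a rough illustration of the ranking-transfer check, the sketch below computes a rank correlation between paired synthetic and real scores. It assumes that $\rho$ denotes Spearman's rank correlation, which the abstract does not state explicitly. The function names and the toy accuracy values are hypothetical and are not taken from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-configuration accuracies (illustrative values only).
synthetic_scores = [0.62, 0.71, 0.55, 0.80, 0.67]
real_scores = [0.60, 0.69, 0.58, 0.78, 0.70]

rho = spearman_rho(synthetic_scores, real_scores)
print(f"rho = {rho:.2f}")  # rho = 0.90 for these toy values
```

A value near 1 means that configurations ranked highly on synthetic trials are also ranked highly on real data. The significance test behind the reported $p$-value is not reproduced here.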