🤖 AI Summary
This work addresses the widespread overestimation of performance in semi-supervised 3D medical image segmentation, which it traces to two sources: confirmation bias in pseudo-labeling and the use of test sets for validation. To mitigate these issues, we propose TCSeg, a tri-space calibrated segmentation framework that explicitly decouples prediction confidence from uncertainty. TCSeg relies on a dual-axis reliability assessment mechanism to detect and correct confirmation bias in pseudo-labels jointly across three complementary spaces: feature, probability, and image. Furthermore, we advocate reporting final-checkpoint results over multiple runs to establish a more rigorous benchmarking standard. Experiments on three benchmark datasets show that TCSeg performs strongly under existing evaluation protocols, and the analysis raises the concern that some reported state-of-the-art gains may reflect overfitting rather than genuine progress.
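To make the idea of separating confidence from uncertainty concrete, the following is a minimal sketch and not the TCSeg implementation. It assumes several stochastic forward passes per voxel (for example, from dropout or perturbed inputs), takes confidence as the mean probability of the predicted class, and takes uncertainty as the variance of that probability across passes. The function name `dual_axis_reliability`, the thresholds, and the four status labels are all hypothetical choices made for illustration.

```python
from statistics import mean, pvariance


def dual_axis_reliability(passes, conf_thresh=0.8, unc_thresh=0.01):
    """Score one voxel on two separate axes (illustrative assumption only).

    passes: list of K class-probability vectors, one per stochastic
            forward pass for the same voxel.
    Returns (label, confidence, uncertainty, status).
    """
    num_classes = len(passes[0])
    # Average the class probabilities over the K passes.
    mean_probs = [mean([p[c] for p in passes]) for c in range(num_classes)]
    # The pseudo-label is the class with the highest mean probability.
    label = max(range(num_classes), key=lambda c: mean_probs[c])
    # Axis 1: confidence, how high the predicted-class probability is.
    confidence = mean_probs[label]
    # Axis 2: uncertainty, how much that probability varies across passes.
    uncertainty = pvariance([p[label] for p in passes])

    # Hypothetical four-way split over the two axes.
    if confidence >= conf_thresh and uncertainty <= unc_thresh:
        status = "reliable"
    elif confidence >= conf_thresh:
        status = "overconfident"  # high confidence, but unstable across passes
    elif uncertainty <= unc_thresh:
        status = "ambiguous"      # stable, but consistently low confidence
    else:
        status = "unreliable"
    return label, confidence, uncertainty, status


if __name__ == "__main__":
    stable = [[0.90, 0.10], [0.92, 0.08], [0.88, 0.12]]
    unstable = [[0.99, 0.01], [0.60, 0.40], [0.99, 0.01]]
    print(dual_axis_reliability(stable))
    print(dual_axis_reliability(unstable))
```

In this sketch, a voxel whose mean confidence is high but whose predictions fluctuate across passes is flagged separately from one that is both confident and stable. A single confidence threshold would accept both, which is the kind of conflation the summary describes.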
📝 Abstract
Semi-supervised learning has become a dominant paradigm for reducing annotation costs. However, we argue that current progress is clouded by a twofold overconfidence problem. Algorithmically, mainstream pseudo-labeling frameworks often conflate prediction confidence with uncertainty, leading to severe confirmation bias. Strategically, because several benchmark datasets lack dedicated validation sets, some studies also use the test set for validation, which inflates performance estimates. Subsequent methods are then compelled to adopt the same strategy to surpass the reported state of the art (SOTA), triggering an arms race of overfitting. This raises the concern that the impressive numerical gains reported in the community may reflect overfitting rather than genuine progress. We therefore propose TCSeg, a tri-space calibrated segmentation framework built on a principled dual-axis reliability assessment engine. It explicitly decouples confidence from uncertainty and uses the resulting signal to detect and correct confirmation bias collaboratively across the feature, probability, and image spaces. Across three benchmark datasets, TCSeg consistently delivers strong performance under existing evaluation protocols. More importantly, we advocate that the community report final-checkpoint results over multiple runs, thereby establishing more rigorous benchmarks that give a more realistic picture of performance. Code will be available at github.com/DirkLiii/TCSeg.
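The reporting protocol advocated above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the TCSeg repository: it assumes each training run logs a test Dice score for every saved checkpoint, and the function name `summarize_runs` and the numbers in the usage example are made up for demonstration only.

```python
from statistics import mean, stdev


def summarize_runs(test_dice_per_run):
    """Compare final-checkpoint reporting with best-on-test selection.

    test_dice_per_run: list of runs (e.g., different random seeds); each
    run is a list of test Dice scores, one per saved checkpoint, in
    training order.
    """
    if len(test_dice_per_run) < 2:
        raise ValueError("need at least two runs to report a spread")
    # Final-checkpoint protocol: take the last checkpoint of every run.
    final_scores = [run[-1] for run in test_dice_per_run]
    # Leaky protocol: pick the checkpoint that scores best on the test set.
    best_scores = [max(run) for run in test_dice_per_run]
    return {
        "final_mean": mean(final_scores),
        "final_std": stdev(final_scores),
        "best_on_test_mean": mean(best_scores),
    }


if __name__ == "__main__":
    # Illustrative numbers only; these are not results from the paper.
    runs = [
        [0.80, 0.86, 0.84],
        [0.79, 0.85, 0.85],
        [0.81, 0.88, 0.83],
    ]
    report = summarize_runs(runs)
    print(f"final checkpoint: {report['final_mean']:.4f} "
          f"+/- {report['final_std']:.4f}")
    print(f"best on test:     {report['best_on_test_mean']:.4f}")
```

Because the maximum over checkpoints can never be lower than the last checkpoint, the best-on-test mean is always at least the final-checkpoint mean. The gap between the two gives a rough sense of how much a score may owe to checkpoint selection on the test set.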