🤖 AI Summary
Accurate 3D CT reconstruction from extremely sparse-view (<10) X-ray projections remains challenging due to the limited representational capacity of CNNs and scarcity of high-quality training data.
Method: We propose X-LRM, a feedforward framework whose X-former module pairs an MLP-based image tokenizer with a Transformer encoder to handle an arbitrary number of input projection views; its X-triplane module models the 3D volumetric attenuation coefficients as an implicit neural field, improving geometric consistency and fine-detail recovery. To support training, we construct Torso-16K, a large-scale dataset of over 16K paired X-ray projections and ground-truth volumes of various torso organs.
Contribution/Results: Experiments demonstrate that X-LRM achieves high-fidelity reconstruction in under one second, improving PSNR by 1.5 dB over the state-of-the-art method while accelerating inference by 27×. Moreover, it substantially improves downstream lung segmentation, underscoring its practical value for low-dose CT reconstruction.
📝 Abstract
Sparse-view 3D CT reconstruction aims to recover volumetric structures from a limited number of 2D X-ray projections. Existing feedforward methods are constrained by the limited capacity of CNN-based architectures and the scarcity of large-scale training datasets. In this paper, we propose an X-ray Large Reconstruction Model (X-LRM) for extremely sparse-view (<10 views) CT reconstruction. X-LRM consists of two key components: X-former and X-triplane. Our X-former can handle an arbitrary number of input views using an MLP-based image tokenizer and a Transformer-based encoder. The output tokens are then upsampled into our X-triplane representation, which models the 3D radiodensity as an implicit neural field. To support the training of X-LRM, we introduce Torso-16K, a large-scale dataset comprising over 16K volume-projection pairs of various torso organs. Extensive experiments demonstrate that X-LRM outperforms the state-of-the-art method by 1.5 dB while running 27× faster and offering greater flexibility. Furthermore, downstream evaluation on lung segmentation also suggests the practical value of our approach. Our code, pre-trained models, and dataset will be released at https://github.com/caiyuanhao1998/X-LRM
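To make the triplane idea concrete, here is a minimal sketch of how an implicit field over three axis-aligned feature planes can be queried at a 3D point: bilinearly sample the XY, XZ, and YZ planes, fuse the features, and decode a non-negative radiodensity. The plane resolution, channel count, feature fusion by summation, and the single-layer decoder with a softplus are illustrative assumptions for exposition, not the actual X-triplane architecture.

```python
import math

R = 4  # plane resolution (assumed for illustration)
C = 2  # feature channels per plane (assumed)

def make_plane(fill):
    # An R x R grid of C-dimensional feature vectors.
    return [[[fill] * C for _ in range(R)] for _ in range(R)]

def bilerp(plane, u, v):
    # Bilinearly sample a plane at continuous coords u, v in [0, 1].
    x, y = u * (R - 1), v * (R - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, R - 1), min(y0 + 1, R - 1)
    fx, fy = x - x0, y - y0
    out = []
    for c in range(C):
        top = plane[y0][x0][c] * (1 - fx) + plane[y0][x1][c] * fx
        bot = plane[y1][x0][c] * (1 - fx) + plane[y1][x1][c] * fx
        out.append(top * (1 - fy) + bot * fy)
    return out

def query_density(planes, x, y, z, weights, bias):
    # Project the 3D point onto the three planes, sum the sampled
    # features, and decode with one linear layer + softplus so the
    # predicted radiodensity stays non-negative.
    f_xy = bilerp(planes["xy"], x, y)
    f_xz = bilerp(planes["xz"], x, z)
    f_yz = bilerp(planes["yz"], y, z)
    feat = [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]
    pre = sum(w * f for w, f in zip(weights, feat)) + bias
    return math.log1p(math.exp(pre))  # softplus

# Hypothetical constant planes and decoder weights, just to show a query.
planes = {"xy": make_plane(0.5), "xz": make_plane(0.5), "yz": make_plane(0.5)}
density = query_density(planes, 0.3, 0.7, 0.2, weights=[0.1, -0.2], bias=0.05)
```

In a trained model the planes would be produced by upsampling the Transformer's output tokens, and the decoder would be a small MLP queried densely over the volume to render the CT reconstruction.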