🤖 AI Summary
Distinguishing cardiogenic pulmonary edema (CPE), non-cardiogenic pathologies (e.g., ARDS-like inflammatory patterns or interstitial lung disease), and normal lung tissue in lung ultrasound (LUS) videos remains challenging due to severe visual heterogeneity and the overlap between B-lines and pleural artifacts. To address this, we propose ZACH-ViT, a lightweight, permutation-invariant Vision Transformer that eliminates positional encoding and the [CLS] token, adopting a zero-token hierarchical architecture. We further introduce ShuffleStrides, a data augmentation technique designed for probe-scan sequences, to improve generalization under limited data. Evaluated on 380 clinical LUS videos, our model achieves a ROC-AUC of 0.79 (sensitivity: 0.60; specificity: 0.91), trains 1.35× faster than a minimal ViT baseline, and uses only 40% of that baseline's parameters. To our knowledge, this is the first method to enable fully order-agnostic modeling of medical ultrasound video inputs.
📝 Abstract
Differentiating cardiogenic pulmonary oedema (CPE) from non-cardiogenic and structurally normal lungs in lung ultrasound (LUS) videos remains challenging due to the high visual variability of non-cardiogenic inflammatory patterns (NCIP/ARDS-like), interstitial lung disease, and healthy lungs. This heterogeneity complicates automated classification, as overlapping B-lines and pleural artefacts are common. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a 0.25M-parameter Vision Transformer variant that removes both positional embeddings and the [CLS] token, making it fully permutation-invariant and suitable for unordered medical image data. To enhance generalization, we propose ShuffleStrides Data Augmentation (SSDA), which permutes probe-view sequences and frame orders while preserving anatomical validity. ZACH-ViT was evaluated on 380 LUS videos from 95 critically ill patients against nine state-of-the-art baselines. Despite the heterogeneity of the non-cardiogenic group, ZACH-ViT achieved the highest validation and test ROC-AUC (0.80 and 0.79) with balanced sensitivity (0.60) and specificity (0.91), while all competing models collapsed to trivial classification. It trains 1.35× faster than Minimal ViT (0.62M parameters) with 2.5× fewer parameters, supporting real-time clinical deployment. These results show that aligning architectural design with data structure can outperform scale in small-data medical imaging.
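The key architectural claim is that dot-product self-attention with no positional embeddings is permutation-equivariant, and replacing the [CLS] token with a pooled readout makes the whole encoder permutation-invariant. A minimal NumPy sketch (not the authors' implementation; weights and dimensions are illustrative) demonstrates the property:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Plain dot-product self-attention with NO positional encoding:
    # permuting the input rows permutes the output rows identically
    # (permutation equivariance).
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return scores @ V

def encode_unordered(tokens, Wq, Wk, Wv):
    # Mean pooling over tokens (instead of a [CLS] token) turns
    # equivariance into full permutation invariance.
    return self_attention(tokens, Wq, Wk, Wv).mean(axis=0)

rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(5, d))             # 5 unordered patch tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = encode_unordered(tokens, Wq, Wk, Wv)
perm_out = encode_unordered(tokens[[3, 1, 4, 0, 2]], Wq, Wk, Wv)
assert np.allclose(out, perm_out)            # same embedding, any token order
```

Feeding the tokens in any order yields the same pooled embedding, which is why the model can treat probe-scan frames as an unordered set.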
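ShuffleStrides is described as permuting probe-view sequences and frame orders while keeping each view anatomically intact. A toy sketch of that idea (the function name, interface, and nested-list representation are assumptions, not the paper's code) could look like:

```python
import random

def shufflestrides(views, seed=None):
    """Sketch of a ShuffleStrides-style augmentation (assumed interface).

    `views` is a list of probe-view clips, each clip a list of frames.
    The probe-view order and each clip's internal frame order are
    permuted, but frames never migrate between views, so each view's
    anatomy stays intact.
    """
    rng = random.Random(seed)
    augmented = [list(clip) for clip in views]  # copy; don't mutate input
    for clip in augmented:
        rng.shuffle(clip)                       # permute frame order per view
    rng.shuffle(augmented)                      # permute probe-view order
    return augmented

# Toy example: 3 probe views with labelled frames.
video = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3", "c4"]]
aug = shufflestrides(video, seed=42)
# Same frames grouped per view, just reordered.
assert sorted(map(sorted, aug)) == sorted(map(sorted, video))
```

Each augmented sample presents the network with the same clinical content in a new order, which is exactly the regime a permutation-invariant encoder is built to exploit.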