🤖 AI Summary
Distinguishing cardiogenic pulmonary edema (CPE), non-cardiogenic pathologies (e.g., ARDS-like inflammatory patterns or interstitial lung disease), and normal lung tissue in lung ultrasound (LUS) videos remains challenging due to severe visual heterogeneity and the overlap between B-lines and pleural artifacts. To address this, we propose ZACH-ViT, a lightweight, permutation-invariant Vision Transformer that eliminates positional encoding and the [CLS] token, adopting a zero-token hierarchical architecture. We further introduce ShuffleStrides, a data augmentation technique designed for probe-scan sequences, to improve generalization under limited data. Evaluated on 380 clinical LUS videos, our model achieves a ROC-AUC of 0.79 (sensitivity: 0.60; specificity: 0.91), trains 1.35× faster than a minimal ViT baseline, and uses only 40% of that baseline's parameters. To our knowledge, this is the first method to enable fully order-agnostic modeling of medical ultrasound video inputs.
📝 Abstract
Differentiating cardiogenic pulmonary oedema (CPE) from non-cardiogenic and structurally normal lungs in lung ultrasound (LUS) videos remains challenging due to the high visual variability of non-cardiogenic inflammatory patterns (NCIP/ARDS-like), interstitial lung disease, and healthy lungs. This heterogeneity complicates automated classification, as overlapping B-lines and pleural artefacts are common. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a 0.25M-parameter Vision Transformer variant that removes both positional embeddings and the [CLS] token, making it fully permutation-invariant and suitable for unordered medical image data. To enhance generalization, we propose ShuffleStrides Data Augmentation (SSDA), which permutes probe-view sequences and frame orders while preserving anatomical validity. ZACH-ViT was evaluated on 380 LUS videos from 95 critically ill patients against nine state-of-the-art baselines. Despite the heterogeneity of the non-cardiogenic group, ZACH-ViT achieved the highest validation and test ROC-AUC (0.80 and 0.79) with balanced sensitivity (0.60) and specificity (0.91), while all competing models collapsed to trivial classification. It trains 1.35× faster than Minimal ViT (0.62M parameters) with 2.5× fewer parameters, supporting real-time clinical deployment. These results show that aligning architectural design with data structure can outperform scale in small-data medical imaging.
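The key architectural claim is that dot-product self-attention with no positional embeddings is permutation-equivariant, and replacing the [CLS] token with a pooled readout makes the whole encoder permutation-invariant. A minimal NumPy sketch (not the authors' implementation; weights and dimensions are illustrative) demonstrates the property:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Plain dot-product self-attention with NO positional encoding:
    # permuting the input rows permutes the output rows identically
    # (permutation equivariance).
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return scores @ V

def encode_unordered(tokens, Wq, Wk, Wv):
    # Mean pooling over tokens (instead of a [CLS] token) turns
    # equivariance into full permutation invariance.
    return self_attention(tokens, Wq, Wk, Wv).mean(axis=0)

rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(5, d))             # 5 unordered patch tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = encode_unordered(tokens, Wq, Wk, Wv)
perm_out = encode_unordered(tokens[[3, 1, 4, 0, 2]], Wq, Wk, Wv)
assert np.allclose(out, perm_out)            # same embedding, any token order
```

Feeding the tokens in any order yields the same pooled embedding, which is why the model can treat probe-scan frames as an unordered set.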
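ShuffleStrides is described as permuting probe-view sequences and frame orders while keeping each view anatomically intact. A toy sketch of that idea (the function name, interface, and nested-list representation are assumptions, not the paper's code) could look like:

```python
import random

def shufflestrides(views, seed=None):
    """Sketch of a ShuffleStrides-style augmentation (assumed interface).

    `views` is a list of probe-view clips, each clip a list of frames.
    The probe-view order and each clip's internal frame order are
    permuted, but frames never migrate between views, so each view's
    anatomy stays intact.
    """
    rng = random.Random(seed)
    augmented = [list(clip) for clip in views]  # copy; don't mutate input
    for clip in augmented:
        rng.shuffle(clip)                       # permute frame order per view
    rng.shuffle(augmented)                      # permute probe-view order
    return augmented

# Toy example: 3 probe views with labelled frames.
video = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3", "c4"]]
aug = shufflestrides(video, seed=42)
# Same frames grouped per view, just reordered.
assert sorted(map(sorted, aug)) == sorted(map(sorted, video))
```

Each augmented sample presents the network with the same clinical content in a new order, which is exactly the regime a permutation-invariant encoder is built to exploit.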