🤖 AI Summary
Existing video-based deep learning models for cardiac amyloidosis (CA) classification often rely on clinically irrelevant regions of echocardiographic videos, undermining interpretability and robustness.
Method: We propose an anatomy-constrained Video Transformer framework: (1) dynamically generating myocardial masks from endo- and epicardial point clouds to extract only myocardial image patches and corresponding deformation points as tokens; (2) embedding this anatomical prior into the masked autoencoder (MAE) pretraining objective to enforce focus on pathologically relevant myocardial motion patterns; and (3) leveraging attention visualization to spatially localize decision evidence exclusively within the myocardium.
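As a toy illustration of step (1), the anatomy-constrained tokenization might look like the following NumPy sketch. This is not the paper's implementation: the annulus stand-in for the myocardial mask (a ring between an inner endocardial and outer epicardial contour), the patch size, and the overlap threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def annulus_mask(h, w, center, r_endo, r_epi):
    # Toy stand-in for a myocardial mask: the ring between an
    # endocardial (inner) and epicardial (outer) contour.
    yy, xx = np.mgrid[0:h, 0:w]
    d = np.hypot(yy - center[0], xx - center[1])
    return (d >= r_endo) & (d <= r_epi)

def myocardial_patch_tokens(frame, mask, patch=8, min_overlap=0.25):
    # Keep only patches whose overlap with the myocardial mask exceeds
    # min_overlap; return flattened patches and their grid positions.
    h, w = frame.shape
    tokens, index = [], []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            m = mask[i:i + patch, j:j + patch]
            if m.mean() >= min_overlap:
                tokens.append(frame[i:i + patch, j:j + patch].ravel())
                index.append((i // patch, j // patch))
    return np.stack(tokens), index

frame = np.random.rand(64, 64).astype(np.float32)
mask = annulus_mask(64, 64, (32, 32), r_endo=10, r_epi=24)
tokens, index = myocardial_patch_tokens(frame, mask)
print(tokens.shape, len(index))  # far fewer tokens than the full 8x8 grid of 64
```

In the full model, the kept patches (and the corresponding deforming myocardial points) would be embedded as transformer input tokens, so background patches never enter the network.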
Results: Our method achieves significantly higher CA classification accuracy than full-video Transformers and, crucially, enables dynamic, anatomy-aware, and spatially grounded model interpretation. It thus delivers both improved diagnostic performance and greater clinical trustworthiness through interpretable, myocardium-specific reasoning.
📝 Abstract
Cardiac amyloidosis (CA) is a rare cardiomyopathy that produces characteristic abnormalities in clinical measurements derived from echocardiograms, such as reduced global longitudinal strain of the myocardium. An alternative approach to detecting CA is to apply video classification models, such as convolutional neural networks, directly to echocardiographic clips. These models process entire video clips, but provide no assurance that classification is based on clinically relevant features known to be associated with CA. Another paradigm for disease classification is to apply models to quantitative features such as strain, ensuring that the classification relates to clinically relevant features. Drawing inspiration from this approach, we explicitly constrain a transformer model to the anatomical region where many known CA abnormalities occur: the myocardium, which we embed as a set of deforming points and corresponding sampled image patches into input tokens. We show that our anatomical constraint can also be applied to the popular masked autoencoder (MAE) self-supervised pre-training, where we propose to mask and reconstruct only anatomical patches. By constraining both the transformer and the pre-training task to the myocardium, where CA imaging features are localized, we achieve increased performance on a CA classification task compared to full video transformers. Our model provides an explicit guarantee that the classification is focused only on anatomically relevant regions of the echocardiogram, and enables us to visualize transformer attention scores over the deforming myocardium.
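The anatomy-constrained MAE pre-training described in the abstract can be sketched as an MAE-style random split applied only to the anatomical (myocardial) tokens, rather than to the full patch grid. The function name, mask ratio, and seeding below are assumptions for illustration, not the authors' code.

```python
import numpy as np

def anatomical_mae_split(num_anat_tokens, mask_ratio=0.75, rng=None):
    # MAE-style random split restricted to anatomical tokens:
    # hide `mask_ratio` of them for reconstruction, keep the rest
    # visible to the encoder. Background tokens never participate.
    rng = rng if rng is not None else np.random.default_rng(0)
    perm = rng.permutation(num_anat_tokens)
    n_keep = int(num_anat_tokens * (1 - mask_ratio))
    visible = np.sort(perm[:n_keep])
    hidden = np.sort(perm[n_keep:])
    return visible, hidden

visible, hidden = anatomical_mae_split(20, mask_ratio=0.75)
print(len(visible), len(hidden))  # 5 15
```

The reconstruction loss would then be computed only on the hidden myocardial patches, so the pre-training signal comes entirely from pathologically relevant tissue.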