🤖 AI Summary
Medical imaging models often rely on opaque, holistic embeddings, which leads to semantic entanglement, poor interpretability, and limited generalization. To address this, we propose the Organ-Wise Tokenization (OWT) framework, an organ-level tokenization approach that explicitly disentangles an input image into separable token groups, each corresponding to an anatomical organ. OWT introduces an organ-aware token grouping mechanism and a Token Group-based Reconstruction (TGR) training paradigm, integrating organ-aware masked reconstruction, contrastive semantic alignment, and joint CT/MRI multimodal optimization. Evaluated across multiple benchmarks, OWT achieves strong image reconstruction and organ segmentation performance. It also enables organ-controllable semantic generation and cross-modal semantic retrieval, applications that are out of reach for standard holistic embedding methods. Compared with conventional embedding methods, OWT offers better interpretability and generalization to downstream tasks, positioning it as a foundation for anatomy-aware representation learning in medical imaging.
📝 Abstract
Recent advances in representation learning often rely on holistic, black-box embeddings that entangle multiple semantic components, limiting interpretability and generalization. These issues are especially critical in medical imaging. To address these limitations, we propose an Organ-Wise Tokenization (OWT) framework with a Token Group-based Reconstruction (TGR) training paradigm. Unlike conventional approaches that produce holistic features, OWT explicitly disentangles an image into separable token groups, each corresponding to a distinct organ or semantic entity. Our design ensures that each token group encapsulates organ-specific information, boosting interpretability, generalization, and efficiency while allowing fine-grained control in downstream tasks. Experiments on CT and MRI datasets show that OWT not only achieves strong image reconstruction and segmentation performance, but also enables novel semantic-level generation and retrieval applications that are out of reach for standard holistic embedding methods. These findings underscore the potential of OWT as a foundational framework for semantically disentangled representation learning, offering broad scalability and applicability to real-world medical imaging scenarios and beyond.
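To make the idea of separable token groups more concrete, here is a minimal Python sketch. It is not the authors' implementation: the group layout, the organ names, the token dimensions, and the helper names `split_into_groups`, `keep_groups`, and `group_similarity` are all hypothetical, chosen only to illustrate how a representation partitioned by organ could support selecting one organ's tokens or comparing two scans on a single organ.

```python
import math

# Hypothetical layout (an assumption, not from the paper):
# which token indices belong to which organ group.
GROUP_LAYOUT = {
    "liver": range(0, 2),
    "kidney": range(2, 4),
    "spleen": range(4, 6),
}

def split_into_groups(tokens, layout=GROUP_LAYOUT):
    """Partition a flat list of token vectors into named organ groups."""
    return {organ: [tokens[i] for i in idx] for organ, idx in layout.items()}

def keep_groups(tokens, organs, layout=GROUP_LAYOUT):
    """Zero out every token that lies outside the selected organ groups."""
    kept = {i for organ in organs for i in layout[organ]}
    return [tok if i in kept else [0.0] * len(tok) for i, tok in enumerate(tokens)]

def group_similarity(tokens_a, tokens_b, organ, layout=GROUP_LAYOUT):
    """Cosine similarity computed on a single organ's token group only."""
    a = [x for i in layout[organ] for x in tokens_a[i]]
    b = [x for i in layout[organ] for x in tokens_b[i]]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy token sequences standing in for two encoded scans (made-up values).
scan_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 3.0], [1.0, 2.0]]
scan_b = [[1.0, 0.0], [0.0, 1.0], [0.0, 5.0], [0.0, 1.0], [4.0, 0.0], [2.0, -1.0]]

groups_a = split_into_groups(scan_a)          # organ name -> list of token vectors
liver_only = keep_groups(scan_a, ["liver"])   # input for a liver-only decode
liver_sim = group_similarity(scan_a, scan_b, "liver")
spleen_sim = group_similarity(scan_a, scan_b, "spleen")
```

In this toy setup the two scans share identical liver tokens but differ in the spleen group, so a per-organ comparison can tell them apart where a single holistic embedding would blend the two signals. A real system would presumably pass the selected groups to a learned decoder; that part is omitted here because the abstract does not describe it.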