🤖 AI Summary
Current medical imaging research predominantly relies on voxel-level masks, which cannot express most of the structured semantics found in radiology reports. This work defines three core components of medical image parsing (entities, attributes, and relationships) and argues for a structured output that satisfies three properties in order: decision, reconstruction, and prediction. The proposed representation combines named entities, attribute descriptions, and spatial and temporal relationships, with explicit emphasis on mutual consistency and closure. An audit of 11 representative systems finds that entities are largely solved while attributes, relationships, and closure remain near-empty, which underscores the need to move models from measurement toward explanation.
📝 Abstract
Medical imaging research has spent a decade getting very good at one thing: producing per-voxel masks. Masks tell us size, volume, and location, and a decade of clinical infrastructure rests on those outputs. Yet the report a radiologist writes contains almost nothing a mask can express. We argue that medical imaging research should adopt medical image parsing as its central output: a structured representation in which entities, attributes, and relationships are emitted together and mutually consistent. Entities are the named structures and findings, present or absent. Attributes describe those entities, capturing things like margin regularity, enhancement pattern, or severity grade. Relationships connect them, naming where one structure sits relative to another, what abuts what, and what has changed since the prior scan. A good parse satisfies three properties, in order: (1) decision (the parse names the right things in the current image), (2) reconstruction (its content is rich enough to regenerate that image), and (3) prediction (its content is rich enough to forecast how the patient state will evolve). Quantitative measurements are derived from this content; they are not predicted alongside it. To test how close the field is to producing such an output, we audit eleven representative systems against the three parsing primitives plus closure. None emits a well-formed parse. Entities are largely solved. Attributes, relationships, and closure remain near-empty. The path forward is not a new architecture. It is a commitment to a richer output, and to training signals that reward it. Segmentation taught models to measure. Parsing asks them to explain.
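To make the shape of such an output concrete, here is a minimal sketch of a parse as a data structure. It is an illustration under stated assumptions, not the authors' schema: the class names (`Entity`, `Attribute`, `Relationship`, `Parse`), the field names, and the example values are all hypothetical. The `inconsistencies` check covers only one simple notion of mutual consistency, namely that every attribute and relationship refers to a declared entity; it is not the abstract's "closure" property, which is not defined here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    name: str      # named structure or finding (hypothetical example: "lesion")
    present: bool  # entities may be asserted present or absent


@dataclass(frozen=True)
class Attribute:
    entity: str    # name of the entity being described
    key: str       # e.g. "margin", "enhancement", "severity"
    value: str


@dataclass(frozen=True)
class Relationship:
    subject: str
    predicate: str  # e.g. "located_in", "abuts", "larger_than_prior"
    object: str


@dataclass
class Parse:
    entities: list[Entity] = field(default_factory=list)
    attributes: list[Attribute] = field(default_factory=list)
    relationships: list[Relationship] = field(default_factory=list)

    def inconsistencies(self) -> list[str]:
        """Return references to entities that were never declared.

        Assumption: "mutually consistent" is read here only as referential
        consistency between the three components.
        """
        declared = {e.name for e in self.entities}
        problems: list[str] = []
        for a in self.attributes:
            if a.entity not in declared:
                problems.append(
                    f"attribute {a.key!r} refers to undeclared entity {a.entity!r}"
                )
        for r in self.relationships:
            for end in (r.subject, r.object):
                if end not in declared:
                    problems.append(
                        f"relationship {r.predicate!r} refers to undeclared entity {end!r}"
                    )
        return problems


# Hypothetical example: one finding, one attribute, one spatial relationship.
parse = Parse(
    entities=[Entity("liver", present=True), Entity("lesion", present=True)],
    attributes=[Attribute("lesion", "margin", "irregular")],
    relationships=[Relationship("lesion", "located_in", "liver")],
)
print(parse.inconsistencies())
```

Keeping the three components in one object is what lets a check like this run at all: a system that emits only entities, or emits attributes and relationships separately, has nothing to be consistent with.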