🤖 AI Summary
This work proposes the first formal verification framework for clinical reasoning in vision-language models (VLMs) that generate radiology reports. Conventional evaluation metrics fail to detect logical inconsistencies such as diagnoses lacking perceptual grounding or reports omitting necessary inferences. To address this, the framework uses a neuro-symbolic system to automatically formalize free-text imaging findings into structured logical propositions, which are then checked against a clinical knowledge base with an SMT solver (Z3) to verify logical entailment. The approach identifies reasoning failures, such as overly conservative observations and unsupported hallucinations, that evade traditional metrics, while providing provable post-hoc correctness guarantees. Evaluations of seven VLMs across five chest X-ray datasets show that the framework significantly improves diagnostic precision and plausibility, thereby increasing the trustworthiness of generative clinical assistants.
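The core check is classical logical entailment: a diagnostic claim is supported if the knowledge base together with the extracted findings makes the claim's negation unsatisfiable, which is exactly what the paper delegates to Z3. The sketch below illustrates that semantics with a brute-force truth-table check in plain Python; the atom names and the two toy rules are illustrative assumptions, not the paper's actual clinical knowledge base.

```python
from itertools import product

# Propositional atoms autoformalized from a generated report (names are
# hypothetical examples, not the framework's real vocabulary).
ATOMS = ["consolidation", "opacity", "pneumonia"]

def kb_holds(v):
    # Toy knowledge base: consolidation -> opacity;
    # (consolidation AND opacity) -> pneumonia.
    return ((not v["consolidation"] or v["opacity"]) and
            (not (v["consolidation"] and v["opacity"]) or v["pneumonia"]))

def entailed(findings, claim):
    """KB AND findings entails claim iff no valuation satisfies
    KB AND findings AND NOT claim. An SMT solver such as Z3 decides
    this without enumerating valuations; the loop here only mirrors
    the semantics."""
    for bits in product([False, True], repeat=len(ATOMS)):
        v = dict(zip(ATOMS, bits))
        if kb_holds(v) and findings(v) and not claim(v):
            return False  # countermodel found: claim is not entailed
    return True

# The report's perceptual finding: consolidation is present.
findings = lambda v: v["consolidation"]
print(entailed(findings, lambda v: v["pneumonia"]))    # True: entailed diagnosis
print(entailed(findings, lambda v: not v["opacity"]))  # False: unsupported claim
```

A claim flagged `False` here corresponds to a hallucinated impression (no entailment from the findings), while a knowledge-base consequence absent from the report would be an omission.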
📝 Abstract
Vision-language models (VLMs) show promise in drafting radiology reports, yet they frequently suffer from logical inconsistencies, generating diagnostic impressions unsupported by their own perceptual findings or missing logically entailed conclusions. Standard lexical metrics heavily penalize clinical paraphrasing and fail to capture these deductive failures in reference-free settings. Toward guarantees for clinical reasoning, we introduce a neurosymbolic verification framework that deterministically audits the internal consistency of VLM-generated reports. Our pipeline autoformalizes free-text radiographic findings into structured propositional evidence, utilizing an SMT solver (Z3) and a clinical knowledge base to verify whether each diagnostic claim is mathematically entailed, hallucinated, or omitted. Evaluating seven VLMs across five chest X-ray benchmarks, our verifier exposes distinct reasoning failure modes, such as conservative observation and stochastic hallucination, that remain invisible to traditional metrics. On labeled datasets, enforcing solver-backed entailment acts as a rigorous post-hoc guarantee, systematically eliminating unsupported hallucinations to significantly increase diagnostic soundness and precision in generative clinical assistants.