ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

📅 2026-04-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses inefficiency in chest X-ray report generation: conventional autoregressive models suffer from high inference latency, while diffusion-based methods either require slow multi-step denoising or lose textual coherence in single-step generation because of mean-field bias. The authors propose ECHO, an efficient diffusion-based vision-language model that accelerates inference by generating each block of the report in a single denoising step. ECHO introduces a Direct Conditional Distillation (DCD) framework that uses on-policy diffusion trajectories to construct unfactorized supervision signals, which capture joint token dependencies. A Response-Asymmetric Diffusion (RAD) training strategy further improves training efficiency. Experiments show that ECHO improves RaTE and SemScore by 64.33% and 60.58%, respectively, over state-of-the-art autoregressive methods, with an 8× inference speedup and no loss of clinical accuracy.
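To make the latency argument concrete, the sketch below counts sequential model forward passes for three decoding styles. It is an illustration only: the function names, the report length of 96 tokens, the block size of 4, and the 4 denoising steps are all hypothetical values chosen for the arithmetic, not settings from the paper, and pass counts are not wall-clock time, so the result should not be read as a reproduction of the reported 8× speedup.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """One forward pass per generated token (sequential decoding)."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, block_size: int, steps_per_block: int) -> int:
    """Blocks are decoded one after another; each block takes a fixed number of denoising steps."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * steps_per_block


# Hypothetical settings, chosen only to illustrate the counting.
NUM_TOKENS = 96
BLOCK_SIZE = 4

AR_PASSES = autoregressive_passes(NUM_TOKENS)
MULTI_STEP_PASSES = block_diffusion_passes(NUM_TOKENS, BLOCK_SIZE, steps_per_block=4)
ONE_STEP_PASSES = block_diffusion_passes(NUM_TOKENS, BLOCK_SIZE, steps_per_block=1)

print(AR_PASSES, MULTI_STEP_PASSES, ONE_STEP_PASSES)
```

Under these assumed numbers, multi-step block diffusion needs as many passes as autoregressive decoding, while one step per block reduces the count by a factor equal to the block size.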

📝 Abstract
Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising into a single step could further reduce latency, but it often degrades textual coherence because of the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58%, respectively, while achieving an 8× inference speedup without compromising clinical accuracy.
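The mean-field bias mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's method: it takes a hypothetical two-token distribution with only two coherent phrases, replaces it with the product of its per-token marginals (which is what a token-factorized one-step sampler would draw from), and measures how much probability lands on phrases the joint distribution never produces. The phrases, probabilities, and helper names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def marginals(joint):
    """Per-position token marginals of a joint distribution over fixed-length sequences."""
    length = len(next(iter(joint)))
    margs = [defaultdict(float) for _ in range(length)]
    for seq, prob in joint.items():
        for pos, tok in enumerate(seq):
            margs[pos][tok] += prob
    return margs


def factorized(margs):
    """Mean-field approximation: the product of independent per-position marginals."""
    approx = {}
    for combo in product(*(m.items() for m in margs)):
        seq = tuple(tok for tok, _ in combo)
        prob = 1.0
        for _, q in combo:
            prob *= q
        approx[seq] = prob
    return approx


# Hypothetical joint distribution: only two coherent phrases have any mass.
JOINT = {
    ("pleural", "effusion"): 0.5,
    ("cardiac", "enlargement"): 0.5,
}

MEAN_FIELD = factorized(marginals(JOINT))

# Probability the factorized sampler assigns to phrases the joint never produces,
# such as ("pleural", "enlargement") or ("cardiac", "effusion").
INCOHERENT_MASS = sum(p for seq, p in MEAN_FIELD.items() if seq not in JOINT)

print(INCOHERENT_MASS)
```

In this toy case each token's marginal is exactly right, yet half of the factorized probability mass falls on mismatched token pairs. This is the kind of coherence loss that supervision encoding joint token dependencies is meant to reduce.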
Problem

Research questions and friction points this paper is trying to address.

Chest X-ray report generation
diffusion models
inference latency
textual coherence
one-step generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

one-step diffusion
Direct Conditional Distillation
Response-Asymmetric Diffusion
chest X-ray report generation
vision-language model