Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing token-level distillation approaches perform poorly on long-form medical report generation because they weight all tokens uniformly, overlooking the pathology keywords and the end-of-sequence (EOS) token that determine diagnostic content and termination. To address this, the work proposes DIVE, a framework that keeps the vision-language model backbone frozen and introduces a key-token-weighted cross-entropy loss to strengthen supervision over pathology terms and the EOS token. DIVE also replaces fixed open-loop residuals with a state-aware dynamic adapter that adjusts the injected signal as decoding drifts. On MIMIC-CXR and CheXpert Plus, DIVE achieves the best BLEU-4, ROUGE-L, and RadGraph F1 in all dataset-backbone settings while remaining competitive on CheXbert F1.
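The key-token-weighted cross-entropy can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `decisive_token_ce`, the weights `w_decisive` and `w_eos`, and the normalization by the sum of weights are all assumptions chosen for illustration, and the set of pathology token IDs is assumed to be given.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def decisive_token_ce(logits_per_step, targets, decisive_ids, eos_id,
                      w_decisive=3.0, w_eos=5.0):
    """Weighted mean of per-token negative log-likelihoods.

    Tokens in `decisive_ids` and the EOS token receive larger weights
    (hypothetical values); every other token has weight 1.0.
    Normalizing by the weight sum is an assumption of this sketch.
    """
    total, weight_sum = 0.0, 0.0
    for logits, tgt in zip(logits_per_step, targets):
        if tgt == eos_id:
            w = w_eos
        elif tgt in decisive_ids:
            w = w_decisive
        else:
            w = 1.0
        nll = -log_softmax(logits)[tgt]
        total += w * nll
        weight_sum += w
    return total / weight_sum


# Toy sequence: a template token, a poorly predicted pathology token, then EOS.
logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
targets = [0, 1, 3]
uniform = decisive_token_ce(logits, targets, set(), eos_id=3,
                            w_decisive=1.0, w_eos=1.0)
weighted = decisive_token_ce(logits, targets, {1}, eos_id=3,
                             w_decisive=3.0, w_eos=1.0)
```

In this toy case the pathology token is the one the model predicts worst, so upweighting it raises the loss relative to uniform cross-entropy and a larger share of the gradient would come from that token.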

📝 Abstract

Distilling demonstration effects into hidden-space interventions offers a lightweight alternative to full finetuning. However, existing multimodal variants are mostly evaluated on short-form tasks, where outputs end after a few tokens. Extending these methods to long-form generation exposes a fundamental yet underexamined limitation: token-level distillation implicitly treats all output tokens as equally informative, but long-form outputs are dominated by high-frequency template and grammatical tokens, while the tokens that actually determine output quality are sparsely distributed. In medical report generation (MRG), two such decisive tokens stand out: pathology-related tokens that determine diagnostic content, and the end-of-sequence (EOS) event that determines termination. Both receive insufficient supervision under uniform cross-entropy, and autoregressive decoding further compounds the problem by drifting away from teacher-forced trajectories. We propose DIVE, a frozen-backbone distillation framework that addresses long-form report generation through two complementary mechanisms matched to these failures. Decisive-token supervision restores supervision balance by upweighting the cross-entropy contribution of pathology-related tokens and the EOS event, ensuring that content fidelity and termination are learned during training rather than imposed at decoding time. State-conditioned dynamic steering replaces fixed open-loop residuals with hidden-state-dependent adapters, allowing the injected signal to adapt as decoding drifts. Experiments on MIMIC-CXR and CheXpert Plus with two medical VLM backbones show that DIVE consistently ranks among the strongest methods across lexical and clinical-proxy metrics. Our method achieves the best BLEU-4, ROUGE-L, and RadGraph F1 in all dataset-backbone settings, while remaining competitive on coarse label-level CheXbert F1.
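The contrast between a fixed open-loop residual and a hidden-state-dependent adapter can be sketched as follows. The abstract does not specify the adapter architecture, so the low-rank bottleneck, the sigmoid gate, and the names `fixed_steer` and `DynamicSteer` are assumptions made purely for illustration, with random weights standing in for trained ones.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def fixed_steer(h, v):
    """Open-loop steering: the same vector v is added whatever h is."""
    return [hi + vi for hi, vi in zip(h, v)]


class DynamicSteer:
    """Hypothetical state-conditioned adapter: h + gate(h) * Up(tanh(Down h))."""

    def __init__(self, dim, rank, seed=0):
        rng = random.Random(seed)
        # Randomly initialised weights stand in for trained parameters.
        self.down = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(rank)]
        self.up = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(dim)]
        self.gate_w = [rng.gauss(0, 0.1) for _ in range(dim)]

    def delta(self, h):
        """Correction that depends on the current hidden state h."""
        z = [math.tanh(a) for a in matvec(self.down, h)]
        score = sum(w * hi for w, hi in zip(self.gate_w, h))
        gate = 1.0 / (1.0 + math.exp(-score))  # scalar in (0, 1)
        return [gate * u for u in matvec(self.up, z)]

    def __call__(self, h):
        return [hi + di for hi, di in zip(h, self.delta(h))]


steer = DynamicSteer(dim=4, rank=2)
h_a = [1.0, 0.0, 0.0, 0.0]
h_b = [0.0, 1.0, 0.0, 0.0]
delta_a, delta_b = steer.delta(h_a), steer.delta(h_b)
```

A fixed residual adds the same offset at every step, whereas here `delta_a` and `delta_b` differ because the correction is computed from the hidden state itself. That dependence is what would let the injected signal change as autoregressive decoding drifts away from teacher-forced trajectories.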

Problem

Research questions and friction points this paper is trying to address.

long-form medical report generation

token-level distillation

decisive tokens

pathology-related tokens

end-of-sequence event

Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic distillation

decisive-token supervision

long-form medical report generation