DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation

📅 2026-05-15

🤖 AI Summary
This work addresses an inefficiency in conventional multimodal distillation: up to 69% of the prompts in standard chart and document reasoning datasets are zero-delta, meaning the teacher and student already produce the same answer distribution, so student performance saturates quickly regardless of data scale. The authors propose judging prompt utility by the divergence (Δ) between the teacher's and student's answer distributions. Using this metric, they design a staged synthesis pipeline that takes existing datasets as seeds and generates high-divergence prompts targeting student failure modes. The result is DeltaPrompts, a dataset of 200,000 synthetic reasoning problems. It is evaluated in three settings (on-policy distillation with the target teacher-student pair, transfer to a new model family without regenerating the data, and off-policy fine-tuning of a non-reasoning model) and yields up to 15% relative improvement, averaged over 10 benchmarks spanning chart, document, and perception-centric reasoning, even on top of a highly optimized student model.
📝 Abstract
Distillation enables compact Vision-Language Models (VLMs) to obtain strong reasoning capabilities, yet the prompts driving this process are typically chosen via simple heuristics or aggregated from off-the-shelf datasets. We reveal a critical inefficiency in this approach: up to 69% of the prompts in standard chart / document reasoning datasets are effectively zero-delta, meaning the teacher and student already induce the exact same answer distribution. Training on these prompts provides minimal learning signal, causing student improvement to rapidly saturate regardless of data scale. To escape the zero-delta trap, we return to first principles: distillation fundamentally minimizes distributional divergence, and thus a prompt is valuable only if it exposes a functional capability gap between the teacher and student. We quantify this gap through answer divergence ($\Delta$), demonstrating that non-zero divergence is critical for effective scaling. Building on this insight, we propose a staged synthesis pipeline that repurposes existing datasets as seeds, actively targeting student failure modes to produce better prompts. The result is DeltaPrompts, a diverse dataset of 200k synthetic, high-divergence reasoning problems. We evaluate DeltaPrompts across three distinct settings: on-policy distillation with the target teacher-student pair, transfer to a novel model family without regenerating the data, and off-policy fine-tuning of a non-reasoning model. Across all scenarios, DeltaPrompts drives substantial gains, yielding up to 15% relative improvement even on top of a highly optimized reasoning model (e.g., Qwen3-VL-8B-Thinking), averaged over 10 benchmarks spanning chart, document, and perception-centric reasoning.
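To make the zero-delta idea concrete, here is a minimal sketch of how an answer divergence could be estimated from sampled answers. The abstract does not give the exact definition of $\Delta$, so this sketch assumes total variation distance between the empirical teacher and student answer distributions. The function names, the string normalization, and the `eps` threshold are all hypothetical illustrations, not the authors' implementation.

```python
from collections import Counter


def answer_distribution(answers):
    """Empirical distribution over normalized final-answer strings."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Assumption: answers are compared after trimming and lowercasing.
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return {ans: n / total for ans, n in counts.items()}


def answer_divergence(teacher_answers, student_answers):
    """Assumed Delta: total variation distance, a value in [0, 1]."""
    p = answer_distribution(teacher_answers)
    q = answer_distribution(student_answers)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)


def split_by_delta(samples, eps=1e-9):
    """Partition (prompt, teacher_answers, student_answers) triples.

    Returns (zero_delta_prompts, nonzero_delta_prompts).
    """
    zero, nonzero = [], []
    for prompt, t_ans, s_ans in samples:
        delta = answer_divergence(t_ans, s_ans)
        (zero if delta <= eps else nonzero).append(prompt)
    return zero, nonzero


if __name__ == "__main__":
    samples = [
        ("q1", ["42", "42", "42", "42"], ["42", "42", "42", "42"]),
        ("q2", ["42", "42", "42", "42"], ["42", "42", "17", "17"]),
    ]
    zero, nonzero = split_by_delta(samples)
    print("zero-delta:", zero)
    print("non-zero-delta:", nonzero)
```

Under this assumed definition, a prompt on which the student samples exactly the teacher's answers gets a divergence of 0 and would be treated as zero-delta, while any disagreement in the sampled answers pushes the value above 0. A KL- or Jensen-Shannon-based variant would be an equally plausible reading of "distributional divergence".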
Problem

Research questions and friction points this paper is trying to address.

multimodal distillation
zero-delta prompts
Vision-Language Models
answer distribution divergence
prompt efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal distillation
zero-delta trap
answer divergence
synthetic prompt generation
DeltaPrompts