DiffuMask: Diffusion Language Model for Token-level Prompt Pruning

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of lengthy prompts in large language models, which rely on in-context learning and chain-of-thought prompting to enhance reasoning. Existing pruning-based compression methods remove tokens sequentially and are therefore slow; this study instead introduces a diffusion-based approach to prompt compression. The proposed framework performs iterative mask prediction within a parallel pruning architecture, integrating hierarchical signals at both the sample and token levels to enable efficient and controllable compression. Experiments demonstrate that the method maintains or even improves reasoning accuracy while reducing prompt length by up to 80%, with strong performance across in-domain, out-of-domain, and cross-model settings.
📝 Abstract
In-Context Learning and Chain-of-Thought prompting improve reasoning in large language models (LLMs), but they typically come at the cost of longer, more expensive prompts that may contain redundant information. Prompt compression based on pruning offers a practical solution, yet existing methods rely on sequential token removal, which is computationally intensive. We present DiffuMask, a diffusion-based framework that integrates hierarchical shot-level and token-level pruning signals and enables rapid, parallel prompt pruning via iterative mask prediction. DiffuMask substantially accelerates compression by masking multiple tokens in each denoising step. It offers tunable control over retained content, preserving essential reasoning context while achieving up to 80% prompt length reduction. Meanwhile, it maintains or improves accuracy across in-domain, out-of-domain, and cross-model settings. Our results show that DiffuMask provides a generalizable and controllable framework for prompt compression, facilitating faster and more reliable in-context reasoning in LLMs.
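The core idea of iterative parallel pruning can be illustrated with a minimal sketch. This is not the paper's implementation: `diffumask_prune`, `score_fn`, and the toy length-based scorer are hypothetical stand-ins for the learned mask predictor, shown only to contrast parallel multi-token removal per step with one-token-at-a-time sequential pruning.

```python
def diffumask_prune(tokens, score_fn, target_ratio=0.2, steps=4):
    """Sketch of iterative mask prediction: each step prunes a batch of
    the lowest-scoring tokens in parallel, until roughly target_ratio
    of the original prompt remains (or the step budget runs out)."""
    keep = list(tokens)
    target_len = max(1, int(len(tokens) * target_ratio))
    for _ in range(steps):
        if len(keep) <= target_len:
            break
        # drop several tokens at once per step, not one at a time
        n_drop = max(1, (len(keep) - target_len) // 2)
        scores = score_fn(keep)  # per-token importance scores
        ranked = sorted(range(len(keep)), key=lambda i: scores[i])
        drop = set(ranked[:n_drop])
        keep = [t for i, t in enumerate(keep) if i not in drop]
    return keep

# toy importance: longer words score higher (stand-in for a learned predictor)
toy_score = lambda toks: [len(t) for t in toks]
prompt = "please carefully think step by step about the following question".split()
pruned = diffumask_prune(prompt, toy_score, target_ratio=0.4, steps=3)
```

Because each denoising step removes a whole batch of tokens, the number of model calls grows with the step budget rather than with prompt length, which is the efficiency argument the abstract makes against sequential token removal.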
Problem

Research questions and friction points this paper is trying to address.

prompt compression
in-context learning
chain-of-thought
token pruning
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion model
prompt pruning
in-context learning
token-level compression
mask prediction