DiffuMask: Diffusion Language Model for Token-level Prompt Pruning

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of lengthy prompts in large language models, which rely on in-context learning and chain-of-thought prompting to enhance reasoning. Existing pruning-based compression methods remove tokens sequentially and are therefore slow; this study instead introduces a diffusion-based approach to prompt compression. The proposed framework performs iterative mask prediction within a parallel pruning architecture, integrating hierarchical signals at both the sample and token levels to enable efficient and controllable compression. Experiments demonstrate that the method maintains or even improves reasoning accuracy while reducing prompt length by up to 80%, with strong performance across in-domain, out-of-domain, and cross-model settings.
📝 Abstract
In-Context Learning and Chain-of-Thought prompting improve reasoning in large language models (LLMs), but they typically come at the cost of longer, more expensive prompts that may contain redundant information. Prompt compression based on pruning offers a practical solution, yet existing methods rely on sequential token removal, which is computationally intensive. We present DiffuMask, a diffusion-based framework that integrates hierarchical shot-level and token-level pruning signals and enables rapid, parallel prompt pruning via iterative mask prediction. DiffuMask substantially accelerates compression by masking multiple tokens in each denoising step. It offers tunable control over retained content, preserving essential reasoning context while achieving up to 80% prompt length reduction. Meanwhile, it maintains or improves accuracy across in-domain, out-of-domain, and cross-model settings. Our results show that DiffuMask provides a generalizable and controllable framework for prompt compression, facilitating faster and more reliable in-context reasoning in LLMs.
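The core idea of iterative parallel pruning can be illustrated with a minimal sketch. This is not the paper's implementation: `diffumask_prune`, `score_fn`, and the toy length-based scorer are hypothetical stand-ins for the learned mask predictor, shown only to contrast parallel multi-token removal per step with one-token-at-a-time sequential pruning.

```python
def diffumask_prune(tokens, score_fn, target_ratio=0.2, steps=4):
    """Sketch of iterative mask prediction: each step prunes a batch of
    the lowest-scoring tokens in parallel, until roughly target_ratio
    of the original prompt remains (or the step budget runs out)."""
    keep = list(tokens)
    target_len = max(1, int(len(tokens) * target_ratio))
    for _ in range(steps):
        if len(keep) <= target_len:
            break
        # drop several tokens at once per step, not one at a time
        n_drop = max(1, (len(keep) - target_len) // 2)
        scores = score_fn(keep)  # per-token importance scores
        ranked = sorted(range(len(keep)), key=lambda i: scores[i])
        drop = set(ranked[:n_drop])
        keep = [t for i, t in enumerate(keep) if i not in drop]
    return keep

# toy importance: longer words score higher (stand-in for a learned predictor)
toy_score = lambda toks: [len(t) for t in toks]
prompt = "please carefully think step by step about the following question".split()
pruned = diffumask_prune(prompt, toy_score, target_ratio=0.4, steps=3)
```

Because each denoising step removes a whole batch of tokens, the number of model calls grows with the step budget rather than with prompt length, which is the efficiency argument the abstract makes against sequential token removal.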
Problem

Research questions and friction points this paper is trying to address.

prompt compression
in-context learning
chain-of-thought
token pruning
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion model
prompt pruning
in-context learning
token-level compression
mask prediction