🤖 AI Summary
Problem: Lengthy system prompts in large language models (LLMs) incur high inference latency, substantial computational overhead, and inefficient context-space utilization. Method: We propose a single-token prompt compression method that requires neither internal model access nor annotated data. Our approach employs a lightweight three-stage training framework integrating reconstruction-based content encoding and behavioral distillation to self-supervise the semantic and functional condensation of system prompts into a single learnable token. Contribution/Results: Evaluated on three benchmark datasets, our method achieves up to 3000× prompt-length compression, significantly reducing inference cost while retaining ~98% of original task performance. To the best of our knowledge, this is the first work achieving high-fidelity, single-token system prompt compression under zero-shot, black-box conditions—establishing a novel paradigm for efficient LLM system prompt deployment.
📝 Abstract
Carefully engineered system prompts play a critical role in guiding the behavior of LLM agents, but their considerable length introduces significant drawbacks, including increased inference latency, higher computational cost, and reduced effective context length. This raises the question of whether such lengthy prompts can be replaced by a drastically reduced number of tokens while preserving their behavioral effect on downstream tasks. To enable this, we propose a lightweight three-stage training framework that learns a single prompt-specific Behavior-Equivalent token ([BE]). The framework first trains [BE] to encode the natural-language content of the original system prompt via reconstruction, and then distills the prompt's downstream behavior into this single token. Importantly, our method requires no access to model internals, no auxiliary compression models, and no labeled responses. Empirical evaluations on three datasets show that a single [BE] token achieves up to a 3000× reduction in prompt length, while retaining about 98% of the downstream performance of the original system prompts. This substantially reduces inference cost and leaves almost the entire context window available for user inputs.
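The core idea of the behavioral-distillation stage can be illustrated with a toy sketch: a single learnable vector is optimized so that a frozen model conditioned on it matches the output distribution the same model produces when conditioned on the full prompt. Everything below is a simplified, hypothetical stand-in (a linear "LM head" over mean-pooled embeddings instead of a real LLM; the shapes, learning rate, and loss are illustrative assumptions, not the paper's actual training setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a frozen LM: next-token logits = W @ context_vector.
# (Hypothetical dimensions; the real method trains a [BE] token inside an LLM.)
d, vocab = 16, 10
W = rng.normal(size=(vocab, d))  # frozen "LM head"

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# The "system prompt" is a sequence of 50 frozen token embeddings.
# Teacher: model conditioned on the full prompt (here, its mean embedding).
prompt_emb = rng.normal(size=(50, d))
teacher = softmax(W @ prompt_emb.mean(axis=0))

# Student: the same frozen model conditioned on one learnable [BE] vector.
be = rng.normal(size=d) * 0.01
kl_initial = float(np.sum(teacher * np.log(teacher / softmax(W @ be))))

# Minimize KL(teacher || student) by gradient descent on `be` only.
# Gradient w.r.t. the logits is (student - teacher), so w.r.t. `be` it is
# W^T (student - teacher).
lr = 0.05
for _ in range(5000):
    student = softmax(W @ be)
    be -= lr * W.T @ (student - teacher)

kl_final = float(np.sum(teacher * np.log(teacher / softmax(W @ be))))
```

After training, the 50-embedding prompt has been replaced by one vector that induces (approximately) the same output distribution, which is the behavior-equivalence property the [BE] token targets. The reconstruction stage described in the abstract, which first trains [BE] to regenerate the prompt text, is omitted here for brevity.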