🤖 AI Summary
Problem: Lengthy system prompts in large language models (LLMs) incur high inference latency, substantial computational overhead, and inefficient context-space utilization. Method: We propose a single-token prompt compression method that requires neither internal model access nor annotated data. Our approach employs a lightweight three-stage training framework integrating reconstruction-based content encoding and behavioral distillation to self-supervise the semantic and functional condensation of system prompts into a single learnable token. Contribution/Results: Evaluated on three benchmark datasets, our method achieves up to 3000× prompt-length compression, significantly reducing inference cost while retaining ~98% of original task performance. To the best of our knowledge, this is the first work achieving high-fidelity, single-token system prompt compression under zero-shot, black-box conditions—establishing a novel paradigm for efficient LLM system prompt deployment.
📝 Abstract
Carefully engineered system prompts play a critical role in guiding the behavior of LLM agents, but their considerable length introduces significant drawbacks, including increased inference latency, higher computational cost, and reduced effective context length. This raises the question of whether such lengthy prompts can be replaced by a drastically reduced number of tokens while preserving their behavioral effect on downstream tasks. To enable this, we propose a lightweight three-stage training framework that learns a single prompt-specific Behavior-Equivalent token ([BE]). The framework first trains [BE] to encode the natural-language content of the original system prompt via reconstruction, and then distills the prompt's downstream behavior into this single token. Importantly, our method requires no access to model internals, no auxiliary compression models, and no labeled responses. Empirical evaluations on three datasets show that a single [BE] token achieves up to a 3000× reduction in prompt length, while retaining about 98% of the downstream performance of the original system prompts. This substantially reduces inference cost and leaves almost the entire context window available for user inputs.
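The core idea of the behavioral-distillation stage can be illustrated with a toy sketch: a single learnable vector is optimized so that a frozen model conditioned on it matches the output distribution the same model produces when conditioned on the full prompt. Everything below is a simplified, hypothetical stand-in (a linear "LM head" over mean-pooled embeddings instead of a real LLM; the shapes, learning rate, and loss are illustrative assumptions, not the paper's actual training setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a frozen LM: next-token logits = W @ context_vector.
# (Hypothetical dimensions; the real method trains a [BE] token inside an LLM.)
d, vocab = 16, 10
W = rng.normal(size=(vocab, d))  # frozen "LM head"

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# The "system prompt" is a sequence of 50 frozen token embeddings.
# Teacher: model conditioned on the full prompt (here, its mean embedding).
prompt_emb = rng.normal(size=(50, d))
teacher = softmax(W @ prompt_emb.mean(axis=0))

# Student: the same frozen model conditioned on one learnable [BE] vector.
be = rng.normal(size=d) * 0.01
kl_initial = float(np.sum(teacher * np.log(teacher / softmax(W @ be))))

# Minimize KL(teacher || student) by gradient descent on `be` only.
# Gradient w.r.t. the logits is (student - teacher), so w.r.t. `be` it is
# W^T (student - teacher).
lr = 0.05
for _ in range(5000):
    student = softmax(W @ be)
    be -= lr * W.T @ (student - teacher)

kl_final = float(np.sum(teacher * np.log(teacher / softmax(W @ be))))
```

After training, the 50-embedding prompt has been replaced by one vector that induces (approximately) the same output distribution, which is the behavior-equivalence property the [BE] token targets. The reconstruction stage described in the abstract, which first trains [BE] to regenerate the prompt text, is omitted here for brevity.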