From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost, inference latency, and degraded performance that arise when large language models process long electronic health record (EHR) sequences. The authors propose Medical Token-Pair Encoding (MedTPE), which extends standard tokenization with a dependency-aware token-pair merging strategy tailored to EHR data. By merging frequently co-occurring medical token pairs into composite tokens, MedTPE compresses sequences losslessly while fine-tuning only the embeddings of the new tokens (0.5–1.0% of model parameters) via self-supervised learning. On real-world clinical data, MedTPE reduces input sequence length by up to 31% and lowers inference latency by 34–63%. It maintains or improves performance across four clinical prediction tasks and generalizes to scientific and financial domains and to other languages.

📝 Abstract

By processing electronic health records (EHRs) as natural language sequences, large language models (LLMs) have shown potential in clinical prediction tasks such as mortality prediction and phenotyping. However, longitudinal or high-frequency EHRs often yield excessively long token sequences, which incur high computational costs and can even reduce performance. Existing solutions either add compression modules, which introduces additional inference latency, or remove less important tokens, which risks losing clinical information. To achieve lossless compression of token sequences without additional cost or loss of performance, we propose Medical Token-Pair Encoding (MedTPE), a layered method that extends standard tokenisation for EHR sequences. MedTPE merges frequently co-occurring medical token pairs into composite tokens, providing lossless compression while preserving the computational complexity through a dependency-aware replacement strategy. Only the embeddings of the newly introduced tokens, which account for merely 0.5-1.0% of the LLM's parameters, are fine-tuned via self-supervised learning. Experiments on real-world datasets for two clinical scenarios demonstrate that MedTPE reduces input token length by up to 31% and inference latency by 34-63%, while maintaining or even improving both predictive performance and output format compliance across multiple LLMs and four clinical prediction tasks. Furthermore, MedTPE demonstrates robustness across different input context lengths and generalisability to scientific and financial domains and different languages.
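
The pair-merging idea in the abstract can be illustrated with a minimal sketch. It assumes plain frequency-based merging over token IDs, in the style of byte-pair encoding. It does not model the dependency-aware replacement strategy or the embedding fine-tuning, whose details the abstract does not give. All function names and parameters (`learn_merges`, `compress`, `expand`, `base_vocab_size`, `min_count`) are hypothetical and are not taken from the paper.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(sequences, base_vocab_size, num_merges, min_count=2):
    """Greedily learn up to `num_merges` composite tokens from adjacent-pair counts.

    Assumption: every base token ID is below `base_vocab_size`, so the newly
    allocated composite IDs never collide with existing tokens.
    """
    seqs = [list(s) for s in sequences]
    merges = {}  # composite ID -> (left ID, right ID), kept in learning order
    for _ in range(num_merges):
        counts = Counter()
        for s in seqs:
            counts.update(zip(s, s[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < min_count:
            break
        new_id = base_vocab_size + len(merges)
        merges[new_id] = pair
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges


def compress(seq, merges):
    """Apply the learned merges in the order they were learned."""
    seq = list(seq)
    for new_id, pair in merges.items():
        seq = merge_pair(seq, pair, new_id)
    return seq


def expand(seq, merges):
    """Invert `compress` by recursively expanding composite tokens."""
    out, stack = [], list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in merges:
            left, right = merges[tok]
            stack.extend([right, left])  # left is popped first
        else:
            out.append(tok)
    return out


# Toy usage: integer IDs stand in for tokenised EHR text.
corpus = [[1, 2, 3, 1, 2, 4], [1, 2, 3, 5]]
table = learn_merges(corpus, base_vocab_size=10, num_merges=5)
short = compress(corpus[0], table)
assert expand(short, table) == corpus[0]  # the round trip is lossless
```

Because every composite token maps back to exactly one pair, expansion recovers the original sequence exactly. In this sketch, that is the sense in which the compression is lossless.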

Problem

Research questions and friction points this paper is trying to address.

clinical prediction

electronic health records

token sequence length

computational cost

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt compression

token-pair encoding

large language models