🤖 AI Summary
This work addresses the challenge that small language models struggle to simultaneously achieve high inference speed and strong task performance in high-throughput, low-latency scenarios. To this end, the authors propose TASC, a framework comprising two components: TASC-ft for fine-tuning and TASC-spec for inference. TASC-ft employs iterative task-adaptive vocabulary expansion during fine-tuning, while TASC-spec leverages frequent n-grams from output corpora to enable training-free speculative decoding. Notably, TASC-spec requires no architectural modifications or additional training, and the authors position it as the first method to dynamically integrate task-specific n-grams into the acceleration pipeline. Experiments across multiple low-diversity generation tasks demonstrate that TASC substantially improves inference efficiency while preserving original model performance.
📝 Abstract
Small language models (SLMs) have emerged as efficient alternatives to large language models for task-specific applications. However, they are often deployed in high-volume, low-latency settings where inference efficiency is crucial. We propose TASC (Task-Adaptive Sequence Compression), a framework for SLM acceleration comprising two use cases. For SLM fine-tuning, we propose TASC-ft, which iteratively enriches the tokenizer vocabulary with high-frequency output n-grams and then fine-tunes the model to utilize the expanded vocabulary. For inference time, we propose TASC-spec, a lightweight, training-free speculative decoding method that constructs an n-gram draft model from the task's output corpus, mixing task and context n-gram information. TASC-spec requires no additional training and bypasses draft-target vocabulary alignment constraints. We demonstrate the effectiveness of both methods across multiple low output-variability generation tasks, showing consistent improvements in inference efficiency while maintaining task performance.
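To make the TASC-spec idea concrete, the following is a minimal sketch of a frequent-n-gram draft model: it mines n-grams from a (toy, tokenized) task output corpus and greedily chains the most frequent continuations to propose draft tokens, which a target model would then verify in a standard speculative decoding loop. All function names and parameters here are our own illustrative choices, not the paper's implementation.

```python
from collections import Counter

def build_ngram_table(corpus, n=3):
    """Map each (n-1)-token prefix to its most frequent observed
    continuation in the task's output corpus (illustrative sketch)."""
    counts = Counter()
    for tokens in corpus:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    draft = {}
    # most_common() yields high-count n-grams first, so setdefault
    # keeps the most frequent continuation per prefix.
    for ngram, _ in counts.most_common():
        draft.setdefault(ngram[:-1], ngram[-1])
    return draft

def draft_tokens(context, draft, n=3, max_len=4):
    """Greedily chain n-gram continuations to propose a draft span;
    the target model would accept or reject these tokens."""
    proposed, ctx = [], list(context)
    for _ in range(max_len):
        prefix = tuple(ctx[-(n - 1):])
        if prefix not in draft:
            break
        proposed.append(draft[prefix])
        ctx.append(draft[prefix])
    return proposed

# Toy corpus of tokenized task outputs.
corpus = [
    ["the", "answer", "is", "yes"],
    ["the", "answer", "is", "no"],
    ["the", "answer", "is", "yes"],
]
table = build_ngram_table(corpus, n=3)
print(draft_tokens(["the", "answer"], table, n=3))  # ['is', 'yes']
```

Because the draft side is just a lookup table over the target model's own token IDs, there is no separate draft network to train and no draft-target vocabulary mismatch to reconcile, which is the property the abstract highlights.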