Task-Centric Acceleration of Small-Language Models

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that small language models struggle to simultaneously achieve high inference speed and strong task performance in high-throughput, low-latency scenarios. To this end, the authors propose TASC, a framework comprising two components: TASC-ft for fine-tuning and TASC-spec for inference. TASC-ft employs iterative task-adaptive vocabulary expansion during fine-tuning, while TASC-spec leverages frequent n-grams from output corpora to enable training-free speculative decoding. Notably, TASC-spec requires no architectural modifications or additional training, and the framework is the first to dynamically integrate task-specific n-grams into the acceleration pipeline. Experiments across multiple low-diversity generation tasks demonstrate that TASC substantially improves inference efficiency while preserving original model performance.
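The paper does not publish its implementation here, but the TASC-spec idea described above can be sketched in a minimal, hypothetical form: count frequent n-grams in the task's output corpus, then use the resulting lookup table as a training-free draft model that proposes continuations for the target model to verify. The function names and the greedy most-frequent-continuation policy are illustrative assumptions, not the authors' code.

```python
from collections import Counter, defaultdict

def build_ngram_table(corpus_token_ids, n=3):
    """Count n-grams in the task's output corpus.

    Maps each (n-1)-token prefix to a Counter of observed next tokens,
    so a prefix can be extended greedily into a draft sequence.
    """
    table = defaultdict(Counter)
    for seq in corpus_token_ids:
        for i in range(len(seq) - n + 1):
            prefix = tuple(seq[i:i + n - 1])
            table[prefix][seq[i + n - 1]] += 1
    return table

def propose_draft(context_ids, table, n=3, max_draft=5):
    """Greedily extend the context with the most frequent continuations.

    In standard speculative decoding, the target model would then verify
    these draft tokens in a single forward pass, accepting a prefix of them.
    """
    draft = list(context_ids)
    proposed = []
    for _ in range(max_draft):
        prefix = tuple(draft[-(n - 1):])
        if prefix not in table:
            break  # no stored continuation for this prefix
        next_tok = table[prefix].most_common(1)[0][0]
        draft.append(next_tok)
        proposed.append(next_tok)
    return proposed
```

Because the draft model is a plain lookup table over the target model's own token ids, there is no separate draft network and no draft-target vocabulary alignment to maintain, which matches the training-free property claimed in the summary.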

📝 Abstract
Small language models (SLMs) have emerged as efficient alternatives to large language models for task-specific applications. However, they are often employed in high-volume, low-latency settings, where efficiency is crucial. We propose TASC, Task-Adaptive Sequence Compression, a framework for SLM acceleration comprising two use cases. When performing SLM fine-tuning, we propose TASC-ft, which iteratively enriches the tokenizer vocabulary with high-frequency output n-grams and then fine-tunes the model to utilize the expanded vocabulary. Next, we propose an inference-time method, termed TASC-spec. TASC-spec is a lightweight, training-free speculative decoding method that constructs an n-gram draft model from the task's output corpus, mixing task and context n-gram information. TASC-spec avoids any additional training, while bypassing draft-target vocabulary alignment constraints. We demonstrate the effectiveness of both methods across multiple low output-variability generation tasks. Our methods show consistent improvements in inference efficiency while maintaining task performance.
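The iterative vocabulary enrichment behind TASC-ft can likewise be sketched under stated assumptions: repeatedly merge the most frequent adjacent pair of units in the output corpus into a single new vocabulary entry, BPE-style, so that common phrases compress to one token after fine-tuning. The word-level segmentation, the frequency threshold, and the merge policy below are illustrative choices, not details from the paper.

```python
from collections import Counter

def expand_vocab(output_texts, rounds=3, min_count=2):
    """Iteratively merge the most frequent adjacent pair into one unit.

    Each round finds the most common adjacent pair across all output
    sequences, records the merged phrase as a new vocabulary entry, and
    re-segments the corpus so later rounds can build longer phrases.
    """
    seqs = [t.split() for t in output_texts]
    new_entries = []
    for _ in range(rounds):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), cnt = pairs.most_common(1)[0]
        if cnt < min_count:
            break  # remaining pairs are too rare to be worth a token
        merged = a + " " + b
        new_entries.append(merged)
        # Re-segment: replace each adjacent (a, b) with the merged unit.
        reseg = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            reseg.append(out)
        seqs = reseg
    return new_entries
```

In the full method these new entries would be added to the tokenizer and the model fine-tuned to emit them, shortening output sequences and hence decoding time on low-variability tasks.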
Problem

Research questions and friction points this paper is trying to address.

small language models
inference efficiency
low-latency
task-specific applications
output variability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task-Adaptive Sequence Compression
Small Language Models
Speculative Decoding
Tokenizer Enrichment
Inference Acceleration