🤖 AI Summary
To address the weak generalization of large language models on long-tailed data distributions and their difficulty adapting at inference time to sparse, infrequent use cases, this paper proposes an explicitly learnable token system that jointly models input data characteristics and task provenance during training. Through supervised fine-tuning, it co-optimizes token embeddings, implicit conditional generation, and fine-grained feature classification, enabling automatic token inference and on-demand activation without prompt engineering or in-context examples at inference time. The approach supports fine-grained, controllable generation with added flexibility. Experiments show significant improvements: a 5.7% average lift in win rate for open-ended generation, gains of over 9.1% on long-tail domains, relative improvements of up to 14.1% on underrepresented tasks such as CodeRepair, and absolute improvements of 35.3% on length-sensitive instruction following. This work pioneers the unification of task provenance and data characteristics into a single learnable, structured token representation, bridging the gap between training-time controllability and inference-time adaptability.
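To make the marker idea concrete, the following is a minimal sketch of how such control tokens could be attached to supervised fine-tuning examples and left optional at inference time. The marker names and helper functions are illustrative assumptions, not the paper's actual taxonomy or training code.

```python
# Illustrative sketch only: hypothetical marker names and helpers,
# not the authors' implementation.
from typing import Optional

# Hypothetical taxonomy of data-characteristic and task-provenance markers.
DATA_MARKERS = ["<len:short>", "<len:medium>", "<len:long>"]
TASK_MARKERS = ["<task:code_repair>", "<task:summarization>", "<task:open_ended>"]


def build_training_example(instruction: str,
                           response: str,
                           task_marker: str,
                           data_marker: str) -> str:
    """Prepend explicit control markers so the model learns to condition on them.

    During supervised fine-tuning the marker tokens receive their own learnable
    embeddings, and the model is also trained to emit markers itself, which is
    what makes them optional at inference time.
    """
    return f"{task_marker} {data_marker} {instruction}\n{response}"


def build_inference_prompt(instruction: str,
                           task_marker: Optional[str] = None,
                           data_marker: Optional[str] = None) -> str:
    """At inference, markers act as optional control levers.

    If omitted, the fine-tuned model infers and activates them on its own.
    """
    prefix = " ".join(m for m in (task_marker, data_marker) if m)
    return f"{prefix} {instruction}".strip()


# Example usage (hypothetical markers):
train_ex = build_training_example(
    "Fix the off-by-one error in this loop.",
    "for i in range(n): ...",
    task_marker="<task:code_repair>",
    data_marker="<len:short>",
)
prompt_controlled = build_inference_prompt(
    "Summarize this article in two sentences.",
    task_marker="<task:summarization>",
    data_marker="<len:short>",
)
prompt_uncontrolled = build_inference_prompt("Summarize this article.")
```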
📝 Abstract
One of the most profound challenges of modern machine learning is performing well on the long tail of rare and underrepresented features. Large general-purpose models are trained for many tasks, but work best on high-frequency use cases. After training, it is hard to adapt a model to perform well on specific use cases underrepresented in the training corpus. Relying on prompt engineering or few-shot examples to maximize output quality on a particular test case can be frustrating, as models can be highly sensitive to small changes, react in unpredictable ways, or rely on a fixed system prompt to maintain performance. In this work, we ask: "Can we optimize our training protocols to both improve controllability and performance on underrepresented use cases at inference time?" We revisit the divide between training and inference techniques to improve long-tail performance while providing users with a set of control levers the model is trained to be responsive to. We create a detailed taxonomy of data characteristics and task provenance to explicitly control generation attributes and implicitly condition generations at inference time. We fine-tune a base model to infer these markers automatically, which makes them optional at inference time. This principled and flexible approach yields pronounced improvements in performance, especially on examples from the long tail of the training distribution. While we observe an average lift of 5.7% in win rates for open-ended generation quality with our markers, we see gains of over 9.1% in underrepresented domains. We also observe relative lifts of up to 14.1% on underrepresented tasks like CodeRepair and absolute improvements of 35.3% on length instruction following evaluations.
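As a rough illustration of how taxonomy markers could be given their own learnable embeddings in a standard fine-tuning stack, here is a sketch using the Hugging Face transformers API. The base model and marker strings are placeholders chosen for illustration, not the setup reported in the paper.

```python
# Sketch under assumptions: a generic Hugging Face fine-tuning setup with
# hypothetical marker strings, not the authors' code.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder base model for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Register taxonomy markers as special tokens so each gets its own embedding
# row, which is then optimized during supervised fine-tuning.
markers = ["<task:code_repair>", "<domain:legal>", "<len:short>"]
tokenizer.add_special_tokens({"additional_special_tokens": markers})
model.resize_token_embeddings(len(tokenizer))

# Training examples would include these markers in the input text; also
# training the model to predict the markers when they are absent is what
# makes them optional control levers at inference time.
```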