🤖 AI Summary
This work addresses the challenge that small language models (SLMs), constrained by limited parameter capacity, are prone to factual errors and must judiciously decide between generating responses autonomously and invoking external tools. To this end, the authors propose LaCy, a novel pretraining approach that uses acceptability — whether a high-loss continuation is still a truthful alternative to the ground truth — as the criterion for distinguishing knowledge the model should internalize from tokens better delegated to external resources. By integrating spaCy-based grammar parsing to augment the token-level loss signal, LaCy improves the model's ability to make this distinction during training. Experimental results demonstrate that SLMs trained with LaCy achieve significantly higher FactScores in cascaded generation than those trained with Rho-based or LLM-judge baselines, while offering a simpler implementation and lower training cost.
📝 Abstract
Language models have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter count. The capacity of Small Language Models (SLMs) is especially limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of *which tokens an SLM can and should learn* during pretraining, versus *which ones it should delegate* via a dedicated delegation token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground truth, some tokens are *acceptable* in that they are truthful alternative continuations of a pretraining document, and should not trigger delegation even if their loss is high. We find that a spaCy grammar parser can help augment the loss signal to decide which tokens the SLM should learn to delegate to prevent factual errors, and which are safe to learn and predict even under high loss. We propose LaCy, a novel pretraining method based on this token-selection philosophy. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and where to delegate for help. This yields higher FactScores when generating in a cascade with a bigger model and outperforms Rho- or LLM-judge-trained SLMs, while being simpler and cheaper.
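The token-selection idea in the abstract can be sketched as follows. This is an illustrative sketch only, not the paper's actual implementation: the loss threshold, the function name, and the per-token `acceptable` flags (which LaCy derives from a spaCy grammar parse of the continuation) are all assumptions made for the example.

```python
# Sketch of loss-plus-acceptability token selection (hypothetical).
# The threshold and the acceptability signal (in LaCy, informed by a
# spaCy grammar parse) are stand-ins for the paper's actual method.

def select_targets(losses, acceptable, threshold=2.0):
    """For each pretraining token, decide whether the SLM should learn
    to predict it, or instead learn to emit a delegation token.

    losses     -- per-token cross-entropy losses of the SLM
    acceptable -- True if the SLM's continuation, though different from
                  the ground truth, is a truthful alternative
    threshold  -- loss above which the prediction is assumed to mismatch
    """
    targets = []
    for loss, ok in zip(losses, acceptable):
        if loss <= threshold:
            targets.append("learn")     # low loss: safe to predict
        elif ok:
            targets.append("learn")     # high loss, but a truthful alternative
        else:
            targets.append("delegate")  # high loss, likely factual error
    return targets

print(select_targets([0.3, 3.1, 4.0], [False, True, False]))
# -> ['learn', 'learn', 'delegate']
```

The middle token shows the key point of the method: a high loss alone does not force delegation when the continuation is still an acceptable, truthful alternative.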