Semantic Smoothing for Language Models via Distribution Estimation and Embeddings

📅 2026-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of traditional language model smoothing techniques: they do not share statistical information among semantically similar contexts, which hurts most under data sparsity. The authors propose semantic smoothing, which treats proximity of context embeddings as side information for distribution estimation under KL loss and combines the two through an interpolation estimator with a worst-case KL risk bound. The estimator extends to multiple and empirically estimated synonymous distributions, and it can be applied on top of classical estimates such as additive (add-constant) and Kneser-Ney smoothing. Experiments on synthetic Markov data and on WikiText-103 bigram models with Word2Vec, GloVe, and GPT-2 embeddings show that the method consistently reduces test perplexity relative to those baseline estimates.
📝 Abstract
We propose semantic smoothing, a smoothing method for language models that uses embeddings to share statistical observations across semantically similar contexts. The starting point is a decomposition of log-perplexity that motivates smoothing as a collection of distribution-estimation problems under Kullback-Leibler (KL) loss. We then show that, under a Lipschitz-logit model for embedding-based language generation, proximity of context embeddings implies proximity of the corresponding next-word distributions in KL divergence. Combining these observations, we formulate semantic smoothing as distribution estimation in KL loss with KL-proximity side information. For $n$ samples on a $d$-symbol alphabet with a side-information distribution at KL distance $\Delta$, we give an interpolation estimator with worst-case KL risk $O(\min\{\Delta, d/n\})$, and prove a matching-order lower bound for uniform side information. We extend the estimator to multiple and empirically estimated synonymous distributions. Experiments on synthetic Markov data and WikiText-103 bigram models using Word2Vec, GloVe, and GPT-2 embeddings show that semantic smoothing consistently reduces test perplexity when applied to add-constant and Kneser-Ney estimates.
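The interpolation idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's estimator: the abstract does not give the interpolation weight, so `weight` is left as a free, hypothetical parameter, and the function names, the `beta` pseudo-count, and the example counts are all assumptions made for illustration. The sketch mixes an add-constant estimate for a sparse context with a next-word distribution borrowed from a semantically similar context, and measures the result in KL divergence.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p[i] == 0 contribute zero."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def add_constant_estimate(counts, beta=1.0):
    """Add-constant smoothing: (count + beta) / (n + beta * d)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + beta) / (counts.sum() + beta * counts.size)


def interpolate_with_side_info(counts, side_info, weight, beta=1.0):
    """Convex combination of an add-constant estimate and a side-information
    distribution. `weight` is a hypothetical tuning parameter, not the
    weight used in the paper."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    base = add_constant_estimate(counts, beta)
    side = np.asarray(side_info, dtype=float)
    return (1.0 - weight) * base + weight * side


if __name__ == "__main__":
    # Hypothetical data: a context seen only 3 times on a 4-word vocabulary,
    # and the next-word distribution of a semantically similar context.
    counts = [2, 0, 1, 0]
    side_info = [0.5, 0.1, 0.3, 0.1]
    true_dist = [0.4, 0.1, 0.4, 0.1]  # assumed ground truth for the demo

    for w in (0.0, 0.5, 1.0):
        est = interpolate_with_side_info(counts, side_info, w)
        print(f"weight={w:.1f}  KL(true || estimate)={kl_divergence(true_dist, est):.4f}")
```

Because KL divergence is convex in its second argument, the KL loss of the mixture is never worse than the same weighted average of the losses of its two components, which is one reason interpolation is a natural way to combine a count-based estimate with side information.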
Problem

Research questions and friction points this paper is trying to address.

language models
distribution estimation
semantic smoothing
KL divergence
data sparsity
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic smoothing
distribution estimation
KL divergence
embedding-based language models
perplexity reduction