IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation

📅 2026-04-06

🤖 AI Summary

This work addresses a limitation of existing sentence representations: they predominantly capture semantic content and neglect individual linguistic style and dialectal variation, both of which matter for applications sensitive to how language is used. The authors propose IDIOLEX, a framework that jointly models linguistic variation at the individual and community levels. By combining supervision from a sentence's provenance (speaker- or community-level metadata) with linguistic features of its content, IDIOLEX learns continuous representations of style and dialect that are decoupled from semantics. The approach is evaluated on dialects of Arabic and Spanish, where the learned representations capture meaningful variation and transfer across domains for analysis and classification. The authors also explore using these representations as a training objective for stylistically aligning large language models.

📝 Abstract

Existing sentence representations primarily encode what a sentence says, rather than how it is expressed, even though the latter is important for many applications. In contrast, we develop sentence representations that capture style and dialect, decoupled from semantic content. We call this the task of idiolectal representation learning. We introduce IDIOLEX, a framework for training models that combines supervision from a sentence's provenance with linguistic features of a sentence's content, to learn a continuous representation of each sentence's style and dialect. We evaluate the approach on dialects of both Arabic and Spanish. The learned representations capture meaningful variation and transfer across domains for analysis and classification. We further explore the use of these representations as training objectives for stylistically aligning language models. Our results suggest that jointly modeling individual and community-level variation provides a useful perspective for studying idiolect and supports downstream applications requiring sensitivity to stylistic differences, such as developing diverse and accessible LLMs.
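The abstract describes combining two signals, a sentence's provenance and linguistic features of its content, into one training setup. The sketch below illustrates one way such a joint objective could look; it is not the authors' method. The contrastive form of the provenance term, the mean-squared-error feature term, the `alpha` weight, the `margin`, and all function names are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def provenance_loss(embs, sources, margin=0.5):
    """Hypothetical contrastive term over all sentence pairs.

    Pairs from the same source (speaker or community) are pulled together;
    pairs from different sources are pushed at least `margin` apart
    in cosine distance.
    """
    total, pairs = 0.0, 0
    for i in range(len(embs)):
        for j in range(i + 1, len(embs)):
            dist = 1.0 - cosine(embs[i], embs[j])
            if sources[i] == sources[j]:
                total += dist ** 2
            else:
                total += max(0.0, margin - dist) ** 2
            pairs += 1
    return total / pairs if pairs else 0.0


def feature_loss(predicted, observed):
    """Hypothetical term: mean squared error between linguistic feature
    vectors predicted from the embedding and those observed in the text."""
    total, count = 0.0, 0
    for pred_row, obs_row in zip(predicted, observed):
        for p, o in zip(pred_row, obs_row):
            total += (p - o) ** 2
            count += 1
    return total / count if count else 0.0


def joint_loss(embs, sources, predicted, observed, alpha=0.5):
    """Weighted sum of the two terms; `alpha` is an assumed hyperparameter."""
    return (alpha * provenance_loss(embs, sources)
            + (1.0 - alpha) * feature_loss(predicted, observed))
```

In this sketch, `embs` would hold one style embedding per sentence and `sources` the matching speaker or community labels, so a batch mixing several sources exercises both branches of the provenance term.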

Problem

Research questions and friction points this paper is trying to address.

idiolectal representation

stylistic variation

sentence representation

dialect

style disentanglement

Innovation

Methods, ideas, or system contributions that make the work stand out.

idiolectal representation

stylistic variation

dialect modeling