🤖 AI Summary
This work addresses the limited controllability and interpretability of large language models (LLMs). It proposes a joint control-and-self-explanation framework grounded in *neologisms*: novel, learnable tokens introduced into the model's vocabulary. Abstract behavioral concepts (e.g., flattery, incorrect answers, response length) are encoded via trainable neologism embeddings; through self-verbalization, the model then generates a natural-language definition of each neologism. The framework further includes plug-in evaluation to validate these definitions and joint training of multiple concepts in multiple words. Key contributions include: (1) the discovery of machine-only synonyms, words that seem unrelated to humans but elicit similar behavior in the model; (2) fine-grained, composable behavioral control through single learned embeddings; and (3) dual capability: executing neologism-based instructions *and* generating semantically consistent explanations of them. Experiments demonstrate significant improvements in both behavioral control accuracy and explanation fidelity.
📝 Abstract
Humans invent new words when there is rising demand for a new, useful concept (e.g., doomscrolling). We explore and validate a similar idea in our communication with LLMs: introducing new words to better understand and control the models, expanding on the recently introduced neologism learning. This method introduces a new word by adding a new word embedding and training it on examples that exhibit the concept, with no other changes to model parameters. We show that adding a new word allows for control of concepts such as flattery, incorrect answers, and text length, as well as more complex concepts in AxBench. We discover that neologisms can also further our understanding of the model via self-verbalization: models can describe what each new word means to them in natural language, for example explaining that a word representing the concept of incorrect answers means "a lack of complete, coherent, or meaningful answers...". To validate self-verbalizations, we introduce plug-in evaluation: we insert the verbalization into the context of a model and measure whether it controls the target concept. In some self-verbalizations, we find machine-only synonyms: words that seem unrelated to humans but cause similar behavior in machines. Finally, we show how neologism learning can jointly learn multiple concepts in multiple words.
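To make the training setup concrete, here is a minimal sketch of the core idea on a toy linear "model" rather than an actual LLM. Everything here (the dimensionality, the frozen concept direction, the loss) is a hypothetical stand-in for illustration: the base parameters stay frozen, and the only trainable parameter is the embedding vector of the newly added word.

```python
# Toy sketch of neologism learning (hypothetical setup, not the paper's code):
# the base "model" is a frozen concept direction; gradient descent updates ONLY
# the new word's embedding so that the token comes to activate the concept.

dim = 4
concept_dir = [0.5, 0.5, 0.5, 0.5]   # frozen: direction the toy model ties to the concept

new_embedding = [0.0] * dim           # the neologism's embedding: the only trainable parameter
lr = 0.1
for _ in range(300):
    # squared-error loss pushing the neologism's concept score toward 1.0
    score = sum(e * c for e, c in zip(new_embedding, concept_dir))
    grad_scale = 2 * (score - 1.0)    # d/d(score) of (score - 1)^2
    # chain rule: gradient w.r.t. the embedding is grad_scale * concept_dir
    new_embedding = [e - lr * grad_scale * c
                     for e, c in zip(new_embedding, concept_dir)]

score = sum(e * c for e, c in zip(new_embedding, concept_dir))
print(round(score, 3))  # → 1.0: the trained word now activates the concept
```

In the real method the same principle applies at LLM scale: a new row is appended to the input embedding matrix, and standard language-model training on concept-exhibiting examples updates that row alone while every other parameter stays fixed.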