Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of effectively steering large language models and mitigating hallucinations without fine-tuning. The authors propose Concept Tokens—a method that freezes the pretrained model parameters while introducing learnable special tokens whose embeddings are optimized to distill semantic signals from multiple natural language definitions of a target concept. Evaluated on HotpotQA, the approach enables controllable regulation of hallucinatory behavior. In a second-language teaching feedback task, it outperforms conventional in-context learning that supplies full textual definitions. Case studies further reveal both the semantic capabilities and inherent limitations of the learned concept embeddings, demonstrating that meaningful conceptual guidance can be achieved through minimal, targeted parameter updates.
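The core recipe is simple enough to sketch. Below is a minimal, hypothetical implementation assuming a HuggingFace causal LM: all pretrained weights are frozen, one new special token is added, and only its embedding row receives gradient updates from the standard language-modeling loss over the concept definitions. The backbone (`gpt2`), the `<concept>` token name, the definition strings, and the hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of learning a single concept-token embedding in a frozen LLM.
# Everything concrete here (model, token name, data, learning rate) is assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder backbone; the paper's model may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the new special token and grow the embedding matrix by one row.
tokenizer.add_special_tokens({"additional_special_tokens": ["<concept>"]})
model.resize_token_embeddings(len(tokenizer))
concept_id = tokenizer.convert_tokens_to_ids("<concept>")

# Freeze every pretrained parameter; only the new embedding row should move.
for p in model.parameters():
    p.requires_grad = False
embeddings = model.get_input_embeddings()
embeddings.weight.requires_grad = True  # the whole matrix gets gradients...

# ...so mask them to zero for every row except the concept token's.
def mask_grad(grad):
    mask = torch.zeros_like(grad)
    mask[concept_id] = 1.0
    return grad * mask

embeddings.weight.register_hook(mask_grad)

# Hypothetical definitions with the concept word replaced by the new token.
definitions = [
    "<concept> is the generation of fluent but unsupported content.",
    "A model exhibits <concept> when it asserts facts absent from its sources.",
]

optimizer = torch.optim.Adam([embeddings.weight], lr=1e-3)
model.train()
for epoch in range(10):
    for text in definitions:
        batch = tokenizer(text, return_tensors="pt")
        # Standard language-modeling objective: labels are the inputs.
        out = model(**batch, labels=batch["input_ids"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Because the gradient hook zeroes every row but one, the optimizer only ever updates the single new embedding vector, which is what makes the method a "minimal, targeted parameter update" over a frozen model.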

📝 Abstract
We propose Concept Tokens, a lightweight method that adds a new special token to a pretrained LLM and learns only its embedding from multiple natural language definitions of a target concept, where occurrences of the concept are replaced by the new token. The LLM is kept frozen and the embedding is optimized with the standard language-modeling objective. We evaluate Concept Tokens in three settings. First, we study hallucinations in closed-book question answering on HotpotQA and find a directional effect: negating the hallucination token reduces hallucinated answers mainly by increasing abstentions, whereas asserting it increases hallucinations and lowers precision. Second, we induce recasting, a pedagogical feedback strategy for second language teaching, and observe the same directional effect. Moreover, compared to providing the full definitional corpus in-context, concept tokens better preserve compliance with other instructions (e.g., asking follow-up questions). Finally, we include a qualitative study with the Eiffel Tower and a fictional "Austral Tower" to illustrate what information the learned embeddings capture and where their limitations emerge. Overall, Concept Tokens provide a compact control signal learned from definitions that can steer behavior in frozen LLMs.
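At inference time, the learned token can be dropped into a prompt like an ordinary word and either asserted or negated to push behavior in opposite directions, as the abstract describes. The sketch below (reusing the model and tokenizer from the training sketch above) illustrates that usage; the exact prompt templates are assumptions, not the paper's wording.

```python
# Hedged sketch of steering with the learned token; prompt wording is assumed.
question = "Which city hosted the 1936 Winter Olympics?"

# Asserting the concept: instruct the model *to* exhibit the behavior
# (reported to increase hallucinations and lower precision).
assert_prompt = f"You should <concept>. Answer the question: {question}"

# Negating the concept: instruct the model *not* to exhibit it
# (reported to reduce hallucinated answers, mainly via more abstentions).
negate_prompt = f"You should not <concept>. Answer the question: {question}"

inputs = tokenizer(negate_prompt, return_tensors="pt")
with torch.no_grad():
    out_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out_ids[0], skip_special_tokens=False))
```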
Problem

Research questions and friction points this paper is trying to address.

concept tokens
behavioral control
hallucination reduction
frozen LLMs
instruction compliance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Concept Tokens
behavioral embeddings
frozen LLM
hallucination control
instruction compliance
Ignacio Sastre
Instituto de Computación, Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay
Aiala Rosá
Universidad de la República
Natural Language Processing