Uncovering Gaps in How Humans and LLMs Interpret Subjective Language

📅 2025-03-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper identifies systematic misalignment between how humans and large language models (LLMs) interpret subjective phrases (e.g., "witty," "enthusiastic"), undermining intent alignment. The proposed method, TED (thesaurus error detector), constructs a thesaurus capturing which phrases the LLM treats as having similar operational semantics, then surfaces failures as disagreements between this thesaurus and a human-constructed reference, without directly supervising model outputs. Experiments on Mistral 7B Instruct and Llama 3 8B Instruct reveal counterintuitive, safety-relevant failures: editing text to be "witty" elicits more harassing outputs, and requesting "enthusiastic" articles yields dishonest ones. These results show that misaligned subjective semantics pose a concrete threat to model safety and controllability.

📝 Abstract
Humans often rely on subjective natural language to direct language models (LLMs); for example, users might instruct the LLM to write an enthusiastic blogpost, while developers might train models to be helpful and harmless using LLM-based edits. The LLM's operational semantics of such subjective phrases -- how it adjusts its behavior when each phrase is included in the prompt -- thus dictates how aligned it is with human intent. In this work, we uncover instances of misalignment between LLMs' actual operational semantics and what humans expect. Our method, TED (thesaurus error detector), first constructs a thesaurus that captures whether two phrases have similar operational semantics according to the LLM. It then elicits failures by unearthing disagreements between this thesaurus and a human-constructed reference. TED routinely produces surprising instances of misalignment; for example, Mistral 7B Instruct produces more harassing outputs when it edits text to be witty, and Llama 3 8B Instruct produces dishonest articles when instructed to make the articles enthusiastic. Our results demonstrate that humans can uncover unexpected LLM behavior by scrutinizing relationships between abstract concepts, without supervising outputs directly.
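The two-step procedure the abstract describes can be illustrated with a toy sketch. This is not the paper's implementation: the behavior vectors, phrase set, similarity threshold, and the `ted_disagreements` helper are all hypothetical stand-ins (in practice the LLM's operational semantics would be measured from its actual outputs), but the logic mirrors TED's core idea of flagging pairs where the model's thesaurus and a human reference disagree.

```python
from itertools import combinations
import math

# Hypothetical "operational semantics" profiles: how strongly including each
# phrase in a prompt shifts model behavior along a few toy dimensions
# (say, humor, aggression, excitement). These numbers are illustrative only.
behavior = {
    "witty":        [0.90, 0.80, 0.10],
    "funny":        [0.80, 0.20, 0.00],
    "harassing":    [0.85, 0.90, 0.10],
    "enthusiastic": [0.20, 0.10, 0.90],
}

# Human-constructed reference: pairs that humans consider near-synonyms.
human_synonyms = {frozenset({"witty", "funny"})}

def cosine(u, v):
    """Cosine similarity between two behavior vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ted_disagreements(behavior, human_synonyms, threshold=0.95):
    """Return phrase pairs where the LLM thesaurus (built by thresholding
    behavioral similarity) disagrees with the human reference thesaurus."""
    failures = []
    for a, b in combinations(sorted(behavior), 2):
        llm_says_synonym = cosine(behavior[a], behavior[b]) >= threshold
        human_says_synonym = frozenset({a, b}) in human_synonyms
        if llm_says_synonym != human_says_synonym:
            failures.append((a, b))
    return failures

# With the toy numbers above, the model treats "witty" as behaviorally
# closer to "harassing" than to "funny" -- the kind of surprising
# disagreement TED is designed to surface for human inspection.
print(ted_disagreements(behavior, human_synonyms))
```

Each flagged pair is a candidate misalignment: either the LLM conflates concepts humans distinguish, or it separates concepts humans equate. The paper's actual method then probes these disagreements to elicit concrete failure outputs.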
Problem

Research questions and friction points this paper is trying to address.

Misalignment between LLMs' operational semantics and human expectations.
Detecting discrepancies in how LLMs interpret subjective language phrases.
Uncovering unexpected LLM behaviors through abstract concept relationships.
Innovation

Methods, ideas, or system contributions that make the work stand out.

TED detects LLM-human semantic misalignments.
Constructs a thesaurus of LLM operational semantics.
Identifies unexpected LLM behaviors via abstract concept relationships.