Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models

πŸ“… 2026-01-06
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study investigates how metaphors induce alignment biases in large language models during cross-domain reasoning. The authors demonstrate that metaphors present in training data can disrupt a model's reasoning pathways, leading to failures in cross-domain alignment. They establish, for the first time, a causal link between metaphor usage and alignment bias, and introduce an analytical framework based on metaphor intervention. Their approach combines causal intervention techniques, latent-variable feature-activation analysis, and a novel detector that identifies alignment biases in hidden-layer representations. Experimental results show that metaphor-based interventions effectively modulate cross-domain alignment behavior and that the proposed detector predicts misaligned content with high precision. This work offers a new perspective on understanding and mitigating alignment risks in large language models.
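The sketch below is a hedged illustration of the hidden-layer monitoring idea: a simple linear probe trained on latent activations to flag misaligned content. The paper does not specify the detector's architecture, layer choice, or training labels, so every name and design decision here is an assumption, not the authors' implementation.

```python
# Hypothetical sketch of a hidden-layer misalignment detector: a logistic
# probe over latent activations. Layer choice, pooling, and labels are
# illustrative assumptions; the paper's actual detector is not shown here.
import torch
import torch.nn as nn


class LatentProbe(nn.Module):
    """Logistic-regression probe over one layer's pooled activations."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) mean-pooled hidden states from one layer
        return torch.sigmoid(self.linear(h))


def train_probe(activations: torch.Tensor, labels: torch.Tensor,
                epochs: int = 20, lr: float = 1e-3) -> LatentProbe:
    """Fit the probe on (activation, misaligned?) pairs."""
    probe = LatentProbe(activations.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(activations).squeeze(-1), labels.float())
        loss.backward()
        opt.step()
    return probe
```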

πŸ“ Abstract
Earlier research has shown that metaphors influence humans' decision making, which raises the question of whether metaphors also influence large language models' (LLMs') reasoning pathways, given that their training data contain a large number of metaphors. In this work, we investigate the problem in the scope of the emergent misalignment problem, where LLMs can generalize patterns learned from misaligned content in one domain to another. We discover a strong causal relationship between metaphors in training data and the degree of misalignment in LLMs' reasoning content. With interventions using metaphors in the pre-training, fine-tuning, and re-alignment phases, models' cross-domain misalignment degrees change significantly. As we delve deeper into the causes of this phenomenon, we observe a connection between metaphors and the activation of global and local latent features of large reasoning models. By monitoring these latent features, we design a detector that predicts misaligned content with high accuracy.
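As a hedged sketch of the causal-intervention setup described in the abstract: fine-tune one copy of a model on the original corpus and another on a copy whose metaphorical sentences have been paraphrased literally, then take the difference in cross-domain misalignment as the intervention effect. `fine_tune`, `has_metaphor`, `paraphrase_literal`, and `misalignment_score` are hypothetical helpers standing in for components the paper does not detail here.

```python
# Minimal sketch of a metaphor-intervention experiment, under the
# assumptions stated above. All helper callables are hypothetical.
from typing import Callable, List


def intervention_effect(base_model,
                        corpus: List[str],
                        has_metaphor: Callable[[str], bool],
                        paraphrase_literal: Callable[[str], str],
                        fine_tune: Callable,
                        misalignment_score: Callable) -> float:
    """Estimate the causal effect of metaphors on cross-domain misalignment."""
    # Treatment: fine-tune on the corpus as-is (metaphors present).
    treated = fine_tune(base_model, corpus)
    # Control: replace metaphorical sentences with literal paraphrases.
    literal = [paraphrase_literal(s) if has_metaphor(s) else s for s in corpus]
    control = fine_tune(base_model, literal)
    # The difference in held-out misalignment scores is the effect estimate.
    return misalignment_score(treated) - misalignment_score(control)
```
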
Problem

Research questions and friction points this paper is trying to address.

metaphors
cross-domain misalignment
large language models
reasoning pathways
latent features
Innovation

Methods, ideas, or system contributions that make the work stand out.

metaphor
cross-domain misalignment
large language models
latent features
causal intervention