Embeddings for Preferences, Not Semantics

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing text embeddings measure semantic similarity, so they often fail to capture how closely participants' preferences align in collective decision-making. This work proposes a preference-oriented text embedding method and, according to the authors, is the first to formally state the invariance problem in preference embedding. The approach separates the preference-relevant signal (stance and values) from confounding semantic factors (style and wording) so that distances in the vector space reflect preference structure. It uses synthetic data generation to break spurious correlations between semantics and preferences, and combines contrastive learning, a preference prediction scorer, and geometric modeling based on fair clustering and facility location. Evaluated on 11 online deliberation datasets, the method significantly outperforms standard semantic embeddings on preference prediction.
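To make the facility location connection concrete, here is a minimal sketch, assuming opinions have already been embedded as points in a space where distance tracks disagreement. It picks the candidate statement with the smallest total distance to all participants (a 1-median objective). The function name `pick_consensus`, the toy coordinates, and the choice of objective are illustrative assumptions, not the paper's actual procedure.

```python
import math

def pick_consensus(participants, candidates):
    """Return the index of the candidate with the smallest total
    Euclidean distance to all participants (a 1-median objective)."""
    def total_distance(candidate):
        return sum(math.dist(candidate, p) for p in participants)
    return min(range(len(candidates)), key=lambda i: total_distance(candidates[i]))

# Hypothetical 2-D preference embeddings (made up for illustration).
participants = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
candidates = [[0.3, 0.3], [5.0, 5.0]]

best = pick_consensus(participants, candidates)
print("chosen candidate index:", best)
```

This selection is only meaningful if distance is inversely related to agreement; with a purely semantic embedding, the same code would select the statement that is most similar in wording rather than the one participants most agree with.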

📝 Abstract

Modern AI is opening the door to collective decision-making in which participants express their views as free-form text rather than voting on a fixed set of candidates. A natural idea is to embed these opinions in a vector space so that the substantial literature on facility location problems and fair clustering can be brought to bear. But standard text embeddings measure semantic similarity, whereas distances in facility location problems and fair clustering require what we call preferential similarity: a participant's agreement with a piece of text should be inversely related to their distance from it. Off-the-shelf embeddings inherit a coarse preference signal through a correlation between semantic and preferential similarity, but fail to capture preferences when the correlation breaks. We formalize this as an invariance problem: text embedding models encode both a preference-relevant signal (stance and values) and semantic nuisance (style and wording), and the two are observationally correlated, so a geometry that relies on nuisance can appear preference-correct even when it is not. We show that synthetic training data designed to break this correlation provably shifts the optimal scorer away from nuisance-dominated cosine and significantly improves preference prediction across 11 online deliberation datasets.
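The failure mode described in the abstract can be illustrated with a toy example. The sketch below assumes, purely for illustration, a two-dimensional embedding whose first coordinate encodes stance and whose second encodes style; real embeddings do not separate these factors so cleanly, and all vectors here are made up. When the style coordinate dominates, plain cosine similarity ranks a disagreeing statement written in the participant's style above an agreeing statement written in a different style.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def stance_only(v):
    """Keep only the (hypothetical) stance coordinate, dropping style."""
    return v[:1]

# Hypothetical embeddings in the form [stance, style].
participant = [1.0, 3.0]            # pro stance, formal style
agrees_other_style = [1.0, -3.0]    # same stance, casual style
disagrees_same_style = [-1.0, 3.0]  # opposite stance, formal style

# Cosine over the full vector is dominated by the style (nuisance) coordinate.
full_agree = cosine(participant, agrees_other_style)
full_disagree = cosine(participant, disagrees_same_style)

# Cosine over the stance coordinate alone recovers the preference ordering.
stance_agree = cosine(stance_only(participant), stance_only(agrees_other_style))
stance_disagree = cosine(stance_only(participant), stance_only(disagrees_same_style))

print("full cosine ranks the disagreeing text higher:", full_disagree > full_agree)
print("stance-only cosine ranks the agreeing text higher:", stance_agree > stance_disagree)
```

In observational data, stance and style tend to move together, so the full-vector cosine would look preference-correct; the counterexample only appears once the two are decorrelated, which is the role the abstract assigns to synthetic training data.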

Problem

Research questions and friction points this paper is trying to address.

preference embedding

semantic similarity

preferential similarity

nuisance factors

collective decision-making

Innovation

Methods, ideas, or system contributions that make the work stand out.

preferential similarity

text embeddings

invariance