Information Representation Fairness in Long-Document Embeddings: The Peculiar Interaction of Positional and Language Bias

📅 2026-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses pervasive positional and linguistic biases in long-document embeddings, which systematically disadvantage content in later segments and in lower-resource languages during retrieval. The authors introduce a permutation-based evaluation framework to quantitatively measure the over-representation of document prefixes and of high-resource languages (e.g., English) in mainstream embedding models. Building on the finding that positional bias stems from front-loaded attention distributions in pooling-token embeddings, they propose a training-free, inference-time attention calibration mechanism that redistributes attention more evenly across all document positions. Experimental results show that this approach significantly improves the discoverability of later document segments, mitigating representational unfairness in existing embedding models without retraining.
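A minimal sketch of what such a permutation-based probe could look like. Everything here is an illustrative stand-in, not the authors' code: `toy_embed` deliberately over-weights early segments to mimic the front-loaded bias the summary describes, and `positional_discoverability` then measures how well the segment placed at each position is reflected in the document embedding, averaged over random segment orders.

```python
import zlib
import numpy as np

def seg_vec(text, dim=64):
    """Deterministic pseudo-embedding for a text segment (illustrative only)."""
    r = np.random.default_rng(zlib.crc32(text.encode()))
    v = r.standard_normal(dim)
    return v / np.linalg.norm(v)

def toy_embed(segments):
    """Stand-in for a long-document embedder: a position-weighted mean of
    segment vectors, with early positions deliberately over-weighted to
    mimic the front-loaded positional bias reported in the paper."""
    vecs = np.stack([seg_vec(s) for s in segments])
    weights = 1.0 / (1.0 + np.arange(len(segments)))  # early segments dominate
    weights /= weights.sum()
    return weights @ vecs

def positional_discoverability(segments, n_perms=200, seed=0):
    """Average cosine similarity between the segment placed at each position
    and the full-document embedding, over random segment permutations."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(segments))
    for _ in range(n_perms):
        perm = rng.permutation(len(segments))
        doc = toy_embed([segments[i] for i in perm])
        doc = doc / np.linalg.norm(doc)
        for pos, i in enumerate(perm):
            scores[pos] += seg_vec(segments[i]) @ doc
    return scores / n_perms

segments = [f"segment {k}" for k in range(6)]
scores = positional_discoverability(segments)
# Under this biased embedder, early positions score markedly higher than late ones.
```

Because the probe permutes which segment sits at which position, any systematic score gap between positions can be attributed to the position itself rather than to segment content.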

📝 Abstract
To be discoverable in an embedding-based search process, each part of a document should be reflected in its embedding representation. To quantify any potential reflection biases, we introduce a permutation-based evaluation framework. With this, we observe that state-of-the-art embedding models exhibit systematic positional and language biases when documents are longer and consist of multiple segments. Specifically, early segments and segments in higher-resource languages like English are over-represented, while later segments and segments in lower-resource languages are marginalized. In our further analysis, we find that the positional bias stems from front-loaded attention distributions in pooling-token embeddings, where early tokens receive more attention. To mitigate this issue, we introduce an inference-time attention calibration method that redistributes attention more evenly across document positions, increasing discoverability of later segments. Our evaluation framework and attention calibration are available at https://github.com/impresso/fair-sentence-transformers
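The calibration idea described in the abstract can be sketched in a few lines. This is a simplified illustration under one assumption: that flattening the pooling token's attention can be modeled as interpolating it with a uniform distribution; the paper's exact calibration may differ.

```python
import numpy as np

def calibrate_attention(attn, alpha=0.5):
    """Blend the pooling token's attention over content tokens with a
    uniform distribution, then renormalize. alpha=0 keeps the original
    front-loaded weights; alpha=1 pools uniformly. (Interpolation with a
    uniform distribution is an assumption for illustration.)"""
    attn = np.asarray(attn, dtype=float)
    uniform = np.full_like(attn, 1.0 / attn.size)
    mixed = (1.0 - alpha) * attn + alpha * uniform
    return mixed / mixed.sum()

def pooled_embedding(token_states, attn):
    """Pooling-token embedding as an attention-weighted mean of token
    hidden states (one common pooling scheme)."""
    return np.asarray(attn) @ np.asarray(token_states)

# A front-loaded attention profile over 8 tokens, as described in the abstract.
attn = np.array([0.40, 0.25, 0.12, 0.08, 0.06, 0.04, 0.03, 0.02])
flat = calibrate_attention(attn, alpha=0.5)
# After calibration, late tokens receive more weight and early tokens less,
# so later segments contribute more to the pooled document embedding.
```

Being a pure reweighting at inference time, this kind of calibration requires no retraining, which matches the training-free property the abstract emphasizes.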
Problem

Research questions and friction points this paper is trying to address.

information representation fairness
long-document embeddings
positional bias
language bias
embedding-based search
Innovation

Methods, ideas, or system contributions that make the work stand out.

embedding fairness
positional bias
attention calibration
long-document embeddings
multilingual bias
Elias Schuhmacher
Department of Computational Linguistics, University of Zurich
Andrianos Michail
Department of Computational Linguistics, University of Zurich
Juri Opitz
University of Zurich
Rico Sennrich
Department of Computational Linguistics, University of Zurich
Simon Clematide
Department of Computational Linguistics, University of Zurich