🤖 AI Summary
This work addresses pervasive positional and linguistic biases in long-document embeddings, which systematically disadvantage content in later segments and low-resource languages during retrieval. The authors introduce, for the first time, a permutation-based evaluation framework to quantitatively measure the overrepresentation of document prefixes and high-resource languages (e.g., English) in mainstream embedding models. Building on this analysis, they propose a training-free, inference-time attention calibration mechanism that rebalances attention distributions across all document positions. Experimental results demonstrate that this approach significantly enhances the discoverability of trailing document segments and text in low-resource languages, thereby mitigating representational unfairness in existing embedding models without requiring model retraining.
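The summary does not spell out the calibration formula, but the core idea, rebalancing a front-loaded pooling-attention distribution toward later positions at inference time, can be sketched with a toy example. Everything here (the `alpha` mixing parameter, the blend-with-uniform rule, the toy attention values) is an illustrative assumption, not the paper's actual method:

```python
import numpy as np

def calibrate_attention(attn, alpha=0.5):
    """Blend pooling attention with a uniform distribution.

    `alpha` is a hypothetical strength knob: 0 keeps the original
    (front-loaded) attention, 1 makes it fully uniform. This is a
    stand-in for the paper's inference-time recalibration, not its
    exact formula.
    """
    uniform = np.full_like(attn, 1.0 / attn.size)
    mixed = (1.0 - alpha) * attn + alpha * uniform
    return mixed / mixed.sum()  # renormalize to a distribution

def pool(token_embs, attn):
    """Attention-weighted pooling of token embeddings into one document vector."""
    return (attn[:, None] * token_embs).sum(axis=0)

# Toy front-loaded attention over 6 token positions: early tokens dominate.
attn = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
rng = np.random.default_rng(1)
token_embs = rng.normal(size=(6, 8))  # 6 tokens, 8-dim embeddings

calibrated = calibrate_attention(attn, alpha=0.5)
print(np.round(calibrated, 3))  # → [0.283 0.208 0.158 0.133 0.113 0.103]
doc_vec = pool(token_embs, calibrated)
```

After calibration the last position's weight rises (0.04 → ~0.10) while the first drops (0.40 → ~0.28), so trailing content contributes more to the pooled document vector, which is the effect the summary describes.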
📝 Abstract
To be discoverable in an embedding-based search process, each part of a document should be reflected in its embedding representation. To quantify potential reflection biases, we introduce a permutation-based evaluation framework. Using it, we observe that state-of-the-art embedding models exhibit systematic positional and language biases when documents are long and consist of multiple segments. Specifically, early segments and segments in higher-resource languages such as English are over-represented, while later segments and segments in lower-resource languages are marginalized. In further analysis, we find that the positional bias stems from front-loaded attention distributions in pooling-token embeddings, where early tokens receive a disproportionate share of attention. To mitigate this issue, we introduce an inference-time attention calibration method that redistributes attention more evenly across document positions, increasing the discoverability of later segments. Our evaluation framework and attention calibration method are available at https://github.com/impresso/fair-sentence-transformers
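The abstract does not detail the permutation protocol, but the measurement idea, shuffling segment order and checking how well each segment remains reflected in the document embedding as a function of the position it lands in, can be sketched as follows. The toy embedder, its position-decay weights, and the segment names are all hypothetical stand-ins for a real embedding model:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

segments = ["seg_a", "seg_b", "seg_c", "seg_d"]
seg_vec = {s: rng.normal(size=32) for s in segments}  # fixed random segment vectors
decay = np.array([1.0, 0.6, 0.35, 0.2])  # assumed front-loaded position weights

def toy_embed(ordered_segments, weights):
    """Toy document embedder: a position-weighted mean of segment vectors,
    with weights decaying by position to mimic the front-loaded attention
    bias the paper describes. A real model would replace this."""
    vecs = np.stack([seg_vec[s] for s in ordered_segments])
    w = weights[: len(ordered_segments)]
    return (w[:, None] * vecs).sum(axis=0) / w.sum()

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# For every permutation of segment order, measure how strongly each segment's
# own embedding is reflected in the document embedding, bucketed by the
# position the segment occupied in that permutation.
by_position = {i: [] for i in range(len(segments))}
for perm in itertools.permutations(segments):
    doc = toy_embed(list(perm), decay)
    for pos, seg in enumerate(perm):
        by_position[pos].append(cosine(seg_vec[seg], doc))

means = [float(np.mean(by_position[i])) for i in range(len(segments))]
print([round(m, 3) for m in means])  # similarity drops with position
```

Because the toy embedder front-loads its weights, the per-position mean similarity decreases monotonically: content placed later is systematically less reflected in the document embedding, i.e. less discoverable, which is exactly the bias this framework is designed to expose in real models.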