🤖 AI Summary
This work addresses pervasive positional and linguistic biases in long-document embeddings, which systematically disadvantage content in later segments and low-resource languages during retrieval. The authors introduce, for the first time, a permutation-based evaluation framework to quantitatively measure the overrepresentation of document prefixes and high-resource languages (e.g., English) in mainstream embedding models. Building on this analysis, they propose a training-free, inference-time attention calibration mechanism that rebalances attention distributions across all document positions. Experimental results demonstrate that this approach significantly enhances the discoverability of trailing document segments and text in low-resource languages, thereby mitigating representational unfairness in existing embedding models without requiring model retraining.
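The summary does not spell out the calibration formula, but the core idea, rebalancing a front-loaded pooling-attention distribution toward later positions at inference time, can be sketched with a toy example. Everything here (the `alpha` mixing parameter, the blend-with-uniform rule, the toy attention values) is an illustrative assumption, not the paper's actual method:

```python
import numpy as np

def calibrate_attention(attn, alpha=0.5):
    """Blend pooling attention with a uniform distribution.

    `alpha` is a hypothetical strength knob: 0 keeps the original
    (front-loaded) attention, 1 makes it fully uniform. This is a
    stand-in for the paper's inference-time recalibration, not its
    exact formula.
    """
    uniform = np.full_like(attn, 1.0 / attn.size)
    mixed = (1.0 - alpha) * attn + alpha * uniform
    return mixed / mixed.sum()  # renormalize to a distribution

def pool(token_embs, attn):
    """Attention-weighted pooling of token embeddings into one document vector."""
    return (attn[:, None] * token_embs).sum(axis=0)

# Toy front-loaded attention over 6 token positions: early tokens dominate.
attn = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
rng = np.random.default_rng(1)
token_embs = rng.normal(size=(6, 8))  # 6 tokens, 8-dim embeddings

calibrated = calibrate_attention(attn, alpha=0.5)
print(np.round(calibrated, 3))  # → [0.283 0.208 0.158 0.133 0.113 0.103]
doc_vec = pool(token_embs, calibrated)
```

After calibration the last position's weight rises (0.04 → ~0.10) while the first drops (0.40 → ~0.28), so trailing content contributes more to the pooled document vector, which is the effect the summary describes.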
📝 Abstract
To be discoverable in an embedding-based search process, each part of a document should be reflected in its embedding representation. To quantify potential reflection biases, we introduce a permutation-based evaluation framework. Using it, we observe that state-of-the-art embedding models exhibit systematic positional and language biases when documents are long and consist of multiple segments. Specifically, early segments and segments in higher-resource languages such as English are over-represented, while later segments and segments in lower-resource languages are marginalized. In further analysis, we find that the positional bias stems from front-loaded attention distributions in pooling-token embeddings, where early tokens receive a disproportionate share of attention. To mitigate this issue, we introduce an inference-time attention calibration method that redistributes attention more evenly across document positions, increasing the discoverability of later segments. Our evaluation framework and attention calibration method are available at https://github.com/impresso/fair-sentence-transformers
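The abstract does not detail the permutation protocol, but the measurement idea, shuffling segment order and checking how well each segment remains reflected in the document embedding as a function of the position it lands in, can be sketched as follows. The toy embedder, its position-decay weights, and the segment names are all hypothetical stand-ins for a real embedding model:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

segments = ["seg_a", "seg_b", "seg_c", "seg_d"]
seg_vec = {s: rng.normal(size=32) for s in segments}  # fixed random segment vectors
decay = np.array([1.0, 0.6, 0.35, 0.2])  # assumed front-loaded position weights

def toy_embed(ordered_segments, weights):
    """Toy document embedder: a position-weighted mean of segment vectors,
    with weights decaying by position to mimic the front-loaded attention
    bias the paper describes. A real model would replace this."""
    vecs = np.stack([seg_vec[s] for s in ordered_segments])
    w = weights[: len(ordered_segments)]
    return (w[:, None] * vecs).sum(axis=0) / w.sum()

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# For every permutation of segment order, measure how strongly each segment's
# own embedding is reflected in the document embedding, bucketed by the
# position the segment occupied in that permutation.
by_position = {i: [] for i in range(len(segments))}
for perm in itertools.permutations(segments):
    doc = toy_embed(list(perm), decay)
    for pos, seg in enumerate(perm):
        by_position[pos].append(cosine(seg_vec[seg], doc))

means = [float(np.mean(by_position[i])) for i in range(len(segments))]
print([round(m, 3) for m in means])  # similarity drops with position
```

Because the toy embedder front-loads its weights, the per-position mean similarity decreases monotonically: content placed later is systematically less reflected in the document embedding, i.e. less discoverable, which is exactly the bias this framework is designed to expose in real models.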