🤖 AI Summary
This work investigates whether bidirectional attention can overcome the inherent limitations that unidirectional attention imposes on the semantic representation capability of autoregressive large language models (LLMs). We propose the first progressive bidirectional attention integration method tailored to the Llama architecture, enhancing semantic understanding without compromising generative performance. Our approach employs multi-stage fine-tuning that jointly incorporates bidirectional attention with unsupervised and supervised contrastive learning. We systematically evaluate the resulting model on word embeddings, diagnostic probing tasks, and downstream understanding applications, including text similarity and classification. Experimental results demonstrate substantial improvements in semantic encoding capacity, and probing analyses confirm the acquisition of richer, more hierarchical semantic features. Moreover, consistent performance gains are observed across diverse comprehension-oriented benchmarks, underscoring the critical role of attention directionality in representation quality.
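The core architectural change described above, switching a causal LLM's attention to bidirectional attention, amounts to dropping the causal mask so every token can attend to every other token. The following is a minimal PyTorch sketch of this difference (not the paper's implementation; function and variable names are illustrative):

```python
import torch

def attention_weights(q: torch.Tensor, k: torch.Tensor, causal: bool) -> torch.Tensor:
    """Scaled dot-product attention weights with optional causal masking.

    With causal=True, each position attends only to itself and earlier
    positions (as in autoregressive LLMs such as Llama); with
    causal=False, every position attends to every other position
    (bidirectional attention).
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5  # (seq, seq) similarity scores
    if causal:
        seq = scores.shape[-1]
        # Mask out strictly-upper-triangular entries: future positions.
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1)

q = k = torch.randn(4, 8)
causal_w = attention_weights(q, k, causal=True)   # zero above the diagonal
bidir_w = attention_weights(q, k, causal=False)   # dense attention pattern
```

In practice, a "progressive" integration would relax this mask only in selected layers or training stages rather than all at once.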
📝 Abstract
Autoregressive Large Language Models (LLMs) demonstrate exceptional performance in language understanding and generation. However, their adoption for text embedding tasks, along with the analysis of their semantic representations through probing tasks, has lagged due to the constraints of the unidirectional attention mechanism.
This paper explores whether these constraints can be overcome by enabling bidirectional attention in LLMs. We trained different variants of the Llama architecture with additional training stages, progressively enabling bidirectional attention together with unsupervised and supervised contrastive learning.
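The unsupervised contrastive stage mentioned above is commonly realized with an in-batch InfoNCE objective (as in SimCSE-style training): two embeddings of the same sentence form a positive pair, and the other sentences in the batch serve as negatives. A minimal sketch under that assumption (the paper's exact loss and hyperparameters may differ; the temperature value here is illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive (InfoNCE) loss over sentence embeddings.

    z1, z2: (batch, dim) embeddings of the same sentences from two
    forward passes. For row i of z1, row i of z2 is the positive pair
    and every other row of z2 is an in-batch negative.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / temperature       # (batch, batch) cosine similarities
    labels = torch.arange(z1.shape[0])  # positives lie on the diagonal
    return F.cross_entropy(sim, labels)

z = torch.randn(8, 32)
# Simulate a second dropout-perturbed forward pass with small noise.
loss = info_nce_loss(z, z + 0.01 * torch.randn(8, 32))
```

The supervised stage would use the same objective with labeled positive pairs (e.g. premise/entailment pairs) instead of two noisy views of one sentence.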