🤖 AI Summary
Why do bidirectional language models outperform unidirectional ones on natural language understanding tasks? This paper offers a principled answer grounded in Information Bottleneck (IB) theory, showing that bidirectional models retain more mutual information and exhibit higher effective dimensionality than unidirectional models. To make this analysis tractable, the authors propose FlowNIB, a dynamic and scalable estimator of mutual information during training that sidesteps the computational intractability and fixed trade-off schedules of classical IB approaches. FlowNIB enables layer-wise analysis of how information is encoded and compressed over the course of training, and a generalized framework for measuring representational complexity shows that bidirectional representations are strictly more informative under mild conditions. Extensive experiments across multiple models and tasks validate these findings, delivering both a theoretical explanation for the effectiveness of bidirectional architectures and a practical tool for analyzing information flow in deep language models.
📝 Abstract
Bidirectional language models capture richer contextual information and outperform unidirectional models on natural language understanding tasks, yet the theoretical reasons behind this advantage remain unclear. In this work, we investigate this disparity through the lens of the Information Bottleneck (IB) principle, which formalizes a trade-off between compressing input information and preserving task-relevant content. We propose FlowNIB, a dynamic and scalable method for estimating mutual information during training that addresses key limitations of classical IB approaches, including computational intractability and fixed trade-off schedules. Theoretically, we show that bidirectional models retain more mutual information and exhibit higher effective dimensionality than unidirectional models. To support this, we present a generalized framework for measuring representational complexity and prove that bidirectional representations are strictly more informative under mild conditions. We further validate our findings through extensive experiments across multiple models and tasks using FlowNIB, revealing how information is encoded and compressed throughout training. Together, our work provides a principled explanation for the effectiveness of bidirectional architectures and introduces a practical tool for analyzing information flow in deep language models.
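The abstract does not spell out how effective dimensionality is measured. As a minimal illustrative sketch — assuming a common proxy, the participation ratio of the eigenvalue spectrum of the representation covariance, which may differ from the paper's exact definition — one can compare hidden-state matrices like this:

```python
import numpy as np

def effective_dimensionality(H: np.ndarray) -> float:
    """Participation ratio of the covariance spectrum of hidden states H
    (shape: n_samples x d). One common proxy for effective dimensionality;
    the paper's generalized framework may define it differently."""
    Hc = H - H.mean(axis=0, keepdims=True)     # center the representations
    cov = Hc.T @ Hc / (H.shape[0] - 1)         # d x d sample covariance
    eig = np.linalg.eigvalsh(cov)              # eigenvalues (ascending)
    eig = np.clip(eig, 0.0, None)              # guard tiny negative values
    return float(eig.sum() ** 2 / (eig ** 2).sum())

# A representation spread evenly across many directions scores higher
# than one concentrated along a few dominant directions.
rng = np.random.default_rng(0)
iso = rng.normal(size=(1000, 16))              # ~isotropic: near 16
aniso = iso * np.array([10.0] + [0.1] * 15)    # one axis dominates: near 1
print(effective_dimensionality(iso), effective_dimensionality(aniso))
```

Under this proxy, the paper's claim corresponds to bidirectional hidden states spreading task-relevant variance over more directions than unidirectional ones.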