A Compression Perspective on Simplicity Bias

📅 2026-03-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates why deep neural networks prefer learning simple functions, a phenomenon known as "simplicity bias." Framing supervised learning through the lens of the Minimum Description Length (MDL) principle, the authors formalize it as an optimal two-part lossless compression problem, thereby providing the first rigorous information-theoretic characterization of this bias. Both theoretical analysis and empirical results show that as dataset size grows, the model undergoes a phase transition in feature selection, shifting from reliance on shortcut features to more complex yet robust ones; data scale itself thus acts as an implicit regularizer on model complexity. On semi-synthetic benchmarks, the feature-learning trajectory of neural networks closely tracks that of an optimal compressor, revealing critical data regimes that either promote robustness or suppress unreliable complex features.
📝 Abstract
Deep neural networks exhibit a simplicity bias, a well-documented tendency to favor simple functions over complex ones. In this work, we cast new light on this phenomenon through the lens of the Minimum Description Length principle, formalizing supervised learning as a problem of optimal two-part lossless compression. Our theory explains how simplicity bias governs feature selection in neural networks through a fundamental trade-off between model complexity (the cost of describing the hypothesis) and predictive power (the cost of describing the data). Our framework predicts that as the amount of available training data increases, learners transition through qualitatively different features -- from simple spurious shortcuts to complex features -- only when the reduction in data encoding cost justifies the increased model complexity. Consequently, we identify distinct data regimes where increasing data promotes robustness by ruling out trivial shortcuts, and conversely, regimes where limiting data can act as a form of complexity-based regularization, preventing the learning of unreliable complex environmental cues. We validate our theory on a semi-synthetic benchmark showing that the feature selection of neural networks follows the same trajectory of solutions as optimal two-part compressors.
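The two-part trade-off described in the abstract can be sketched numerically: the total description length is the cost of the hypothesis plus the cost of the data given the hypothesis, and a complex feature only wins once its per-example savings amortize its larger model cost. The following is a minimal illustrative sketch, not the paper's actual construction; the model costs and error rates are hypothetical, and the data-encoding cost is approximated by n times the binary entropy of the error rate.

```python
# Illustrative two-part MDL comparison between a cheap "shortcut" feature
# and a costly but more accurate "complex" feature. All numbers are
# hypothetical, chosen only to exhibit the phase transition in n.
import math


def description_length(model_cost_bits, error_rate, n):
    """Two-part code length: model bits + approximate data-given-model bits.

    The data term uses n * H(error_rate), the idealized cost of encoding
    which of the n labels the hypothesis gets wrong.
    """
    def h(p):  # binary entropy in bits
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return model_cost_bits + n * h(error_rate)


# Hypothetical features: shortcut is short to describe but errs often;
# the complex feature is expensive to describe but rarely errs.
shortcut = dict(model_cost_bits=10, error_rate=0.10)
complex_feature = dict(model_cost_bits=5000, error_rate=0.01)

for n in [100, 1_000, 10_000, 100_000]:
    ls = description_length(n=n, **shortcut)
    lc = description_length(n=n, **complex_feature)
    winner = "shortcut" if ls < lc else "complex"
    print(f"n={n:>7}: shortcut={ls:9.0f} bits  complex={lc:9.0f} bits  -> {winner}")
```

With these toy numbers the shortcut gives the shorter total code at small n, and the complex feature takes over at large n, mirroring the data-driven transition from spurious shortcuts to robust features that the paper describes.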
Problem

Research questions and friction points this paper is trying to address.

simplicity bias
feature selection
minimum description length
neural networks
data regimes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Minimum Description Length
simplicity bias
two-part compression
feature selection
data regime