Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Whether neural collapse and loss landscape flatness are causal prerequisites for generalization, or merely epiphenomenal byproducts, remains unresolved. Method: Leveraging the grokking phenomenon to decouple the memorization and generalization phases, we combine loss landscape analysis, representation clustering metrics, controlled regularization experiments, and theoretical derivations. Contribution/Results: We establish that flat minima constitute a *potentially necessary condition* for generalization: explicitly driving training away from flat regions significantly delays the onset of generalization. In contrast, neural collapse exhibits only statistical co-occurrence with generalization, not causal influence: actively suppressing or enhancing it leaves final test performance unchanged. This work provides the first joint empirical and theoretical clarification of their causal hierarchy, showing that their co-occurrence arises from geometric coupling in the optimization dynamics, not from functional dependency. Our findings yield a novel geometric framework for understanding generalization mechanisms in deep learning.

📝 Abstract
Neural collapse, i.e., the emergence of highly symmetric, class-wise clustered representations, is frequently observed in deep networks and is often assumed to reflect or enable generalization. In parallel, flatness of the loss landscape has been theoretically and empirically linked to generalization. Yet, the causal role of either phenomenon remains unclear: Are they prerequisites for generalization, or merely by-products of training dynamics? We disentangle these questions using grokking, a training regime in which memorization precedes generalization and which therefore lets us temporally separate generalization from the training dynamics. We find that while both neural collapse and relative flatness emerge near the onset of generalization, only flatness consistently predicts it. Models encouraged to collapse or prevented from collapsing generalize equally well, whereas models regularized away from flat solutions exhibit delayed generalization. Furthermore, we show theoretically that neural collapse implies relative flatness under classical assumptions, explaining their empirical co-occurrence. Our results support the view that relative flatness is a potentially necessary and more fundamental property for generalization, and demonstrate how grokking can serve as a powerful probe for isolating its geometric underpinnings.
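The "class-wise clustered representations" of neural collapse are typically quantified by an NC1-style ratio of within-class to between-class variance of penultimate-layer features. The sketch below is a minimal illustration of such a metric on synthetic features; the function name and the synthetic data are assumptions for demonstration, not the paper's exact measurement.

```python
import numpy as np

def nc1_within_between(features, labels):
    """NC1-style collapse metric: trace of the within-class covariance
    divided by trace of the between-class covariance.
    Values near 0 indicate strong collapse (tight class clusters)."""
    global_mean = features.mean(axis=0)
    n = len(features)
    sw, sb = 0.0, 0.0
    for c in np.unique(labels):
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)
        sw += ((fc - mu_c) ** 2).sum() / n              # within-class scatter
        sb += len(fc) / n * ((mu_c - global_mean) ** 2).sum()  # between-class
    return sw / sb

# Synthetic penultimate-layer features: 3 classes, 8 dimensions.
rng = np.random.default_rng(0)
means = rng.normal(size=(3, 8))
labels = np.repeat(np.arange(3), 50)
collapsed = means[labels] + 1e-3 * rng.normal(size=(150, 8))  # near-collapsed
diffuse = means[labels] + 1.0 * rng.normal(size=(150, 8))     # spread out
print(nc1_within_between(collapsed, labels) < nc1_within_between(diffuse, labels))  # True
```

Tracking this ratio over training steps is one way to see whether collapse emerges before, at, or after the grokking transition.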
Problem

Research questions and friction points this paper is trying to address.

Investigating whether neural collapse or flatness causes generalization in deep networks
Using grokking training regime to temporally separate generalization from training dynamics
Determining if flatness is more fundamental than neural collapse for generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using grokking to separate generalization dynamics
Demonstrating flatness is necessary for generalization
Showing neural collapse implies flatness theoretically
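The flatness side of the argument can be probed with a simple perturbation test: average the loss increase over random parameter perturbations of fixed norm. The sketch below is a crude, scale-sensitive stand-in for the paper's relative flatness measure, using two toy quadratic losses of different curvature; all names are illustrative assumptions.

```python
import numpy as np

def sharpness_proxy(loss_fn, params, radius=0.05, n_samples=20, seed=0):
    """Mean loss increase under random perturbations of norm `radius`.
    Flat minima yield a small increase; sharp minima a large one."""
    rng = np.random.default_rng(seed)
    base = loss_fn(params)
    increases = []
    for _ in range(n_samples):
        delta = rng.normal(size=params.shape)
        delta *= radius / np.linalg.norm(delta)  # project onto the sphere
        increases.append(loss_fn(params + delta) - base)
    return float(np.mean(increases))

# Two quadratic "minima" at w = 0 with very different curvature.
flat = lambda w: 0.5 * float(w @ w)    # low curvature
sharp = lambda w: 50.0 * float(w @ w)  # high curvature
w0 = np.zeros(10)
print(sharpness_proxy(flat, w0) < sharpness_proxy(sharp, w0))  # True
```

Logging such a proxy alongside train/test accuracy during grokking is the kind of measurement that lets one check whether entry into a flat region precedes the jump in test performance.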