Representation Invariance and Allocation: When Subgroup Balance Matters

šŸ“… 2025-12-10
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
Does imbalanced subgroup distribution in training data necessarily impair model generalization across subgroups? Recent counterintuitive observations, such as stable subgroup performance despite severe underrepresentation, challenge the default assumption that balanced data is optimal. Method: We propose the latent-space subgroup separation hypothesis and test it empirically across four vision and language models (ViT, CLIP, BLIP, and LLaVA) through systematic data ablation experiments, latent-space geometric analysis, and theoretical modeling. Contribution: We establish that the degree of subgroup separability in the pre-trained model's latent space is the key mechanism governing its sensitivity to training data imbalance, and that this separability quantitatively predicts subgroup-wise performance robustness. It also provides principled guidance for fair fine-tuning, informing both data acquisition priorities and balancing strategies, thereby connecting latent geometry to practical fairness interventions.
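As a concrete illustration of how latent subgroup separability might be quantified, the sketch below trains a linear probe to predict subgroup membership from frozen pre-trained embeddings and uses its held-out balanced accuracy as a separability score. This is a minimal sketch under assumptions: the paper's exact separability metric is not specified here, and the `subgroup_separability` helper and the synthetic embeddings are hypothetical stand-ins for real encoder outputs.

```python
# Minimal sketch (assumption): quantify subgroup separability in a frozen encoder's
# latent space via a linear probe. The probe's held-out balanced accuracy serves as
# a proxy for "separability"; the paper's exact metric may differ.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score


def subgroup_separability(embeddings: np.ndarray, subgroup_labels: np.ndarray,
                          seed: int = 0) -> float:
    """Held-out balanced accuracy of a linear probe predicting subgroup membership
    from frozen pre-trained embeddings (0.5 ~ inseparable, 1.0 ~ fully separable)."""
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, subgroup_labels, test_size=0.3,
        stratify=subgroup_labels, random_state=seed)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    return balanced_accuracy_score(y_test, probe.predict(X_test))


# Hypothetical usage with random features standing in for real encoder embeddings.
rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 512))      # e.g. image embeddings from a frozen encoder
g = rng.integers(0, 2, size=1000)     # binary subgroup attribute
print(f"separability (probe balanced accuracy): {subgroup_separability(Z, g):.3f}")
```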

šŸ“ Abstract
Unequal representation of demographic groups in training data poses challenges to model generalisation across populations. Standard practice assumes that balancing subgroup representation optimises performance. However, recent empirical results contradict this assumption: in some cases, imbalanced data distributions actually improve subgroup performance, while in others, subgroup performance remains unaffected by the absence of an entire subgroup during training. We conduct a systematic study of subgroup allocation across four vision and language models, varying training data composition to characterise the sensitivity of subgroup performance to data balance. We propose the latent separation hypothesis, which states that a partially fine-tuned model's dependence on subgroup representation is determined by the degree of separation between subgroups in the latent space of the pre-trained model. We formalise this hypothesis, provide theoretical analysis, and validate it empirically. Finally, we present a practical application to foundation model fine-tuning, demonstrating that quantitative analysis of latent subgroup separation can inform data collection and balancing decisions.
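The data-composition ablation described in the abstract can be sketched as follows: resample a fixed-size training set at different subgroup proportions, fit a linear head on frozen embeddings as a stand-in for partial fine-tuning, and record per-subgroup test accuracy. All data, sizes, and the `sample_training_set` helper below are illustrative assumptions, not the paper's actual datasets or experimental setup.

```python
# Minimal sketch (assumption): vary the subgroup composition of the training set and
# measure per-subgroup test accuracy of a linear head trained on frozen embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic frozen-encoder embeddings Z, task labels y, and subgroup attribute g.
n, d = 4000, 64
Z = rng.normal(size=(n, d))
g = rng.integers(0, 2, size=n)                              # subgroup membership
y = (Z[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)    # task label

train_idx, test_idx = np.arange(0, 3000), np.arange(3000, n)


def sample_training_set(idx, groups, frac_group0, size, rng):
    """Subsample `size` training indices with a target fraction of subgroup 0."""
    n0 = int(round(frac_group0 * size))
    idx0 = rng.choice(idx[groups[idx] == 0], size=n0, replace=False)
    idx1 = rng.choice(idx[groups[idx] == 1], size=size - n0, replace=False)
    return np.concatenate([idx0, idx1])


for frac in [0.0, 0.25, 0.5, 0.75, 1.0]:      # subgroup-0 share of the training data
    tr = sample_training_set(train_idx, g, frac, size=1000, rng=rng)
    head = LogisticRegression(max_iter=1000).fit(Z[tr], y[tr])
    for grp in (0, 1):
        te = test_idx[g[test_idx] == grp]
        acc = accuracy_score(y[te], head.predict(Z[te]))
        print(f"train frac(group 0)={frac:.2f}  test acc(group {grp})={acc:.3f}")
```

Under the hypothesis, per-subgroup accuracy should be largely insensitive to the training fraction when subgroups are weakly separated in the latent space, and sensitive when they are strongly separated.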
Problem

Research questions and friction points this paper is trying to address.

Investigates subgroup performance sensitivity to training data balance.
Proposes latent separation hypothesis for representation invariance.
Applies analysis to inform data collection for model fine-tuning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes latent separation hypothesis for subgroup performance.
Analyzes subgroup allocation across vision and language models.
Uses latent space analysis to guide data balancing decisions.
Authors
Anissa Alloula, University of Oxford
Charles Jones, Imperial College London
Zuzanna Wakefield-Skorniewska, University of Oxford
Francesco Quinzan, University of Oxford
Bartłomiej W. Papież, University of Oxford