When Labels Have Structure: Improving Image Classification with Hierarchy-Aware Cross-Entropy

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Standard cross-entropy loss in image classification disregards the semantic hierarchy among classes, penalizing all misclassifications equally regardless of their semantic proximity. This work proposes Hierarchy-Aware Cross-Entropy (HACE), which incorporates the class hierarchy directly into the loss function through prediction aggregation and ancestral label smoothing, making optimization sensitive to semantic distance without modifying the model architecture. As a drop-in replacement for standard cross-entropy, HACE achieves a mean accuracy gain of 4.66% over standard cross-entropy when trained end-to-end on CIFAR-100, FGVC Aircraft, and NABirds, improving accuracy in 15 of 18 architecture-dataset pairs. In linear probing on frozen DINOv2-Large features, HACE outperforms the next best baseline by a mean of 2.18%.

📝 Abstract

Standard cross-entropy is the default classification loss across virtually all of machine learning, yet it treats all misclassifications equally, ignoring the semantic distances that a class hierarchy encodes. We propose Hierarchy-Aware Cross-Entropy (HACE), a drop-in replacement for standard cross-entropy that incorporates a known class hierarchy directly into the loss. HACE combines two components: prediction aggregation, which propagates the model's probability mass upward through the class hierarchy to ensure that parent nodes accumulate the confidence of their children; and ancestral label smoothing, which distributes the ground-truth signal along the path from the true class to the root. We evaluate HACE on CIFAR-100, FGVC Aircraft, and NABirds in two regimes: end-to-end training across six architectures spanning convolutional and attention-based designs, and linear probing on frozen DINOv2-Large features. In end-to-end training, HACE improves accuracy over standard cross-entropy in 15 out of 18 architecture-dataset pairs, with a mean gain of 4.66%. In linear probing on frozen DINOv2-Large features, HACE outperforms all competing methods on all three datasets, with a mean improvement of 2.18% over the next best baseline.
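
The abstract names the two components but does not give their formulas, so the following is only a minimal sketch of how they might fit together, not the authors' implementation. The toy hierarchy, the function names, the geometric decay weight `alpha`, and the choice to sum weighted log-probabilities along the path are all assumptions made for illustration.

```python
import math

# Hypothetical toy hierarchy (child -> parent); not taken from the paper.
PARENT = {
    "cat": "animal", "dog": "animal",
    "car": "vehicle", "truck": "vehicle",
    "animal": "root", "vehicle": "root",
}
LEAVES = ["cat", "dog", "car", "truck"]


def softmax(logits):
    """Numerically stable softmax over the leaf-class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def path_to_root(node):
    """Return the node followed by its ancestors, ending at the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def aggregate(leaf_probs):
    """Prediction aggregation: each node accumulates the probability
    mass of every leaf beneath it (a leaf keeps its own mass)."""
    node_probs = {}
    for leaf, p in zip(LEAVES, leaf_probs):
        for node in path_to_root(leaf):
            node_probs[node] = node_probs.get(node, 0.0) + p
    return node_probs


def ancestral_targets(true_leaf, alpha=0.5):
    """Ancestral label smoothing (assumed form): spread the target over
    the path from the true class to the root with geometric decay."""
    path = path_to_root(true_leaf)
    weights = [alpha ** depth for depth in range(len(path))]
    total = sum(weights)
    return {node: w / total for node, w in zip(path, weights)}


def hierarchy_aware_ce(logits, true_leaf, alpha=0.5):
    """Weighted negative log-likelihood over the nodes on the true path."""
    node_probs = aggregate(softmax(logits))
    targets = ancestral_targets(true_leaf, alpha)
    return -sum(w * math.log(node_probs[n]) for n, w in targets.items())


# True class is "cat"; compare a sibling mistake with a distant one.
near_miss = hierarchy_aware_ce([0.0, 3.0, 0.0, 0.0], "cat")  # predicts "dog"
far_miss = hierarchy_aware_ce([0.0, 0.0, 0.0, 3.0], "cat")   # predicts "truck"
print(near_miss, far_miss)
```

In this sketch both mistakes assign the same probability to "cat", so standard cross-entropy would score them identically. The aggregated "animal" node receives far more mass when the model confuses a cat with a dog than with a truck, which is what makes the sibling error cheaper.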

Problem

Research questions and friction points this paper is trying to address.

image classification

class hierarchy

cross-entropy loss

semantic distance

hierarchical labels

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchy-Aware Cross-Entropy

class hierarchy

prediction aggregation