OSDTW: Optimal Shared Depth and Task Weighting for Long-Tailed Recognition

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the head-tail performance trade-off and training instability in long-tailed recognition by decomposing the single-label classification task into two subtasks, a head task and a tail task, using a shared encoder with task-specific decoders. The authors show that the KL-divergence-based generalization error decomposes into task-wise terms (up to an additive constant) and derive computable bias-variance proxies that select the depth of the shared network and the task weights, enabling efficient hyperparameter search. The approach combines factorized modeling, Fisher information matrix analysis, and a three-stage training strategy (independent task training, weighted joint training, and branch assembly). Experiments on standard long-tailed benchmarks show improvements over strong baselines.
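To give a sense of why a factorized model can make a KL-based error split into task-wise terms, the block below states the standard chain rule for KL divergence. It is a textbook identity, not the paper's exact result. The group variable g (head "h" or tail "t", fully determined by the label y), the true distribution p, and the model q are notation assumed here for illustration.

```latex
% Assumed factorization: p(y | x) = p(g | x) p(y | g, x), and likewise for q,
% where g \in {h, t} is the group that contains label y.
\mathrm{KL}\big(p(y \mid x) \,\|\, q(y \mid x)\big)
  = \mathrm{KL}\big(p(g \mid x) \,\|\, q(g \mid x)\big)
  + \sum_{g \in \{\mathrm{h}, \mathrm{t}\}} p(g \mid x)\,
    \mathrm{KL}\big(p(y \mid g, x) \,\|\, q(y \mid g, x)\big)
```

The second term is a sum of one contribution per group, which is the kind of task-wise structure the summary refers to; how the paper handles the group term and the additive constant is not specified here.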

📝 Abstract

Long-tailed recognition suffers from a persistent head-tail trade-off: improving tail performance often degrades head accuracy and can increase training instability. Despite strong empirical results from re-weighting, decoupled training, and multi-expert methods, key design choices about representation sharing between head and tail classes and supervision weighting across class groups remain largely heuristic. In this work, we propose OSDTW, a principled task-decomposition framework that partitions the original single-label recognition problem into a head task and a tail task, implemented with a shared encoder and task-specific decoders. To handle the mutual exclusivity and statistical dependence between the two label groups, we introduce a factorized model and show that the resulting Kullback-Leibler divergence-based generalization error can be written as the sum of task-wise terms up to an additive constant, yielding a well-defined task-wise objective. We further develop a three-stage training pipeline: independent task training to estimate task-wise optima and the Fisher information matrix, weighted joint training to learn a shared encoder, and branch assembly to construct the final decoupled model. Under a block-diagonal Fisher approximation, we derive a computable second-order expansion of the expected generalization error, decomposing it into encoder variance, encoder bias, and decoder variance. This bias-variance decomposition provides a computable proxy to select the shared depth and task weights, enabling efficient hyper-parameter search. Experiments on standard long-tailed benchmarks demonstrate the effectiveness of the proposed approach over strong baselines.
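The shared-encoder, two-decoder layout and the weighted joint objective described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the gating head `w_gate`, the linear decoders, the function names, and the per-group loss weights are all hypothetical, and the abstract does not say how the factorized model is parameterized.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def factorized_probs(features, w_gate, w_head, w_tail):
    """Assumed factorization: p(y | x) = p(group | x) * p(y | group, x).

    features: (n, d) shared-encoder outputs.
    w_gate:   (d, 2) hypothetical head-vs-tail gate.
    w_head:   (d, n_head) hypothetical head decoder.
    w_tail:   (d, n_tail) hypothetical tail decoder.
    Returns (n, n_head + n_tail) probabilities, head classes first.
    """
    gate = softmax(features @ w_gate)      # p(group | x)
    p_head = softmax(features @ w_head)    # p(y | head, x)
    p_tail = softmax(features @ w_tail)    # p(y | tail, x)
    return np.concatenate([gate[:, :1] * p_head, gate[:, 1:] * p_tail], axis=1)

def weighted_joint_loss(probs, labels, n_head, head_weight, tail_weight):
    """Negative log-likelihood with one (assumed) weight per label group."""
    nll = -np.log(probs[np.arange(len(labels)), labels])
    weights = np.where(labels < n_head, head_weight, tail_weight)
    return float((weights * nll).mean())

# Toy usage: 3 head classes, 6 tail classes, 4-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 4))
probs = factorized_probs(
    feats,
    rng.normal(size=(4, 2)),
    rng.normal(size=(4, 3)),
    rng.normal(size=(4, 6)),
)
labels = np.array([0, 2, 3, 8, 5])
loss = weighted_joint_loss(probs, labels, n_head=3, head_weight=1.0, tail_weight=2.0)
```

Because the two label groups are mutually exclusive, multiplying each decoder's output by its gate probability keeps the combined vector a valid distribution over all classes. In the paper the task weights and shared depth are chosen via the bias-variance proxy; here they are simply fixed numbers.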

Problem

Research questions and friction points this paper is trying to address.

long-tailed recognition

head-tail trade-off

representation sharing

supervision weighting

generalization error

Innovation

Methods, ideas, or system contributions that make the work stand out.

task decomposition

bias-variance trade-off

Fisher information