🤖 AI Summary
To address the challenge of partially missing labels in multi-task dense prediction, this paper proposes Hierarchical Task Tokenization (HTT), a framework that automatically discovers consistent pixel-level supervision signals across tasks. Methodologically, HTT introduces the first mechanism to jointly model global and fine-grained spatial task tokens: learnable hierarchical tokens guide cross-task feature interaction and collaboratively generate multi-scale pseudo-labels, and the whole framework is optimized end-to-end via distillation-based supervision. The core contribution is a task-token-driven paradigm for discovering supervision at multiple levels of granularity, which removes the reliance on fully annotated data. Extensive experiments demonstrate state-of-the-art performance on NYUD-v2, Cityscapes, and PASCAL Context, with joint multi-task improvements of 3.2–5.7 mIoU over prior methods.
📝 Abstract
In recent years, jointly learning multiple dense prediction tasks from partially annotated data has emerged as an important research area. Previous works primarily focus on constructing cross-task consistency or conducting adversarial training to regularize cross-task predictions. These approaches achieve promising performance improvements but still lack direct pixel-wise supervision for multi-task dense prediction. To tackle this challenge, we propose a novel approach that optimizes a set of learnable hierarchical task tokens, including global and fine-grained ones, to discover consistent pixel-wise supervision signals at both the feature and prediction levels. Specifically, the global task tokens are designed for effective cross-task feature interactions in a global context. Then, a group of fine-grained task-specific spatial tokens for each task is learned from the corresponding global task tokens and embedded to interact densely with each task-specific feature map. The learned global and fine-grained task tokens are further used to discover pseudo task-specific dense labels at different levels of granularity, which can directly supervise the learning of the multi-task dense prediction framework. Extensive experimental results on the challenging NYUD-v2, Cityscapes, and PASCAL Context datasets demonstrate significant improvements over existing state-of-the-art methods for partially annotated multi-task dense prediction.
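The idea of using task tokens to produce pseudo-labels for unannotated pixels can be illustrated with a small sketch. This is not the paper's implementation: the function names (`token_affinity`, `pseudo_labels`, `supervision_targets`), the dot-product affinity, and the confidence threshold are all assumptions chosen for illustration. The sketch treats each token as a prototype vector, scores every pixel feature against the tokens, and falls back to the resulting pseudo-label only where no ground-truth annotation exists.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def token_affinity(features, tokens):
    """Score each pixel feature against each token (hypothetical dot-product affinity).

    features: list of per-pixel feature vectors for one task.
    tokens:   list of token vectors, one per candidate class.
    Returns a per-pixel probability distribution over tokens.
    """
    return [softmax([dot(f, t) for t in tokens]) for f in features]


def pseudo_labels(affinity, threshold=0.6):
    """Keep the best-scoring token index per pixel, or None if confidence is low.

    The threshold is an assumed hyperparameter, not a value from the paper.
    """
    labels = []
    for probs in affinity:
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append(best if probs[best] >= threshold else None)
    return labels


def supervision_targets(ground_truth, pseudo):
    """Prefer a real annotation; use the pseudo-label only where it is missing."""
    return [gt if gt is not None else ps for gt, ps in zip(ground_truth, pseudo)]


if __name__ == "__main__":
    features = [[2.0, 0.0], [0.0, 2.0], [0.1, 0.1]]  # three toy pixels
    tokens = [[1.0, 0.0], [0.0, 1.0]]                # two toy class tokens
    pseudo = pseudo_labels(token_affinity(features, tokens))
    # Pixel 1 is annotated; pixels 0 and 2 are not.
    print(supervision_targets([None, 0, None], pseudo))
```

In this toy setup the ambiguous third pixel stays unsupervised because neither token is confident about it, which mirrors the general motivation for supervising only where a consistent signal is found. The method described in the abstract learns its tokens end-to-end and operates at multiple levels of granularity, which this sketch does not attempt to reproduce.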