🤖 AI Summary
Existing token-based multi-task learning frameworks (e.g., TokenVerse) require every training sample to be annotated for all tasks, which severely limits their applicability to partially labeled data and their scalability. To address this, we propose a dynamic task activation mechanism: learnable task vectors are introduced into the acoustic embedding space of the XLSR-Transducer model, enabling sample-wise activation of task-specific heads based solely on the labels each sample actually has. This is the first approach to support subset-label training: a sample contributes to multi-task optimization without requiring annotations for all tasks. Experiments on automatic speech recognition (ASR) and language identification show performance on par with or better than TokenVerse, while significantly improving modeling efficiency and framework flexibility on incompletely annotated data.
📝 Abstract
Token-based multitasking frameworks like TokenVerse require all training utterances to have labels for all tasks, hindering their ability to leverage partially annotated datasets and scale effectively. We propose TokenVerse++, which introduces learnable vectors in the acoustic embedding space of the XLSR-Transducer ASR model for dynamic task activation. This core mechanism enables training with utterances labeled for only a subset of tasks, a key advantage over TokenVerse. We demonstrate this by successfully integrating a dataset with partial labels, specifically for ASR and an additional task, language identification, improving overall performance. TokenVerse++ achieves results on par with or exceeding TokenVerse across multiple tasks, establishing it as a more practical multitask alternative without sacrificing ASR performance.
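To make the core idea concrete, here is a minimal, illustrative sketch of sample-wise task activation with learnable task vectors, assuming subset-label training as described above. This is not the authors' implementation: the function name `activate_tasks`, the toy dimensions, and the additive way the task vector conditions the embedding are all assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_TASKS = 2   # e.g. ASR and language identification (assumed)
EMBED_DIM = 8   # toy acoustic embedding size (assumed)

# One learnable vector per task (randomly initialized here; in
# training these would be updated by backpropagation).
task_vectors = rng.normal(size=(NUM_TASKS, EMBED_DIM))

def activate_tasks(acoustic_emb, label_mask):
    """Condition the acoustic embeddings on each *labeled* task only.

    acoustic_emb: (T, EMBED_DIM) frame-level embeddings
    label_mask:   length-NUM_TASKS booleans; True if this sample
                  carries labels for that task (subset-label training)
    Returns a dict mapping each active task index to its
    task-conditioned embeddings; unlabeled tasks are skipped, so
    they contribute nothing to the multi-task loss.
    """
    outputs = {}
    for t, has_label in enumerate(label_mask):
        if has_label:
            # Hypothetical conditioning: add the task vector to
            # every frame embedding before the task-specific head.
            outputs[t] = acoustic_emb + task_vectors[t]
    return outputs

# A sample labeled only for task 0 (e.g. ASR transcript, no LID label).
emb = rng.normal(size=(5, EMBED_DIM))
active = activate_tasks(emb, [True, False])
print(sorted(active))  # → [0]: only the labeled task is activated
```

The key point the sketch captures is that the label mask decides, per sample, which task heads receive a conditioned embedding, so partially annotated utterances still contribute to training for the tasks they do have labels for.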