AI Summary
This work proposes a two-stage approach for translating speech in Wardaman, an endangered Australian Aboriginal language, into English in an extremely low-resource setting with only six hours of labeled speech. Because end-to-end models are not viable with so little data, the method first transcribes speech into phonemic sequences and then translates those sequences into English. For transcription, it uses cross-lingual transfer, initializing the Wardaman language token from Sundanese, a language with similar phonemes. For translation, it supplies a large language model with lexical knowledge from an expert-curated Wardaman-English dictionary. The proposed system outperforms multiple larger open-source and proprietary models, establishing a strong baseline for Wardaman-English speech translation and demonstrating the value of staged modeling combined with external knowledge in ultra-low-resource speech translation.
Abstract
This paper introduces WARDEN, an early language model system capable of transcribing Wardaman, an endangered Australian Indigenous language, and translating it into English. The central challenge is the lack of large-scale training data: we have only 6 hours of annotated audio. Therefore, while it is common practice to train a single model for transcription and translation on large datasets (for example, English to French), this practice is not viable for Wardaman to English. To tackle the low-resource challenge, we design WARDEN with separate transcription and translation models: WARDEN first converts a Wardaman audio input into a phonemic transcription, and then translates the transcription into English. Further, we propose two techniques to enhance performance. For transcription, we initialize the Wardaman token from Sundanese, a language that shares similar phonemes with Wardaman, to accelerate fine-tuning of the transcription model. For translation, we compile a Wardaman-English dictionary from expert annotations and provide this domain-specific knowledge to a large language model (LLM), which reasons over it and decides the final output. We empirically demonstrate that this two-stage design works better than data-hungry unified approaches in extremely low-data settings. Using a mere 6 hours of annotated data, WARDEN outperforms larger open-source and proprietary models and establishes a strong baseline. Data and code are available.
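The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. This is not the WARDEN implementation: the function names (`lookup_glosses`, `build_prompt`, `translate`), the prompt wording, and the whitespace-based word matching are all assumptions made for illustration, and the transcription model and LLM are passed in as plain callables so that no particular library API is implied. The dictionary entries in the usage example are placeholders, not real Wardaman words.

```python
from typing import Callable, Dict, List, Tuple


def lookup_glosses(phonemic: str, dictionary: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (word, gloss) pairs for transcription words found in the dictionary.

    Assumption: words are whitespace-separated and matched exactly.
    """
    return [(w, dictionary[w]) for w in phonemic.split() if w in dictionary]


def build_prompt(phonemic: str, glosses: List[Tuple[str, str]]) -> str:
    """Assemble a hypothetical LLM prompt from the transcription and dictionary hits."""
    lines = [
        "Translate the following Wardaman phonemic transcription into English.",
        f"Transcription: {phonemic}",
        "Dictionary entries:",
    ]
    lines += [f"- {w}: {g}" for w, g in glosses] or ["- (none found)"]
    lines.append("English translation:")
    return "\n".join(lines)


def translate(
    audio: bytes,
    transcribe: Callable[[bytes], str],  # stage 1: audio -> phonemic transcription
    llm: Callable[[str], str],           # stage 2: prompt -> English text
    dictionary: Dict[str, str],
) -> str:
    """Run stage 1, look up dictionary glosses, then run stage 2."""
    phonemic = transcribe(audio)
    prompt = build_prompt(phonemic, lookup_glosses(phonemic, dictionary))
    return llm(prompt)


if __name__ == "__main__":
    # Placeholder entries and stub models, purely for demonstration.
    toy_dictionary = {"aaa": "placeholder gloss 1", "bbb": "placeholder gloss 2"}
    fake_transcribe = lambda audio: "aaa ccc bbb"
    echo_llm = lambda prompt: prompt  # returns the prompt so it can be inspected
    print(translate(b"", fake_transcribe, echo_llm, toy_dictionary))
```

Keeping the two stages behind separate callables mirrors the separation the abstract argues for: the transcription model can be fine-tuned on the small audio set, while the translation stage relies on the dictionary and a general-purpose LLM rather than on paired training data.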