🤖 AI Summary
To address the lack of toxicity detection systems for low-resource Indic languages, this paper introduces UnityAI-Guard, the first open-source framework to systematically support toxicity detection across seven low-resource Brahmic-script Indian languages (e.g., Marathi, Gujarati). Methodologically, it combines fine-tuned multilingual pre-trained language models, script-aware feature encoding, data augmentation, and robustness optimization, and contributes a cross-lingual dataset of 888k training instances and 35k high-quality, human-verified test instances. Its key contribution is the first unified modeling and standardized evaluation across Brahmic scripts. Experiments demonstrate an average F1-score of 84.23% across all seven languages, significantly outperforming existing baselines. To foster equitable content-safety governance, the authors publicly release both the trained models and a RESTful API.
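As a rough illustration of the modeling setup, the minimal sketch below shows how binary toxicity inference with a fine-tuned multilingual encoder might look using the Hugging Face `transformers` library. The checkpoint name and label strings are placeholders, not the paper's released identifiers, which are not specified in this summary.

```python
# Hypothetical sketch: binary toxicity inference with a fine-tuned
# multilingual classifier. The model name below is a placeholder;
# consult the authors' release for the actual checkpoint identifiers.
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="unity-ai-guard/marathi-toxicity",  # placeholder checkpoint name
)

texts = [
    "ही एक सामान्य टिप्पणी आहे.",  # a benign Marathi comment
]
for text in texts:
    result = clf(text)[0]
    # Binary output, e.g. {"label": "toxic" or "non-toxic", "score": ...};
    # the exact label names depend on how the released models were trained.
    print(result["label"], round(result["score"], 3))
```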
📝 Abstract
This work introduces UnityAI-Guard, a framework for binary toxicity classification targeting low-resource Indian languages. While existing systems predominantly cater to high-resource languages, UnityAI-Guard addresses this critical gap with state-of-the-art models for identifying toxic content across diverse Brahmic/Indic scripts. Our approach achieves an average F1-score of 84.23% across seven languages, leveraging a dataset of 888k training instances and 35k manually verified test instances. Beyond advancing multilingual content moderation for linguistically diverse regions, UnityAI-Guard provides public API access to foster broader adoption and application.
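Since the abstract advertises public API access, a short sketch of what a client call might look like follows. The endpoint URL, payload fields, and response schema here are all assumptions for illustration; the actual interface should be taken from the project's API documentation.

```python
# Hypothetical sketch of querying the public RESTful API. The URL,
# request fields, and response format are assumed, not documented here.
import requests

API_URL = "https://api.example.com/unityai-guard/v1/classify"  # placeholder

payload = {"text": "ઉદાહરણ ટિપ્પણી", "language": "gu"}  # a Gujarati example
resp = requests.post(API_URL, json=payload, timeout=10)
resp.raise_for_status()

result = resp.json()
print(result)  # e.g. {"label": "non-toxic", "score": 0.97} (assumed schema)
```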