🤖 AI Summary
Argument mining for low-resource languages such as Persian faces severe data scarcity, hindering effective model training. Method: This paper proposes a lightweight cross-lingual hybrid training strategy that integrates zero-shot transfer, large language model (LLM)-based synthetic data augmentation, and cross-lingual joint fine-tuning. It leverages abundant English resources while requiring only a small amount of manually translated Persian data, avoiding reliance on large-scale parametric augmentation or fully bilingual annotated corpora. Contribution/Results: The approach significantly reduces annotation cost and achieves 74.8% F1 on Persian argument mining, outperforming existing LLM-augmented baselines. It demonstrates strong efficacy and generalizability in low-resource settings, establishing a scalable paradigm for structured argumentation analysis in resource-constrained languages.
📝 Abstract
Argument mining is a subfield of natural language processing that identifies and extracts argument components, such as premises and conclusions, within a text and recognizes the relations between them. It reveals the logical structure of texts for use in tasks such as knowledge extraction. This paper applies a cross-lingual approach to argument mining for low-resource languages by constructing three training scenarios. We examine the models on English, as a high-resource language, and Persian, as a low-resource language. To this end, we evaluate the models on the English Microtext corpus (Peldszus and Stede, 2015) and its parallel Persian translation. The learning scenarios are as follows: (i) zero-shot transfer, where the model is trained solely on the English data; (ii) English-only training enhanced by synthetic examples generated by Large Language Models (LLMs); and (iii) a cross-lingual model that combines the original English data with manually translated Persian sentences. The zero-shot transfer model attains F1 scores of 50.2% on the English test set and 50.7% on the Persian test set. The LLM-based augmentation model improves performance up to 59.2% on English and 69.3% on Persian. The cross-lingual model, trained on both languages but evaluated solely on the Persian test set, surpasses the LLM-based variant by achieving an F1 of 74.8%. Results indicate that a lightweight cross-lingual blend can considerably outperform more resource-intensive augmentation pipelines, offering a practical pathway for argument mining to overcome data scarcity in low-resource languages.
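The three training scenarios differ only in how the training set is assembled. A minimal sketch of that assembly step is shown below; the function name, data format, and example sentences are hypothetical illustrations, not details from the paper.

```python
# Hypothetical sketch of the three training-data scenarios.
# Each example is a (sentence, label) pair; the labels and data are illustrative.

def build_training_set(scenario, english, llm_synthetic=None, persian=None):
    """Assemble the training examples for one of the three scenarios."""
    if scenario == "zero_shot":
        # (i) train on English only; the model is later evaluated on Persian
        return list(english)
    if scenario == "llm_augmented":
        # (ii) English data plus LLM-generated synthetic examples
        return list(english) + list(llm_synthetic or [])
    if scenario == "cross_lingual":
        # (iii) English data plus manually translated Persian sentences
        return list(english) + list(persian or [])
    raise ValueError(f"unknown scenario: {scenario}")

# Illustrative toy data (not from the Microtext corpus).
english = [("We should adopt the plan.", "claim"),
           ("It lowers costs.", "premise")]
persian = [("این طرح هزینه‌ها را کاهش می‌دهد.", "premise")]

train = build_training_set("cross_lingual", english, persian=persian)
```

The point of the sketch is that the cross-lingual variant needs only a small translated Persian set on top of the existing English data, rather than a fully bilingual annotated corpus.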