A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the cross-lingual zero-shot generalization capabilities of mainstream large language models (LLMs) on Sanskrit, Ancient Greek, and Latin across three tasks: named entity recognition, machine translation into English, and Sanskrit factoid question answering. Method: We conduct the first zero-shot, non-fine-tuned evaluation of models including GPT-4o and Llama-3.1 on multiple classical languages; introduce the first Sanskrit factoid QA benchmark dataset; and assess retrieval-augmented generation (RAG) for enhancing zero-shot performance. Contribution/Results: Model scale emerges as the primary determinant of zero-shot transferability, with smaller models exhibiting substantial performance degradation. RAG significantly improves zero-shot QA accuracy, validating context augmentation for low-resource classical languages. Notably, top-tier LLMs match or surpass specialized fine-tuned baselines across multiple tasks, demonstrating the untapped potential of large-scale pretraining for classical language understanding.
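The summary above contrasts two evaluation settings for Sanskrit factoid QA: a plain zero-shot prompt and a RAG-augmented prompt with retrieved context. A minimal sketch of that contrast is shown below, assuming an OpenAI-style chat API; the prompt wording, the example question and passage, and the answer_factoid helper are illustrative assumptions, not the paper's actual pipeline.

```python
# Rough sketch: zero-shot vs. retrieval-augmented Sanskrit factoid QA.
# Prompt wording, example question/passage, and API choice are illustrative
# assumptions, not the paper's actual pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def answer_factoid(question: str, context: str | None = None,
                   model: str = "gpt-4o") -> str:
    """Answer a Sanskrit factoid question, optionally grounded in retrieved context."""
    prompt = f"Answer the following Sanskrit factoid question concisely.\n\nQuestion: {question}"
    if context is not None:
        # RAG variant: prepend a retrieved passage so the model can ground its answer.
        prompt = f"Use the passage below to answer the question.\n\nPassage: {context}\n\n{prompt}"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic decoding for evaluation
    )
    return response.choices[0].message.content.strip()


# Zero-shot call (no context) vs. RAG call (passage from any off-the-shelf retriever).
print(answer_factoid("kasya putraḥ arjunaḥ?"))
print(answer_factoid("kasya putraḥ arjunaḥ?", context="arjunaḥ pāṇḍoḥ putraḥ."))
```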

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across diverse tasks and languages. In this study, we focus on natural language understanding in three classical languages -- Sanskrit, Ancient Greek, and Latin -- to investigate the factors affecting cross-lingual zero-shot generalization. First, we explore named entity recognition and machine translation into English. While LLMs perform on par with or better than fine-tuned baselines on out-of-domain data, smaller models often struggle, especially with niche or abstract entity types. In addition, we concentrate on Sanskrit by presenting a factoid question-answering (QA) dataset and show that incorporating context via a retrieval-augmented generation approach significantly boosts performance. In contrast, we observe pronounced performance drops for smaller LLMs across these QA tasks. These results suggest that model scale is an important factor influencing cross-lingual generalization. Assuming that the models used, such as GPT-4o and Llama-3.1, are not instruction fine-tuned on classical languages, our findings provide insights into how LLMs may generalize on these languages and their consequent utility in classical studies.
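The NER and translation experiments mentioned in the abstract prompt LLMs directly, with no fine-tuning. A minimal sketch of such a zero-shot NER prompt follows, assuming a chat-style API; the label set, output format, zero_shot_ner helper, and Latin example are hypothetical choices for illustration, not the paper's actual setup.

```python
# Minimal sketch of a zero-shot NER prompt for a classical-language sentence.
# The label set, output format, and Latin example are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

LABELS = ["PERSON", "LOCATION", "GROUP"]  # hypothetical label set


def zero_shot_ner(sentence: str, model: str = "gpt-4o") -> str:
    """Ask the model to tag entities in one sentence, with no fine-tuning or in-context examples."""
    prompt = (
        "Extract the named entities from the following Latin sentence. "
        f"Use only these labels: {', '.join(LABELS)}. "
        "Return one 'entity<TAB>label' pair per line, or 'NONE' if there are no entities.\n\n"
        f"Sentence: {sentence}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content


print(zero_shot_ner("Caesar Rubiconem transiit."))
```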
Problem

Research questions and friction points this paper is trying to address.

How well do mainstream LLMs generalize zero-shot to classical languages such as Sanskrit, Ancient Greek, and Latin?
How do LLMs fare on niche, low-resource tasks such as Sanskrit factoid QA without any fine-tuning?
Which factors, in particular model scale, determine cross-lingual zero-shot generalization?
Innovation

Methods, ideas, or system contributions that make the work stand out.

First zero-shot, non-fine-tuned evaluation of LLMs on NER, machine translation, and QA across Sanskrit, Ancient Greek, and Latin
New Sanskrit factoid QA benchmark dataset, with retrieval-augmented generation shown to significantly boost QA performance
Evidence that model scale is a key determinant of cross-lingual zero-shot generalization