Still Not There: Can LLMs Outperform Smaller Task-Specific Seq2Seq Models on the Poetry-to-Prose Conversion Task?

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) can outperform specialized small models on poetry-to-prose (anvaya) conversion in Sanskrit, a low-resource, morphologically rich language. The task is challenging because it requires jointly handling free word order, metrical constraints, compound segmentation, dependency parsing, and syntactic linearization. The authors propose structured prompt templates grounded in Pāṇinian grammar and classical commentary heuristics, and systematically compare instruction-tuned LLMs, in-context learning, and a domain-adapted ByT5-Sanskrit model. Results show that domain-finetuned ByT5-Sanskrit significantly surpasses all LLM variants, including GPT-4, Claude, and open-weight base models, on both human evaluation and the Kendall's Tau automatic metric, with strong agreement between the two and robust cross-domain generalization. The study shows that, under extreme morphological complexity and data scarcity, carefully engineered lightweight task-specific models retain a clear advantage over general-purpose LLMs.
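Since Kendall's Tau is used here as the automatic score for how well a predicted prose ordering matches the gold anvaya, a minimal sketch of such a word-order computation is shown below. This is an illustration, not the authors' evaluation script; the token names are placeholders and the handling of repeated tokens is an assumption.

```python
# Illustrative sketch of a Kendall's Tau word-order score between a predicted
# anvaya and a gold prose reference. Not the authors' evaluation code; it
# assumes the two sequences share tokens and ignores duplicates for simplicity.
from scipy.stats import kendalltau

def word_order_tau(predicted_tokens, gold_tokens):
    """Kendall's Tau over the gold positions of tokens shared by both sequences."""
    gold_rank = {tok: i for i, tok in enumerate(gold_tokens)}
    # Keep predicted tokens that also occur in the gold sequence, in predicted order.
    ranks = [gold_rank[tok] for tok in predicted_tokens if tok in gold_rank]
    if len(ranks) < 2:
        return 0.0
    tau, _ = kendalltau(range(len(ranks)), ranks)
    return tau

# Toy transliterated example: one swapped word pair lowers the score below 1.
gold = ["ramah", "vanam", "gacchati"]
pred = ["vanam", "ramah", "gacchati"]
print(word_order_tau(pred, gold))
```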

📝 Abstract
Large Language Models (LLMs) are increasingly treated as universal, general-purpose solutions across NLP tasks, particularly in English. But does this assumption hold for low-resource, morphologically rich languages such as Sanskrit? We address this question by comparing instruction-tuned and in-context-prompted LLMs with smaller task-specific encoder-decoder models on the Sanskrit poetry-to-prose conversion task. This task is intrinsically challenging: Sanskrit verse exhibits free word order combined with rigid metrical constraints, and its conversion to canonical prose (anvaya) requires multi-step reasoning involving compound segmentation, dependency resolution, and syntactic linearisation. This makes it an ideal testbed to evaluate whether LLMs can surpass specialised models. For LLMs, we apply instruction fine-tuning on general-purpose models and design in-context learning templates grounded in Paninian grammar and classical commentary heuristics. For task-specific modelling, we fully fine-tune a ByT5-Sanskrit Seq2Seq model. Our experiments show that domain-specific fine-tuning of ByT5-Sanskrit significantly outperforms all instruction-driven LLM approaches. Human evaluation strongly corroborates this result, with scores exhibiting high correlation with Kendall's Tau scores. Additionally, our prompting strategies provide an alternative to fine-tuning when domain-specific verse corpora are unavailable, and the task-specific Seq2Seq model demonstrates robust generalisation on out-of-domain evaluations.
Problem

Research questions and friction points this paper is trying to address.

Evaluating whether LLMs outperform specialized models for Sanskrit poetry-to-prose conversion
Testing LLMs on low-resource morphologically rich languages like Sanskrit
Comparing instruction-tuned LLMs with task-specific Seq2Seq models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction fine-tuning for general-purpose LLMs
In-context learning with Paninian grammar templates
Domain-specific fine-tuning of the ByT5-Sanskrit Seq2Seq model (see the sketch after this list)
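The ByT5 fine-tuning follows the standard Hugging Face Seq2Seq recipe; a minimal sketch is given below. It assumes the public google/byt5-small checkpoint as a stand-in for ByT5-Sanskrit, hypothetical "verse"/"anvaya" dataset fields, and placeholder hyperparameters rather than the authors' settings.

```python
# Minimal fine-tuning sketch, not the authors' training setup.
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/byt5-small"  # stand-in for the ByT5-Sanskrit checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(example):
    # "verse" = poetic input, "anvaya" = canonical prose target (assumed field names).
    inputs = tokenizer(example["verse"], truncation=True, max_length=512)
    targets = tokenizer(example["anvaya"], truncation=True, max_length=512)
    inputs["labels"] = targets["input_ids"]
    return inputs

# train_dataset = raw_dataset.map(preprocess)  # dataset loading omitted

args = Seq2SeqTrainingArguments(
    output_dir="byt5-anvaya",
    learning_rate=1e-4,               # placeholder hyperparameters
    per_device_train_batch_size=8,
    num_train_epochs=10,
    predict_with_generate=True,
)
# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=train_dataset, tokenizer=tokenizer)
# trainer.train()
```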
Manoj Balaji Jagadeeshan
IIT Kharagpur, India
Nallani Chakravartula Sahith
IIT Kharagpur, India
Jivnesh Sandhan
Postdoc Kyoto University | PhD, IIT Kanpur
Computational Psychometrics, Interpretability, LLM Jailbreaking, Animal Language Modeling, Sanskrit
Pawan Goyal
IIT Kharagpur, India