🤖 AI Summary
This work addresses short-answer matching for the Baltic languages by introducing a fine-grained evaluation benchmark for Latvian and Lithuanian, comprising 1,192 question-answer pairs (502 Latvian, 690 Lithuanian) with rule-generated, partially human-verified non-matching answers that preserve most of the surface form. The authors propose a rule-driven method for generating subtle answer perturbations and use it to systematically evaluate models across scales, including Qwen2.5, LLaMa3.1, Mistral, and EuroLLM, under zero-shot and few-shot settings. Larger models such as Qwen2.5 72B and LLaMa3.1 70B achieve near-perfect accuracy, while the smaller Qwen2.5 7B and Mistral 7B remain competitive with them; notably, Mistral Nemo 12B degrades markedly on Lithuanian, suggesting that model size alone does not determine discriminative capability. The benchmark and evaluation setup provide a basis for fine-grained semantic matching evaluation in lower-resource languages.
📝 Abstract
In this work, we address the challenge of evaluating large language models (LLMs) on the short-answer matching task for the Latvian and Lithuanian languages. We introduce novel datasets consisting of 502 Latvian and 690 Lithuanian question-answer pairs. For each pair, we generated matched and non-matched answers using a set of alteration rules specifically designed to introduce small but meaningful changes in the text. These generated answers serve as test cases to assess the ability of LLMs to detect subtle deviations from the original answers. A subset of the datasets was manually verified for quality and accuracy. Our results show that while larger LLMs, such as Qwen2.5 72B and LLaMa3.1 70B, demonstrate near-perfect performance in distinguishing matched and non-matched answers, smaller models show more variance. For instance, LLaMa3.1 8B and EuroLLM 9B benefited from few-shot examples, while Mistral Nemo 12B underperformed on the detection of subtle text alterations, particularly in Lithuanian, even with additional examples. Qwen2.5 7B and Mistral 7B achieved strong performance comparable to the larger 70B models in zero-shot and few-shot experiments, although Mistral 7B was weaker in the few-shot setting.
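The abstract does not spell out the alteration rules themselves. A minimal sketch of how such rule-driven non-matching answers might be generated is shown below; the rule names and implementations here are hypothetical illustrations, not the authors' actual rules (which for Latvian and Lithuanian would use language-specific morphology and negation):

```python
import re

def perturb_number(answer: str) -> str:
    """Hypothetical rule: nudge the first number in the answer
    (e.g. '1918' -> '1919'), a small but meaning-changing edit."""
    match = re.search(r"\d+", answer)
    if not match:
        return answer  # rule not applicable
    new_value = str(int(match.group()) + 1)
    return answer[:match.start()] + new_value + answer[match.end():]

def add_negation(answer: str) -> str:
    """Hypothetical rule: prepend a negation, flipping the meaning
    while leaving most surface tokens unchanged."""
    return "not " + answer

# A rule set would bundle many such functions.
ALTERATION_RULES = [perturb_number, add_negation]

def generate_non_matching(answer: str) -> list[str]:
    """Apply each rule and keep only outputs that actually differ,
    yielding candidate non-matched answers for the benchmark."""
    candidates = [rule(answer) for rule in ALTERATION_RULES]
    return [c for c in candidates if c != answer]
```

In the paper's pipeline, candidates like these were additionally filtered by manual verification on a subset, since a rule can occasionally produce an answer that still matches semantically.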