HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing analogy reasoning benchmarks offer little coverage of Indic languages, particularly Hindi, which hinders fair cross-lingual evaluation of multilingual large language models' reasoning capabilities. Method: The authors introduce HATS, the first Hindi-specific analogy reasoning benchmark, comprising 405 multiple-choice questions sourced from official Indian government examinations, and propose a grounded Chain-of-Thought (CoT) prompting strategy that draws on cognitive theories of analogical reasoning. State-of-the-art multilingual LLMs are systematically evaluated under monolingual (Hindi/English) and cross-lingual prompting settings. Contribution/Results: English prompts retain an advantage irrespective of prompting strategy, yet grounded CoT yields substantial gains on Hindi analogy questions (+12.3% average accuracy). HATS fills a critical resource gap in Hindi NLP evaluation and establishes a cognitively informed methodology for multilingual reasoning assessment.

📝 Abstract
Analogies test a model's ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi.
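
To make the evaluation protocol concrete, below is a minimal Python sketch of how a model could be scored on HATS-style items under Hindi or English prompts. The item content, prompt wording, the `generate` callable, and the letter-extraction rule are illustrative assumptions, not the authors' exact setup.

```python
# A minimal sketch of MCQ analogy evaluation under different prompt languages,
# assuming a generic text-generation callable `generate`; item fields and
# templates are hypothetical, not the paper's actual data or prompts.
import re

# One HATS-style item: a Hindi analogy stem with four labeled options.
ITEM = {
    "stem": "पुस्तक : पुस्तकालय :: औषधि : ?",  # book : library :: medicine : ?
    "options": {"A": "अस्पताल", "B": "औषधालय", "C": "विद्यालय", "D": "रसोई"},
    "answer": "B",
}

PROMPT_TEMPLATES = {
    # English instructions over the Hindi item (the strongest setting reported).
    "english": (
        "Solve the analogy. First state the relationship between the first "
        "pair, then pick the option that mirrors it.\n{stem}\n{options}\n"
        "Answer with a single letter."
    ),
    # Monolingual Hindi instructions.
    "hindi": (
        "सादृश्य हल करें। पहले पहले जोड़े का संबंध बताएं, फिर वही संबंध रखने "
        "वाला विकल्प चुनें।\n{stem}\n{options}\nकेवल एक अक्षर में उत्तर दें।"
    ),
}

def format_prompt(item, lang):
    opts = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
    return PROMPT_TEMPLATES[lang].format(stem=item["stem"], options=opts)

def extract_choice(completion):
    """Take the first standalone option letter in the model's output."""
    match = re.search(r"\b([ABCD])\b", completion)
    return match.group(1) if match else None

def accuracy(items, generate, lang):
    correct = sum(
        extract_choice(generate(format_prompt(it, lang))) == it["answer"]
        for it in items
    )
    return correct / len(items)

if __name__ == "__main__":
    # Stub model so the sketch runs end to end; replace with a real LLM call.
    def mock_generate(prompt):
        return "The relation maps an item to its typical place, so B."

    print(f"accuracy (english prompts): {accuracy([ITEM], mock_generate, 'english'):.2f}")
```

Holding the items fixed while swapping the template language is what isolates the effect of prompt language that the abstract reports.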
Problem

Research questions and friction points this paper is trying to address.

Evaluating the reasoning capabilities of large language models in Hindi
Addressing the lack of analogy test sets for Indic languages
Improving model performance on Hindi analogy questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the Hindi Analogy Test Set (HATS), 405 exam-sourced multiple-choice questions
Proposes a grounded Chain-of-Thought prompting approach informed by cognitive theories of analogical reasoning (see the prompt sketch after this list)
Benchmarks state-of-the-art multilingual LLMs under Hindi and English prompting strategies
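
As referenced above, the sketch below contrasts plain MCQ prompting with a grounded Chain-of-Thought prompt that walks the model through the relation-identification and relation-mapping steps that cognitive accounts of analogy describe. The step wording is an assumption modeled on relation-mapping theories, not the authors' published template.

```python
# Illustrative contrast between plain prompting and a grounded CoT prompt;
# the numbered steps are a hypothetical decomposition, not the paper's exact text.
PLAIN = "Complete the analogy: {a} : {b} :: {c} : ?\nOptions: {options}\nAnswer:"

GROUNDED_COT = (
    "Complete the analogy: {a} : {b} :: {c} : ?\n"
    "Options: {options}\n"
    "Reason step by step:\n"
    "1. Describe {a} and {b} and how they relate in everyday experience.\n"
    "2. State that relation abstractly (e.g., object : typical location).\n"
    "3. Apply the same relation to {c} and test each option against it.\n"
    "4. Answer with the single option letter that best preserves the relation.\n"
)

if __name__ == "__main__":
    fields = dict(
        a="पुस्तक", b="पुस्तकालय", c="औषधि",  # book : library :: medicine : ?
        options="A. अस्पताल  B. औषधालय  C. विद्यालय  D. रसोई",
    )
    print(GROUNDED_COT.format(**fields))
```

The grounded variant asks the model to articulate the source relation in everyday, experience-based terms before mapping it, which is the kind of cognitively motivated scaffolding the abstract credits for the Hindi performance gains.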