PARSE: An Open-Domain Reasoning Question Answering Benchmark for Persian

πŸ“… 2026-02-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the absence of open-domain question answering benchmarks for evaluating reasoning capabilities in Persian by introducing PARSE, the first Persian open-domain reasoning QA dataset, comprising 10,800 diverse samples spanning boolean, multiple-choice, and factoid questions. High data quality and varied difficulty levels are ensured through a pipeline involving controlled large language model generation, multi-stage filtering, human annotation, and consistency validation. Experimental results demonstrate that combining Persian-specific prompts with structured prompting strategies significantly enhances the performance of both multilingual and Persian-specialized models, with further gains achieved through fine-tuning. These findings confirm PARSE’s effectiveness and practical utility in enabling fair model comparison and adaptation for Persian-language reasoning tasks.

πŸ“ Abstract
Reasoning-focused Question Answering (QA) has advanced rapidly with Large Language Models (LLMs), yet high-quality benchmarks for low-resource languages remain scarce. Persian, spoken by roughly 130 million people, lacks a comprehensive open-domain resource for evaluating reasoning-capable QA systems. We introduce PARSE, the first open-domain Persian reasoning QA benchmark, containing 10,800 questions across Boolean, multiple-choice, and factoid formats, with diverse reasoning types, difficulty levels, and answer structures. The benchmark is built via a controlled LLM-based generation pipeline and validated through human evaluation. We also ensure linguistic and factual quality through multi-stage filtering, annotation, and consistency checks. We benchmark multilingual and Persian LLMs under multiple prompting strategies and show that Persian prompts and structured prompting (CoT for Boolean/multiple-choice; few-shot for factoid) improve performance. Fine-tuning further boosts results, especially for Persian-specialized models. These findings highlight how PARSE supports both fair comparison and practical model adaptation. PARSE fills a critical gap in Persian QA research and provides a strong foundation for developing and evaluating reasoning-capable LLMs in low-resource settings.
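The abstract reports that the strongest configuration pairs Persian-language prompts with a format-dependent strategy: chain-of-thought (CoT) for Boolean and multiple-choice questions, few-shot exemplars for factoid questions. A minimal sketch of that routing logic is below; the Persian prompt wording and the exemplar question are illustrative placeholders, not the actual PARSE prompts.

```python
# Hypothetical sketch of format-dependent prompting: CoT for Boolean and
# multiple-choice items, few-shot for factoid items. Wording is illustrative.

# "Think step by step, then give the final answer." (illustrative CoT cue)
COT_INSTRUCTION_FA = "قدم به قدم فکر کن و سپس پاسخ نهایی را بده."

# Illustrative few-shot exemplar: "What is the capital of Iran?" -> "Tehran"
FACTOID_EXEMPLARS = [
    ("پایتخت ایران کجاست؟", "تهران"),
]

def build_prompt(question: str, q_format: str) -> str:
    """Build a Persian prompt according to the question format."""
    if q_format in ("boolean", "multiple_choice"):
        # Structured CoT prompt for Boolean / multiple-choice questions.
        return f"{question}\n{COT_INSTRUCTION_FA}"
    if q_format == "factoid":
        # Few-shot prompt: prepend worked question/answer pairs.
        shots = "\n".join(f"پرسش: {q}\nپاسخ: {a}" for q, a in FACTOID_EXEMPLARS)
        return f"{shots}\nپرسش: {question}\nپاسخ:"
    raise ValueError(f"unknown question format: {q_format}")
```

The same routing would apply to any model under evaluation; only the prompt template changes per question format, which keeps comparisons across multilingual and Persian-specialized models consistent.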
Problem

Research questions and friction points this paper is trying to address.

Persian
reasoning
question answering
benchmark
low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Persian QA
reasoning benchmark
LLM-based generation
structured prompting
low-resource languages
πŸ”Ž Similar Papers
No similar papers found.