POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios

📅 2025-11-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing evaluation methodologies inadequately characterize large language models’ (LLMs) practical capabilities in bilingual governmental policy contexts, lacking scenario-adapted multidimensional criteria, task designs, and evaluation metrics. To address this gap, we propose the first comprehensive LLM benchmark tailored to bilingual governmental policy scenarios: (1) we release a timely, bilingual (Chinese–English) corpus of authentic governmental policy documents; (2) we design three realistic, policy-grounded tasks—clause comprehension, policy solution generation, and regulatory compliance judgment; and (3) we introduce a dual-dimensional evaluation framework integrating semantic similarity and factual accuracy. Using this benchmark, we systematically evaluate mainstream LLMs and find that reasoning-oriented models exhibit superior cross-task stability. Furthermore, we fine-tune lightweight POLIS-series models, which match or surpass strong proprietary baselines across multiple subtasks while significantly reducing inference cost and deployment overhead.

📝 Abstract
We introduce POLIS-Bench, the first rigorous, systematic evaluation suite designed for LLMs operating in governmental bilingual policy scenarios. Compared to existing benchmarks, POLIS-Bench introduces three major advancements. (i) Up-to-date Bilingual Corpus: We construct an extensive, up-to-date policy corpus that significantly scales the effective assessment sample size, ensuring relevance to current governance practice. (ii) Scenario-Grounded Task Design: We distill three specialized, scenario-grounded tasks -- Clause Retrieval & Interpretation, Solution Generation, and Compliance Judgment -- to comprehensively probe model understanding and application. (iii) Dual-Metric Evaluation Framework: We establish a novel dual-metric evaluation framework combining semantic similarity with accuracy rate to precisely measure both content alignment and task requirement adherence. A large-scale evaluation of over 10 state-of-the-art LLMs on POLIS-Bench reveals a clear performance hierarchy in which reasoning models maintain superior cross-task stability and accuracy, highlighting the difficulty of compliance tasks. Furthermore, leveraging our benchmark, we fine-tune a lightweight open-source model. The resulting POLIS-series models achieve parity with, or surpass, strong proprietary baselines on multiple policy subtasks at a significantly reduced cost, providing a cost-effective and compliant path for robust real-world governmental deployment.
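
The dual-metric framework pairs a semantic-similarity score (content alignment) with an accuracy rate (task-requirement adherence). The sketch below shows one way such a scorer could be wired up; the embedding model (all-MiniLM-L6-v2 via sentence-transformers), the label scheme, and the aggregation are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a dual-metric evaluation loop: semantic similarity
# between model answers and reference answers, plus an accuracy rate over
# items with discrete gold labels (e.g. compliance judgments). The library
# choice and field names are assumptions for illustration only.
from dataclasses import dataclass
from sentence_transformers import SentenceTransformer, util


@dataclass
class EvalItem:
    prediction: str                 # model output
    reference: str                  # gold policy answer
    gold_label: str | None = None   # e.g. "compliant" / "non-compliant"
    pred_label: str | None = None   # label parsed from the model output


def evaluate(items: list[EvalItem], embed_model: str = "all-MiniLM-L6-v2") -> dict:
    """Return corpus-level semantic similarity and accuracy rate."""
    model = SentenceTransformer(embed_model)
    preds = model.encode([it.prediction for it in items], convert_to_tensor=True)
    refs = model.encode([it.reference for it in items], convert_to_tensor=True)

    # Mean cosine similarity between each prediction and its own reference.
    sims = util.cos_sim(preds, refs).diagonal()
    semantic_similarity = float(sims.mean())

    # Accuracy rate over items that carry a discrete gold label.
    labeled = [it for it in items if it.gold_label is not None]
    accuracy = (
        sum(it.pred_label == it.gold_label for it in labeled) / len(labeled)
        if labeled else float("nan")
    )
    return {"semantic_similarity": semantic_similarity, "accuracy_rate": accuracy}
```

In this reading, generation-style tasks (clause interpretation, solution generation) would be scored mainly by the similarity term, while the compliance-judgment task also contributes to the accuracy term.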
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in bilingual governmental policy scenarios using specialized, scenario-grounded tasks
Assessing model compliance and accuracy through a dual-metric evaluation framework
Developing cost-effective models for real-world governmental deployment through fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs up-to-date bilingual policy corpus
Designs scenario-grounded specialized task framework
Establishes dual-metric semantic-accuracy evaluation system