🤖 AI Summary
Existing evaluation methodologies inadequately characterize large language models’ (LLMs) practical capabilities in bilingual governmental policy contexts, lacking scenario-adapted multidimensional criteria, task designs, and evaluation metrics. To address this gap, we propose the first comprehensive LLM benchmark tailored to bilingual governmental policy scenarios: (1) we release a timely, bilingual (Chinese–English) corpus of authentic governmental policy documents; (2) we design three realistic, policy-grounded tasks—clause comprehension, policy solution generation, and regulatory compliance judgment; and (3) we introduce a dual-dimensional evaluation framework integrating semantic similarity and factual accuracy. Using this benchmark, we systematically evaluate mainstream LLMs and find that reasoning-oriented models exhibit superior cross-task stability. Furthermore, we fine-tune lightweight POLIS-series models, which match or surpass strong proprietary baselines across multiple subtasks while significantly reducing inference cost and deployment overhead.
📝 Abstract
We introduce POLIS-Bench, the first rigorous, systematic evaluation suite designed for LLMs operating in governmental bilingual policy scenarios. Compared to existing benchmarks, POLIS-Bench introduces three major advancements. (i) Up-to-date Bilingual Corpus: We construct an extensive, up-to-date policy corpus that significantly scales the effective assessment sample size, ensuring relevance to current governance practice. (ii) Scenario-Grounded Task Design: We distill three specialized, scenario-grounded tasks -- Clause Retrieval & Interpretation, Solution Generation, and Compliance Judgment -- to comprehensively probe model understanding and application. (iii) Dual-Metric Evaluation Framework: We establish a novel dual-metric evaluation framework combining semantic similarity with accuracy rate to precisely measure both content alignment and task requirement adherence. A large-scale evaluation of over 10 state-of-the-art LLMs on POLIS-Bench reveals a clear performance hierarchy in which reasoning models maintain superior cross-task stability and accuracy, highlighting the difficulty of compliance tasks. Furthermore, leveraging our benchmark, we successfully fine-tune a lightweight open-source model. The resulting POLIS series models achieve parity with, or surpass, strong proprietary baselines on multiple policy subtasks at a significantly reduced cost, providing a cost-effective and compliant path toward robust real-world governmental deployment.