🤖 AI Summary
Existing evaluation methodologies inadequately characterize large language models’ (LLMs) practical capabilities in bilingual governmental policy contexts, lacking scenario-adapted multidimensional criteria, task designs, and evaluation metrics. To address this gap, we propose the first comprehensive LLM benchmark tailored to bilingual governmental policy scenarios: (1) we release a timely, bilingual (Chinese–English) corpus of authentic governmental policy documents; (2) we design three realistic, policy-grounded tasks—clause comprehension, policy solution generation, and regulatory compliance judgment; and (3) we introduce a dual-dimensional evaluation framework integrating semantic similarity and factual accuracy. Using this benchmark, we systematically evaluate mainstream LLMs and find that reasoning-oriented models exhibit superior cross-task stability. Furthermore, we fine-tune lightweight POLIS-series models, which match or surpass strong proprietary baselines across multiple subtasks while significantly reducing inference cost and deployment overhead.
📝 Abstract
We introduce POLIS-Bench, the first rigorous, systematic evaluation suite designed for LLMs operating in governmental bilingual policy scenarios. Compared to existing benchmarks, POLIS-Bench introduces three major advancements. (i) Up-to-date Bilingual Corpus: We construct an extensive, up-to-date policy corpus that significantly scales the effective assessment sample size, ensuring relevance to current governance practice. (ii) Scenario-Grounded Task Design: We distill three specialized, scenario-grounded tasks -- Clause Retrieval & Interpretation, Solution Generation, and Compliance Judgment -- to comprehensively probe model understanding and application. (iii) Dual-Metric Evaluation Framework: We establish a novel dual-metric evaluation framework combining semantic similarity with accuracy rate to precisely measure both content alignment and task requirement adherence. A large-scale evaluation of over 10 state-of-the-art LLMs on POLIS-Bench reveals a clear performance hierarchy in which reasoning models maintain superior cross-task stability and accuracy, highlighting the difficulty of compliance tasks. Furthermore, leveraging our benchmark, we successfully fine-tune a lightweight open-source model. The resulting POLIS series models achieve parity with, or surpass, strong proprietary baselines on multiple policy subtasks at a significantly reduced cost, providing a cost-effective and compliant path toward robust real-world governmental deployment.