PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited comprehension and reasoning capabilities of current large language models in the domain of public policy, as well as the absence of systematic evaluation benchmarks and targeted optimization approaches. To bridge this gap, the authors introduce PolicyBench, the first large-scale, multidimensional evaluation benchmark covering both the Chinese and U.S. policy systems, and design a hierarchical assessment framework grounded in Bloom's taxonomy of cognitive skills. Building on this foundation, they propose PolicyMoE, a domain-specific mixture-of-experts architecture whose experts are aligned to the cognitive levels of policy comprehension. Through structured expert specialization and domain-adaptive training, PolicyMoE enhances model performance on tasks involving policy memorization, comprehension, and application. Experimental results indicate that PolicyMoE outperforms general-purpose large language models on applied and structured reasoning tasks, and they help identify and alleviate critical bottlenecks in policy understanding.
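The idea of experts aligned to cognitive levels can be illustrated with a generic soft-gated mixture. The sketch below is not the PolicyMoE architecture, whose routing details are not given here. It only assumes one expert per level and a softmax gate, and the names `LEVELS`, `softmax`, and `moe_combine` are hypothetical.

```python
import math

# Assumption: one expert per cognitive level named in the summary.
LEVELS = ["memorization", "understanding", "application"]

def softmax(scores):
    """Turn raw gate scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_combine(gate_scores, expert_outputs):
    """Weight each expert's output vector by its gate weight and sum them.

    gate_scores: one raw score per expert (hypothetical gate output).
    expert_outputs: one equal-length vector per expert.
    """
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    mixed = [0.0] * dim
    for w, out in zip(weights, expert_outputs):
        for i in range(dim):
            mixed[i] += w * out[i]
    return weights, mixed

# Usage: a gate that favours the "application" expert.
weights, mixed = moe_combine([0.1, 0.2, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
for level, w in zip(LEVELS, weights):
    print(f"{level}: {w:.3f}")
```

A real mixture-of-experts layer would typically learn the gate jointly with the experts and may route sparsely to the top-scoring experts only; the dense weighting here is chosen purely for brevity.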

📝 Abstract

Large Language Models (LLMs) are increasingly integrated into real-world decision-making, including in the domain of public policy. Yet their ability to comprehend and reason about policy-related content remains underexplored. To fill this gap, we present PolicyBench, the first large-scale cross-system benchmark (US-China) evaluating policy comprehension, comprising 21K cases across a broad spectrum of policy areas and capturing the diversity and complexity of real-world governance. Following Bloom's taxonomy, the benchmark assesses three core capabilities: (1) Memorization: factual recall of policy knowledge, (2) Understanding: conceptual and contextual reasoning, and (3) Application: problem-solving in real-life policy scenarios. Building on this benchmark, we further propose PolicyMoE, a domain-specialized Mixture-of-Experts (MoE) model with expert modules aligned to each cognitive level. The proposed models demonstrate stronger performance on application-oriented policy tasks than on memorization or conceptual understanding, and yield the highest accuracy on structured reasoning tasks. Our results reveal key limitations of current LLMs in policy understanding and suggest paths toward more reliable, policy-focused models.
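Reporting results separately for the three capabilities amounts to grouping cases by cognitive level before scoring. The following is a minimal sketch of that bookkeeping, assuming exact-match scoring; the record fields `level`, `answer`, and `prediction` and the sample values are hypothetical and do not reflect the actual PolicyBench data format.

```python
from collections import defaultdict

def accuracy_by_level(cases):
    """Return exact-match accuracy per cognitive level.

    Each case is a dict with hypothetical keys:
    'level', 'answer', and 'prediction'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for case in cases:
        level = case["level"]
        total[level] += 1
        if case["prediction"] == case["answer"]:
            correct[level] += 1
    return {level: correct[level] / total[level] for level in total}

# Illustrative records only, not drawn from the benchmark.
sample = [
    {"level": "memorization", "answer": "A", "prediction": "A"},
    {"level": "memorization", "answer": "B", "prediction": "C"},
    {"level": "understanding", "answer": "D", "prediction": "D"},
    {"level": "application", "answer": "A", "prediction": "B"},
]

scores = accuracy_by_level(sample)
```

Keeping the levels separate rather than pooling all cases makes it possible to see whether a model is weaker at recall, conceptual reasoning, or applied problem-solving.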

Problem

Research questions and friction points this paper is trying to address.

public policy

large language models

policy comprehension

benchmarking

reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

PolicyBench

PolicyMoE

Mixture-of-Experts