Can Large Language Models Become Policy Refinement Partners? Evidence from China's Social Security Studies

📅 2025-04-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the feasibility of large language models (LLMs) as collaborative partners in optimizing China's social security policies. Method: We propose a "context-embedded generation-adaptation" framework and conduct the first empirical comparison among DeepSeek-R1 (a regionally adapted LLM), GPT-4o, and human policy experts in generating actionable policy recommendations, using a multidimensional human evaluation system that covers systemic coherence, stakeholder balance, fiscal risk assessment, and cultural adaptability. Contribution/Results: DeepSeek-R1 significantly outperforms GPT-4o across all dimensions, validating the advantage of regionally adapted LLMs in domain-specific public policy contexts. While LLMs demonstrate strong capabilities in structuring policy and efficiently generating diverse, actionable options, they remain dependent on human expertise for modeling socio-dynamic complexity, mediating multi-stakeholder interests, and assessing long-term fiscal sustainability. This work establishes a methodological paradigm and an empirical benchmark for integrating LLMs into evidence-informed public policy design.

📝 Abstract
The rapid development of large language models (LLMs) is reshaping operational paradigms across multidisciplinary domains. LLMs' emergent capability to synthesize policy-relevant insights across disciplinary boundaries suggests potential as decision-support tools. However, their actual performance and suitability as policy refinement partners still require verification through rigorous and systematic evaluation. Our study employs a context-embedded generation-adaptation framework to conduct a tripartite comparison among the American GPT-4o, the Chinese DeepSeek-R1, and human researchers, investigating the capability boundaries and performance characteristics of LLMs in generating policy recommendations for China's social security issues. This study demonstrates that while LLMs exhibit distinct advantages in systematic policy design, they face significant limitations in addressing complex social dynamics, balancing stakeholder interests, and controlling fiscal risks within the social security domain. Furthermore, DeepSeek-R1 outperforms GPT-4o across all evaluation dimensions in policy recommendation generation, illustrating the potential of localized training to improve contextual alignment. These findings suggest that regionally adapted LLMs can function as supplementary tools for generating diverse policy alternatives informed by domain-specific social insights. Nevertheless, formulating policy refinements requires integration with the expertise of human researchers, which remains critical for interpreting institutional frameworks, cultural norms, and value systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' capability in policy refinement for social security
Comparing GPT-4o, DeepSeek-R1, and humans in policy recommendations
Assessing limitations of LLMs in complex social dynamics and fiscal risks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context-embedded generation-adaptation framework for evaluation
Tripartite comparison among GPT-4o, DeepSeek-R1, humans
Localized training improves contextual policy recommendation alignment
Jinghan Ke
University of Texas at Austin
Zhou Zheng
Baichuan Inc., China
Yuxuan Zhao
College of Humanities and Development Studies, China Agricultural University, China