🤖 AI Summary
Large language models (LLMs) exhibit limited capability in generating structured outputs strictly compliant with JSON Schema, primarily due to bottlenecks in schema understanding, string escaping handling, and natural-language-to-schema mapping. Method: We introduce SchemaBench—the first large-scale, high-coverage JSON Schema benchmark comprising over 40,000 diverse schemas—and propose a schema-aware reinforcement learning framework guided by a fine-grained syntactic validator, integrated with structured prompting for end-to-end optimization. Contribution/Results: Our approach significantly improves both JSON syntactic validity and schema adherence rates, substantially outperforming state-of-the-art baselines on SchemaBench. Moreover, it delivers measurable gains in downstream practical applications, such as API call generation and execution, demonstrating robust generalization across schema complexity and domain diversity.
📝 Abstract
This study investigates the structured generation capabilities of large language models (LLMs), focusing on producing valid JSON outputs against a given schema. Despite the widespread use of JSON in integrating language models with programs, there is a lack of comprehensive analysis and benchmarking of these capabilities. We explore various aspects of JSON generation, such as structure understanding, escaping, and natural language description, to determine how to assess and enable LLMs to generate valid responses. Building upon this, we propose SchemaBench features around 40K different JSON schemas to obtain and assess models' abilities in generating valid JSON. We find that the latest LLMs are still struggling to generate a valid JSON string. Moreover, we demonstrate that incorporating reinforcement learning with a Fine-grained Schema Validator can further enhance models' understanding of JSON schema, leading to improved performance. Our models demonstrate significant improvement in both generating JSON outputs and downstream tasks.