🤖 AI Summary
This work addresses the low compliance rates and the lack of systematic evaluation of constrained decoding techniques under realistic, complex constraints, particularly JSON Schema. We introduce JSONSchemaBench, the first large-scale structured-generation benchmark, comprising 10K real-world JSON schemas, and propose a multidimensional evaluation framework that assesses compliance, constraint coverage, and output quality across six state-of-the-art frameworks. Our analysis reveals, for the first time, that compliance rates drop by over 40% for existing frameworks when handling nested, recursive, and conditional constraints. We further identify XGrammar and Outlines as achieving the best trade-off between inference efficiency and generation quality. The benchmark and evaluation suite are fully open-sourced, filling a critical gap in the systematic evaluation of structured generation and pushing constrained decoding toward higher reliability and stronger generalization.
📝 Abstract
Reliably generating structured outputs has become a critical capability for modern language model (LM) applications. Constrained decoding has emerged as the dominant technique across sectors for enforcing structured outputs during generation. Despite its growing adoption, little has been done to systematically evaluate the behavior and performance of constrained decoding. Constrained decoding frameworks have standardized around JSON Schema as a structured data format, with most frameworks guaranteeing constraint compliance given a schema. However, the effectiveness of these methods in practice remains poorly understood. We present an evaluation framework that assesses constrained decoding approaches across three critical dimensions: efficiency in generating constraint-compliant outputs, coverage of diverse constraint types, and quality of the generated outputs. To facilitate this evaluation, we introduce JSONSchemaBench, a benchmark for constrained decoding comprising 10K real-world JSON schemas that encompass a wide range of constraints with varying complexity. We pair the benchmark with the official JSON Schema Test Suite and evaluate six state-of-the-art constrained decoding frameworks: Guidance, Outlines, Llamacpp, XGrammar, OpenAI, and Gemini. Through extensive experiments, we gain insights into the capabilities and limitations of constrained decoding for structured generation with real-world JSON schemas. Our work provides actionable insights for improving constrained decoding frameworks and structured generation tasks, setting a new standard for evaluating constrained decoding and structured generation. We release JSONSchemaBench at https://github.com/guidance-ai/jsonschemabench.
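To make the notion of "constraint compliance given a schema" concrete, here is a minimal sketch. The schema and field names are hypothetical (not taken from the benchmark), and the hand-rolled `validate` covers only the keywords this one schema uses; a real compliance check would use a full validator library such as `jsonschema`. It illustrates two constraint types the abstract calls out as hard: a required nested object and a conditional (`if`/`then`) requirement.

```python
import json

# Hypothetical schema illustrating constraint types in the benchmark:
# an enum, a required nested object, and a conditional requirement
# (JSON Schema if/then): admins must also carry a "permissions" field.
SCHEMA = {
    "type": "object",
    "required": ["kind", "profile"],
    "properties": {
        "kind": {"enum": ["user", "admin"]},
        "profile": {
            "type": "object",
            "required": ["name"],
            "properties": {"name": {"type": "string"}},
        },
    },
    "if": {"properties": {"kind": {"const": "admin"}}},
    "then": {"required": ["permissions"]},
}

def validate(doc, schema):
    """Tiny validator for just the keywords used above."""
    if schema.get("type") == "object" and not isinstance(doc, dict):
        return False
    for key in schema.get("required", []):
        if key not in doc:
            return False
    for key, sub in schema.get("properties", {}).items():
        if key in doc:
            val = doc[key]
            if "enum" in sub and val not in sub["enum"]:
                return False
            if "const" in sub and val != sub["const"]:
                return False
            if sub.get("type") == "string" and not isinstance(val, str):
                return False
            if sub.get("type") == "object" and not validate(val, sub):
                return False
    # Conditional: apply "then" only if the "if" subschema matches.
    if "if" in schema and validate(doc, schema["if"]):
        if not validate(doc, schema.get("then", {})):
            return False
    return True

def is_compliant(output: str) -> bool:
    """Compliance as typically measured: the raw model text must parse
    as JSON *and* satisfy every constraint in the schema."""
    try:
        return validate(json.loads(output), SCHEMA)
    except json.JSONDecodeError:
        return False
```

A framework that *guarantees* compliance masks token probabilities during decoding so that `is_compliant` holds for every output by construction; the evaluation question is how often that holds in practice, and at what cost in speed and quality.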