StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates large language models' (LLMs) ability to generate structured outputs, including renderable formats (e.g., HTML, React, SVG) and non-renderable formats (e.g., JSON, XML), and investigates the performance disparities between them. Method: We introduce the first comprehensive benchmark covering 18 structured formats and propose a dual-paradigm evaluation framework spanning both *generation* and *format conversion*. We design a novel cross-format structural correctness metric, enabling systematic differentiation between renderable and non-renderable structure generation. Evaluation employs hybrid validation: rule-based and parser-based checking, AST-level structural comparison, and multi-granularity semantic consistency assessment. Contribution/Results: Experiments reveal that even current state-of-the-art models (e.g., o1-mini) achieve an average score of only 75.58, with open-source models lagging by roughly 10 points. Generation tasks prove significantly more challenging than conversion tasks, and error rates for visual (renderable) structures substantially exceed those for plain-text structures, highlighting a critical gap in LLMs' structured output capabilities.
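The parser-based half of the hybrid validation described above can be sketched with standard-library parsers. This is an illustrative sketch only: the function name, the supported formats, and the row-width rule are assumptions for the example, not StructEval's actual checker.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def is_well_formed(output: str, fmt: str) -> bool:
    """Parser-based validity check for a model's structured output.

    Returns True if the text parses cleanly in the target format.
    (Hypothetical sketch; the benchmark's real checker is richer.)
    """
    try:
        if fmt == "json":
            json.loads(output)
        elif fmt == "xml":
            ET.fromstring(output)
        elif fmt == "csv":
            rows = list(csv.reader(io.StringIO(output)))
            # Simple rule-based check on top of parsing:
            # every non-empty row must have the same number of fields.
            widths = {len(row) for row in rows if row}
            if len(widths) > 1:
                return False
        else:
            raise ValueError(f"unsupported format: {fmt}")
        return True
    except (json.JSONDecodeError, ET.ParseError, csv.Error):
        return False
```

For example, `is_well_formed('{"a": 1}', "json")` passes while the truncated `'{"a": 1'` fails; the CSV branch shows how a format with no strict parser can still get a rule-based structural check.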

📝 Abstract
As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, producing structured output from natural language prompts, and 2) conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models like o1-mini achieve an average score of only 75.58, with open-source alternatives lagging approximately 10 points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.
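Structural correctness goes beyond mere parseability: the output's shape must match a reference. One way to realize the AST-level comparison mentioned in the summary is to reduce parsed JSON to a structural signature (keys, nesting, leaf types) and compare signatures. The function names and the exact notion of "shape" here are assumptions for illustration, not the paper's metric.

```python
import json

def structure_signature(node):
    """Recursively reduce a parsed JSON value to its structural shape:
    dict keys and list nesting are kept, leaf values are abstracted
    to their type name, so differing values still compare equal."""
    if isinstance(node, dict):
        return {key: structure_signature(val) for key, val in node.items()}
    if isinstance(node, list):
        return [structure_signature(item) for item in node]
    return type(node).__name__

def structurally_equal(output: str, reference: str) -> bool:
    """AST-level check: same keys, same nesting, same leaf types.
    Returns False if either side fails to parse at all."""
    try:
        return (structure_signature(json.loads(output))
                == structure_signature(json.loads(reference)))
    except json.JSONDecodeError:
        return False
```

Under this sketch, `{"a": 1, "b": [true]}` is structurally equal to `{"b": [false], "a": 2}` (same keys and leaf types), while `{"a": 1}` and `{"a": "1"}` differ because the leaf type changed.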
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to generate diverse structured outputs
Assessing structural fidelity in non-renderable and renderable formats
Measuring performance gaps between generation and conversion tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces StructEval benchmark for structured outputs
Evaluates 18 formats with novel metrics
Tests generation and conversion tasks systematically