🤖 AI Summary
Existing large language models (LLMs) degrade significantly on multi-constraint instruction-following tasks as the number of constraints grows, yet no systematic benchmark exists for rigorous evaluation. Method: We introduce WildIFEval, the first real-world multi-constraint instruction dataset (12K samples), featuring human-annotated, semantically rich constraints across eight categories spanning lexical, topical, and logical dimensions. We propose a constraint-taxonomy-guided data construction paradigm with expert curation and design a multi-model collaborative evaluation framework. Results: Empirical analysis reveals that state-of-the-art LLMs suffer an average performance drop of over 30% on multi-constraint instructions, with constraint-type effects inducing performance fluctuations of up to 42%, highlighting critical bottlenecks in constraint reasoning and integration. This work establishes the first high-quality, reproducible benchmark for complex instruction evaluation and provides foundational resources for constraint-aware modeling.
📝 Abstract
Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 12K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints expressed in natural user prompts. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. Our findings reveal that all evaluated models degrade as the number of constraints increases, leaving considerable room for improvement on such tasks. Moreover, we observe that the specific type of constraint plays a critical role in model performance. We release our dataset to promote further research on instruction following under complex, realistic conditions.