🤖 AI Summary
Existing large language models (LLMs) degrade significantly on multi-constraint instruction-following tasks as the number of constraints grows, yet no systematic benchmark exists for rigorous evaluation. Method: We introduce WildIFEval, the first real-world multi-constraint instruction dataset (12K samples), featuring human-annotated, semantically rich constraints across eight categories spanning lexical, topical, and logical dimensions. We propose a constraint-taxonomy-guided data construction paradigm with expert curation and design a multi-model collaborative evaluation framework. Results: Empirical analysis reveals that state-of-the-art LLMs suffer an average performance drop of over 30% on multi-constraint instructions, with constraint-type effects inducing performance fluctuations of up to 42%, highlighting critical bottlenecks in constraint reasoning and integration. This work establishes the first high-quality, reproducible benchmark for complex instruction evaluation and provides foundational resources for constraint-aware modeling.
📝 Abstract
Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 12K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints expressed in natural user prompts. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. Our findings reveal that all evaluated models degrade as the number of constraints increases, leaving considerable room for improvement on such tasks. Moreover, we observe that the specific type of constraint plays a critical role in model performance. We release our dataset to promote further research on instruction following under complex, realistic conditions.