When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work lacks systematic evaluation of large language models’ (LLMs) ability to simultaneously follow multiple instructions—a critical yet underexplored capability. Method: We introduce two dedicated benchmarks—ManyIFEval for text generation and StyleMBPP for code generation—covering diverse multi-instruction combinations. To enable efficient assessment, we propose lightweight regression models (e.g., logistic regression) that predict model performance using features such as instruction count, generalizing to unseen instruction sets and arbitrary instruction numbers. Results: Experiments reveal a pronounced performance degradation with increasing instruction count; our models achieve accurate predictions (within ~10% error) using only 300–500 samples, drastically reducing evaluation overhead. Our core contributions are: (1) the first systematic benchmarking framework for multi-instruction following, (2) a generalizable performance prediction methodology, and (3) the first quantitative characterization of the inverse relationship between instruction count and adherence performance.

📝 Abstract
As large language models (LLMs) are increasingly applied to real-world scenarios, it becomes crucial to understand their ability to follow multiple instructions simultaneously. To systematically evaluate these capabilities, we introduce two specialized benchmarks for fundamental domains where following multiple instructions is important: Many Instruction-Following Eval (ManyIFEval) for text generation with up to ten instructions, and Style-aware Mostly Basic Programming Problems (StyleMBPP) for code generation with up to six instructions. Our experiments with the created benchmarks across ten LLMs reveal that performance consistently degrades as the number of instructions increases. Furthermore, because evaluating all possible combinations of multiple instructions is computationally impractical in actual use cases, we developed three types of regression models that can estimate performance on both unseen instruction combinations and numbers of instructions not used during training. We demonstrate that a logistic regression model using instruction count as an explanatory variable can predict performance on following multiple instructions with approximately 10% error, even for unseen instruction combinations. We show that relatively modest sample sizes (500 for ManyIFEval and 300 for StyleMBPP) are sufficient for performance estimation, enabling efficient evaluation of LLMs under various instruction combinations.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to follow multiple instructions simultaneously
Systematically measuring performance degradation as instruction count increases
Developing models to estimate performance on unseen instruction combinations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced two specialized multi-instruction benchmarks
Developed regression models for performance estimation
Logistic regression predicts performance with approximately 10% error
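The estimation idea above can be sketched in a few lines: fit a logistic regression with instruction count as the sole explanatory variable, then extrapolate the predicted adherence rate to instruction counts not seen during training. The data below is synthetic and the decay parameters are invented for illustration; this is not the paper's actual model or benchmark data.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic training data (hypothetical): each sample is a pair
# (instruction_count, followed_all_instructions). The true success
# rate decays with instruction count, mimicking the reported trend.
random.seed(0)
data = []
for n in range(1, 6):  # train only on 1-5 instructions
    p_true = sigmoid(3.0 - 0.8 * n)
    for _ in range(200):
        data.append((n, 1 if random.random() < p_true else 0))

# Fit P(success) = sigmoid(w0 + w1 * n) by plain gradient
# descent on the mean log-loss.
w0, w1 = 0.0, 0.0
lr = 0.05
for _ in range(2000):
    g0 = g1 = 0.0
    for n, y in data:
        err = sigmoid(w0 + w1 * n) - y
        g0 += err
        g1 += err * n
    w0 -= lr * g0 / len(data)
    w1 -= lr * g1 / len(data)

# Extrapolate to instruction counts unseen during training.
for n in (6, 8, 10):
    print(f"n={n}: predicted success rate {sigmoid(w0 + w1 * n):.2f}")
```

Because the model has only two parameters, a few hundred labeled samples suffice to fit it, which is the source of the efficiency claim: predicted adherence for any instruction count comes from the fitted curve rather than from exhaustively evaluating every instruction combination.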
Keno Harada
The University of Tokyo
Yudai Yamazaki
Kyoto University
Masachika Taniguchi
University of the Ryukyus
Edison Marrese-Taylor
National Institute of Advanced Industrial Science and Technology (AIST)
Takeshi Kojima
The University of Tokyo
Yusuke Iwasawa
The University of Tokyo
Yutaka Matsuo
The University of Tokyo