LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios

📅 2024-11-11

🏛️ arXiv.org

📈 Citations: 4

✨ Influential: 0

🤖 AI Summary

Problem: Systematic evaluation of the instruction-following capability and stability of large language models (LLMs) in long-context scenarios is still lacking. Method: This paper introduces LIFBench, presented as the first dedicated benchmark for long-context instruction following, covering three application scenarios, eleven tasks, and 2,766 instructions expanded along three dimensions. It proposes LIFEval, a parameter-free, rule-based automated evaluation framework that removes the need for LLM-assisted scoring or human annotation, and it quantifies stability along three dimensions: context length, linguistic expression, and variable binding. Contribution/Results: Evaluation of 20 mainstream LLMs across six context-length intervals reveals significant performance decay and instability in all models. The work provides a scalable benchmark, an automated evaluation toolkit, and a multi-granularity diagnostic approach to support research on long-context alignment.

📝 Abstract

As Large Language Models (LLMs) evolve in natural language processing (NLP), their ability to stably follow instructions in long-context inputs has become critical for real-world applications. However, existing benchmarks seldom focus on instruction-following in long-context scenarios or stability on different inputs. To bridge this gap, we introduce LIFBench, a scalable dataset designed to evaluate LLMs' instruction-following capabilities and stability across long contexts. LIFBench comprises three long-context scenarios and eleven diverse tasks, featuring 2,766 instructions generated through an automated expansion method across three dimensions: length, expression, and variables. For evaluation, we propose LIFEval, a rubric-based assessment method that enables precise, automated scoring of complex LLM responses without reliance on LLM-assisted assessments or human judgment. This method allows for a comprehensive analysis of model performance and stability from multiple perspectives. We conduct detailed experiments on 20 prominent LLMs across six length intervals. Our work contributes LIFBench and LIFEval as robust tools for assessing LLM performance in complex and long-context settings, offering valuable insights to guide future advancements in LLM development.
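To make the idea of rubric-based scoring without an LLM judge more concrete, here is a minimal sketch. It is not LIFEval itself: the rubric items, their weights, and the `rubric_score` and `stability` helpers are hypothetical, and using the standard deviation of scores as a stability measure is an assumption made only for illustration.

```python
import re
from statistics import pstdev

# Hypothetical rubric: (description, weight, programmatic check).
# These rules are illustrative only and are not taken from LIFEval.
RUBRIC = [
    ("answer is a single line", 1.0,
     lambda r: "\n" not in r.strip()),
    ("answer is wrapped in square brackets", 1.0,
     lambda r: re.fullmatch(r"\[.*\]", r.strip()) is not None),
    ("answer has exactly three comma-separated items", 2.0,
     lambda r: len(r.strip().strip("[]").split(",")) == 3),
]


def rubric_score(response: str) -> float:
    """Return the weighted fraction of rubric checks that the response passes."""
    total = sum(weight for _, weight, _ in RUBRIC)
    earned = sum(weight for _, weight, check in RUBRIC if check(response))
    return earned / total


def stability(scores: list[float]) -> float:
    """Spread of scores across variants of one task (assumed measure).

    Uses the population standard deviation; lower means more stable.
    """
    return pstdev(scores)


# Score the responses a model gave to several rewordings of the same task.
responses = ["[a, b, c]", "a, b", "[a, b]\nextra"]
scores = [rubric_score(r) for r in responses]
spread = stability(scores)
```

Because every check is a deterministic function of the response text, the same response always receives the same score, which is the property that makes automated scoring possible without a judge model or human annotators.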

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' instruction-following in long-context scenarios

Assessing stability of LLMs across diverse long inputs

Lack of benchmarks for long-context instruction-following performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated expansion method for diverse instruction generation

Rubric-based assessment for precise automated scoring

Scalable dataset for long-context instruction evaluation
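Expanding instructions along the three dimensions named in the abstract (length, expression, and variables) can be pictured as follows. This is a sketch under stated assumptions, not the paper's actual expansion method: the Cartesian-product strategy, the `expand` helper, and all sample values are hypothetical.

```python
from itertools import product

# All values below are hypothetical placeholders, not LIFBench data.
LENGTHS = [4_000, 8_000]  # target context lengths, in tokens
EXPRESSIONS = [           # rewordings of the same underlying instruction
    "Return the item at position {k} of the list.",
    "Find the list entry at position {k} and output it.",
]
VARIABLES = [{"k": 3}, {"k": 7}]  # values bound into the template slots


def expand(lengths, expressions, variables):
    """Yield one instruction record per (length, expression, variable) combination."""
    for length, template, binding in product(lengths, expressions, variables):
        yield {
            "context_length": length,
            "instruction": template.format(**binding),
        }


instructions = list(expand(LENGTHS, EXPRESSIONS, VARIABLES))
```

With two options per dimension, one seed task yields eight instruction variants. Holding two dimensions fixed and varying the third then isolates how sensitive a model is to that dimension alone.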
