🤖 AI Summary
Current LLM instruction-following evaluation faces three key challenges: (1) human evaluation is subjective and costly; (2) LLM-as-a-judge introduces systematic biases; and (3) programmatic benchmarks lack expressive power for fine-grained, compositional lexical constraints. To address these, we propose the first formal rule-based framework for fine-grained lexical instruction evaluation. Our method parses complex instructions into verifiable subject-predicate-object triples, constructs a human-in-the-loop, multi-stage data generation pipeline, and integrates both a programmable verification engine and LLM-as-a-judge comparative analysis. We publicly release a high-quality dataset and evaluation toolkit. This work enables the first objective, interpretable, and reproducible automated assessment of compositional lexical instructions—significantly improving evaluation transparency, granularity, and fidelity.
📝 Abstract
The ability of Large Language Models (LLMs) to precisely follow complex and fine-grained lexical instructions is a cornerstone of their utility and controllability. However, evaluating this capability remains a significant challenge. Current methods either rely on subjective and costly human evaluation or on automated LLM-as-a-judge systems, which suffer from inherent biases and unreliability. Existing programmatic benchmarks, while objective, often lack the expressiveness to test intricate, compositional constraints at a granular level. To address these limitations, we introduce LexInstructEval, a new benchmark and evaluation framework for fine-grained lexical instruction following. Our framework is built upon a formal, rule-based grammar that deconstructs complex instructions into canonical subject-predicate-object triplets. This grammar enables the systematic generation of a diverse dataset through a multi-stage, human-in-the-loop pipeline and facilitates objective verification via a transparent, programmatic engine. We release our dataset and open-source evaluation tools to facilitate further research into the controllability and reliability of LLMs.
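To make the triple-based verification idea concrete, the following is a minimal sketch of the kind of deterministic check such a programmatic engine could perform. The triple format, predicate names, and function signatures here are illustrative assumptions, not the paper's actual grammar:

```python
# Hypothetical sketch: a lexical instruction is deconstructed into
# (subject, predicate, object) triples, each checked deterministically
# against the model response. Predicate names are assumptions for
# illustration only.

def verify_triple(response: str, triple: tuple) -> bool:
    subject, predicate, obj = triple  # subject is e.g. "response"
    words = response.split()
    if predicate == "contains_word":
        return obj in words
    if predicate == "excludes_word":
        return obj not in words
    if predicate == "max_word_count":
        return len(words) <= obj
    if predicate == "starts_with":
        return response.startswith(obj)
    raise ValueError(f"unknown predicate: {predicate}")

def verify_instruction(response: str, triples: list) -> bool:
    # A compositional instruction passes only if every triple holds.
    return all(verify_triple(response, t) for t in triples)
```

Because each predicate is a pure function of the response text, every pass/fail verdict is reproducible and can be traced back to the exact constraint that was violated, which is the transparency property the abstract emphasizes.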