IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation

📅 2025-11-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM instruction-following evaluation faces two bottlenecks: high computational cost and low reliability. To address this, the authors propose IF-CRITIC, a fine-grained critique model. Methodologically, it introduces (1) a constraint checklist generation mechanism coupled with multi-stage critique filtering to construct high-quality, structured critique data, and (2) a constraint-level preference optimization framework that enables hierarchical, granular assessment grounded in instruction decomposition. Experiments show that IF-CRITIC outperforms strong LLM-as-a-Judge baselines, including DeepSeek-R1 and o4-mini, with substantial gains in both assessment accuracy and stability, while reducing computational overhead by over 30%. This work establishes an efficient and reliable paradigm for instruction-following evaluation.

📝 Abstract
Instruction following is a fundamental ability of Large Language Models (LLMs), requiring their generated outputs to follow multiple constraints imposed in input instructions. Numerous studies have attempted to enhance this ability through preference optimization or reinforcement learning based on reward signals from LLM-as-a-Judge. However, existing evaluation models for instruction following still have many deficiencies, such as substantial cost and unreliable assessments. To this end, we propose IF-CRITIC, an LLM critic that can provide efficient and reliable assessments of constraint following in instructions. We first develop a checklist generator to decompose instructions and generate constraint checklists. With the assistance of the checklists, we collect high-quality critique training data through a multi-stage critique filtering mechanism and employ a constraint-level preference optimization method to train IF-CRITIC. Extensive experiments demonstrate that the evaluation performance of IF-CRITIC surpasses strong LLM-as-a-Judge baselines, including DeepSeek-R1 and o4-mini. With the scalable reward signals provided by IF-CRITIC, LLMs can achieve substantial performance gains in instruction-following optimization at lower computational overhead than with strong LLM critic baselines.
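The abstract's pipeline, decompose an instruction into a constraint checklist and have the critic judge each constraint independently, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ConstraintVerdict`, `evaluate_response`, and `toy_judge` are hypothetical names, and a real critic would condition on the instruction and the model response, not just the constraint string.

```python
from dataclasses import dataclass

@dataclass
class ConstraintVerdict:
    constraint: str   # one item from the generated checklist
    satisfied: bool   # critic's pass/fail judgment for this constraint
    rationale: str    # short natural-language critique

def evaluate_response(checklist, judge):
    """Judge each constraint independently, then aggregate the
    pass/fail verdicts into an overall instruction-following score."""
    verdicts = [judge(c) for c in checklist]
    score = sum(v.satisfied for v in verdicts) / len(verdicts)
    return verdicts, score

# Toy judge standing in for the trained critic model (hypothetical):
checklist = [
    "Answer is written in formal English",
    "Answer contains exactly three bullet points",
    "Answer does not exceed 100 words",
]

def toy_judge(constraint):
    # Pretend exactly one constraint fails, to show partial credit.
    met = "three bullet" not in constraint
    return ConstraintVerdict(constraint, met, "stub rationale")

verdicts, score = evaluate_response(checklist, toy_judge)
print(round(score, 2))  # 2 of 3 constraints satisfied -> 0.67
```

The constraint-level aggregation is what makes the reward signal fine-grained: a response that violates one of three constraints earns partial credit instead of a single binary judgment.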
Problem

Research questions and friction points this paper is trying to address.

Evaluating instruction-following in LLMs with fine-grained constraint assessments
Reducing costs and improving reliability of LLM-as-a-Judge evaluations
Providing scalable reward signals for efficient instruction-following optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Checklist generator decomposes instructions into constraints
Multi-stage critique filtering collects high-quality training data
Constraint-level preference optimization trains efficient LLM critic
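The paper does not spell out its constraint-level preference optimization objective here, but a common way to realize such training is a DPO-style loss applied per constraint-level critique pair, where the "chosen" critique judges a constraint correctly and the "rejected" one does not. The sketch below is an assumption along those lines; `constraint_dpo_loss` and the sample log-probabilities are illustrative, not the paper's exact formulation.

```python
import math

def constraint_dpo_loss(logp_chosen, logp_rejected,
                        logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """DPO-style loss for one constraint-level critique pair:
    -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_chosen - logp_ref_chosen)
                     - (logp_rejected - logp_ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Averaging per-constraint losses over a checklist (toy log-probs):
pairs = [
    (-4.0, -6.0, -5.0, -5.0),
    (-3.5, -4.5, -4.0, -4.0),
]
loss = sum(constraint_dpo_loss(*p) for p in pairs) / len(pairs)
print(loss > 0)  # True: the averaged loss is positive
```

Optimizing at the constraint level rather than the whole-critique level gives the model a learning signal for each checklist item, which matches the fine-grained assessment the paper targets.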