DiffScore: Text Evaluation Beyond Autoregressive Likelihood

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of autoregressive language models in text quality evaluation: their unidirectional context introduces positional bias. The authors propose DiffScore, an evaluation paradigm based on masked reconstruction that uses a masked large diffusion language model to measure how recoverable a text is across a continuum of masking rates, yielding an evaluation hierarchy from local fluency to global coherence. Because every token is scored with full bidirectional context, positional bias is eliminated. Multi-timestep quality profiles decompose scores across masking rates, and a bidirectional pointwise mutual information (PMI) decomposition disentangles fluency from faithfulness. Experiments across ten benchmarks show that DiffScore consistently outperforms autoregressive baselines in both zero-shot and fine-tuned settings.
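
The summary does not give the formula behind the PMI decomposition, so the following is only one plausible reading, stated as an assumption. It uses the standard definition of pointwise mutual information; the symbols (source $x$, candidate text $y$, scoring model $p_\theta$) and the mapping of terms to "fluency" and "faithfulness" are illustrative and not taken from the paper, and what "bidirectional" means for the paper's estimator is not specified here.

```latex
% Standard PMI between a source x and a candidate y (assumed notation):
\mathrm{PMI}(x; y) = \log p_\theta(y \mid x) - \log p_\theta(y)

% Rearranged, the conditional score splits into two parts:
\log p_\theta(y \mid x)
  = \underbrace{\log p_\theta(y)}_{\text{source-independent term (fluency, assumed)}}
  + \underbrace{\mathrm{PMI}(x; y)}_{\text{source-dependent term (faithfulness, assumed)}}
```

Under this reading, a text can score well on the first term (it reads naturally on its own) while scoring poorly on the second (the source does little to make it more likely).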

📝 Abstract

Autoregressive language models are widely used for text evaluation; however, their left-to-right factorization introduces positional bias, i.e., early tokens are scored with only leftward context, conflating architectural asymmetry with true text quality. We propose masked reconstruction as an alternative paradigm, where every token is scored using full bidirectional context. We introduce DiffScore, an evaluation framework built on Masked Large Diffusion Language Models. By measuring text recoverability across continuous masking rates, DiffScore eliminates positional bias and naturally establishes an evaluation hierarchy from local fluency to global coherence. We further provide diagnostic tools unavailable to autoregressive frameworks: multi-timestep quality profiles that decompose scores across masking rates, and bidirectional PMI decomposition that disentangles fluency from faithfulness. Experiments across ten benchmarks show that DiffScore consistently outperforms autoregressive baselines in both zero-shot and fine-tuned settings. The code is released at: https://github.com/wenlai-lavine/DiffScore.
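
To make the idea of measuring recoverability across masking rates concrete, here is a minimal sketch. It is not the released DiffScore code: the function names, the `LogProbFn` interface, the chosen masking rates, and the Monte Carlo averaging are all assumptions made for illustration, and `toy_log_prob` is a hand-written stand-in for a real masked diffusion language model.

```python
import math
import random
from typing import Callable, Dict, List, Optional, Sequence

MASK = None  # marker for a masked position

# Hypothetical interface (an assumption): given the partially masked sequence
# and the original tokens, return log p(original token | visible context)
# for every masked index.
LogProbFn = Callable[[Sequence[Optional[str]], Sequence[str]], Dict[int, float]]


def toy_log_prob(masked: Sequence[Optional[str]],
                 original: Sequence[str]) -> Dict[int, float]:
    """Stand-in for a masked diffusion LM, NOT a real model.

    Recovery probability grows with the share of visible tokens and does
    not depend on where in the sequence the masked token sits.
    """
    n = len(masked)
    visible = sum(tok is not MASK for tok in masked)
    p = 0.1 + 0.8 * visible / max(1, n - 1)
    return {i: math.log(p) for i, tok in enumerate(masked) if tok is MASK}


def recoverability(tokens: List[str], log_prob_fn: LogProbFn, rate: float,
                   samples: int = 32, seed: int = 0) -> float:
    """Mean log-probability of masked tokens at one masking rate."""
    rng = random.Random(seed)
    n_mask = max(1, round(rate * len(tokens)))
    total, count = 0.0, 0
    for _ in range(samples):
        # Mask a random subset, so every token keeps context on both sides.
        hidden = set(rng.sample(range(len(tokens)), n_mask))
        masked = [MASK if i in hidden else t for i, t in enumerate(tokens)]
        scores = log_prob_fn(masked, tokens)
        total += sum(scores.values())
        count += len(scores)
    return total / count


def quality_profile(tokens: List[str], log_prob_fn: LogProbFn,
                    rates: Sequence[float] = (0.1, 0.3, 0.5, 0.7, 0.9)
                    ) -> Dict[float, float]:
    """Recoverability at each masking rate: low rates probe local context,
    high rates leave little context and probe global structure."""
    return {r: recoverability(tokens, log_prob_fn, r) for r in rates}


if __name__ == "__main__":
    text = "the cat sat on the mat and looked at me".split()
    for rate, score in quality_profile(text, toy_log_prob).items():
        print(f"mask rate {rate:.1f}: mean log-prob {score:.3f}")
```

Swapping `toy_log_prob` for a wrapper around an actual masked diffusion model would be the natural next step; how DiffScore aggregates the per-rate scores into a single number is not described in the abstract and is left open here.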

Problem

Research questions and friction points this paper is trying to address.

autoregressive bias

positional bias

text evaluation

masked reconstruction

language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

masked reconstruction

diffusion language models

positional bias