🤖 AI Summary
Existing lexical simplification (LS) evaluation methods focus on substituting individual difficult words and therefore cannot assess sentence-level simplification quality, particularly contextual modeling and stepwise simplification. This work proposes an end-to-end, sentence-level LS evaluation paradigm tailored to large language models (LLMs). The authors design a human-in-the-loop, full-coverage annotation protocol and develop a multi-LLM collaborative framework that explicitly simulates the three-stage LS process (complex word identification, substitute generation, and substitute ranking), thereby overcoming the limitations of single-prompt simplification. On a newly constructed benchmark, the method significantly outperforms all baseline approaches, providing what the authors describe as the first systematic, reproducible assessment of LLMs' holistic sentence-simplification capability and empirically validating the proposed end-to-end evaluation paradigm.
📝 Abstract
Lexical Simplification (LS) methods use a three-step pipeline: complex word identification, substitute generation, and substitute ranking, each with separate evaluation datasets. We found that large language models (LLMs) can simplify sentences directly with a single prompt, bypassing the traditional pipeline. However, existing LS datasets are not suitable for evaluating these LLM-generated simplified sentences, as they provide substitutes for single complex words without identifying all complex words in a sentence. To address this gap, we propose a new annotation method for constructing an all-in-one LS dataset through human-machine collaboration. Automated methods generate a pool of potential substitutes, which human annotators then assess, suggesting additional alternatives as needed. Additionally, we explore LLM-based methods with single prompts, in-context learning, and chain-of-thought techniques. We introduce a multi-LLM collaboration approach that simulates each step of the LS task. Experimental results demonstrate that the multi-LLM approach significantly outperforms existing baselines.
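The three-step pipeline the abstract describes (identification, substitute generation, substitute ranking) can be sketched as a chain of separate model calls. The sketch below is a hypothetical illustration, not the paper's implementation: `LLM` stands in for any text-in/text-out model interface, and the prompt strings, comma-separated reply format, and function names are all assumptions made for the example.

```python
from typing import Callable, List

# An "LLM" here is any callable that maps a prompt string to a reply string.
LLM = Callable[[str], str]


def identify_complex_words(sentence: str, llm: LLM) -> List[str]:
    """Stage 1: ask one model to list the complex words in the sentence."""
    reply = llm(f"List the complex words in: {sentence}")
    return [w.strip() for w in reply.split(",") if w.strip()]


def generate_substitutes(sentence: str, word: str, llm: LLM) -> List[str]:
    """Stage 2: ask a second model for simpler, context-preserving substitutes."""
    reply = llm(f"Suggest simpler substitutes for '{word}' in: {sentence}")
    return [w.strip() for w in reply.split(",") if w.strip()]


def rank_and_pick(sentence: str, word: str, candidates: List[str], llm: LLM) -> str:
    """Stage 3: ask a third model to rank the candidates; keep the top one."""
    reply = llm(f"Rank {candidates} as replacements for '{word}' in: {sentence}")
    return reply.split(",")[0].strip()


def simplify(sentence: str, identifier: LLM, generator: LLM, ranker: LLM) -> str:
    """End-to-end: replace every identified complex word with its best substitute."""
    for word in identify_complex_words(sentence, identifier):
        candidates = generate_substitutes(sentence, word, generator)
        if candidates:
            best = rank_and_pick(sentence, word, candidates, ranker)
            sentence = sentence.replace(word, best)
    return sentence
```

Because each stage takes its own model, the same skeleton covers both the single-LLM setting (pass the same callable three times) and the multi-LLM collaboration setting (pass a different model per stage).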