UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

📅 2026-04-17

🤖 AI Summary
Existing evaluation methods for visual editing are fragmented and lack a unified benchmark, and video editing assessment is particularly underdeveloped. Automatic metrics often misalign with human preferences, while evaluation with large multimodal models incurs prohibitive computational cost. To address these limitations, this work proposes UniEditBench, the first unified benchmark for both image and video editing. It covers nine image and eight video editing categories and provides a structured task taxonomy that supports complex compositional edits. The authors distill the Qwen3-VL-235B-A22B Instruct vision-language model into lightweight 4B/8B evaluators that produce multidimensional scores for structural fidelity, text alignment, background consistency, naturalness, and, for videos, spatiotemporal coherence. The distilled evaluators correlate strongly with human judgments while substantially reducing deployment cost, offering a reproducible and scalable evaluation protocol for visual editing. The benchmark and associated reward models are publicly released.
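To make the multidimensional scoring concrete, the following is a minimal sketch of how one evaluator output could be represented. The class name `EditScores`, the numeric scale, and the unweighted-mean aggregate are illustrative assumptions; the summary does not specify the score range or how dimensions are combined.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Optional


@dataclass(frozen=True)
class EditScores:
    """Hypothetical container for one evaluator output (not the paper's schema)."""

    structural_fidelity: float
    text_alignment: float
    background_consistency: float
    naturalness: float
    # Only meaningful for video edits; left as None for images.
    spatiotemporal_coherence: Optional[float] = None

    def dimensions(self) -> Dict[str, float]:
        """Return the dimensions that apply to this sample."""
        dims = {
            "structural_fidelity": self.structural_fidelity,
            "text_alignment": self.text_alignment,
            "background_consistency": self.background_consistency,
            "naturalness": self.naturalness,
        }
        if self.spatiotemporal_coherence is not None:
            dims["spatiotemporal_coherence"] = self.spatiotemporal_coherence
        return dims

    def overall(self) -> float:
        """Unweighted mean over applicable dimensions (an assumed aggregate)."""
        return mean(list(self.dimensions().values()))


# Example values are made up for illustration.
image_scores = EditScores(4.0, 5.0, 3.0, 4.0)
video_scores = EditScores(4.0, 5.0, 3.0, 4.0, spatiotemporal_coherence=2.0)
```

Keeping the video-only dimension optional lets image and video samples share one record type, which mirrors the shared protocol described above.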

📝 Abstract
The evaluation of visual editing models remains fragmented across methods and modalities. Existing benchmarks are often tailored to specific paradigms, making fair cross-paradigm comparisons difficult, while video editing lacks reliable evaluation benchmarks. Furthermore, common automatic metrics often misalign with human preferences, yet directly deploying multimodal large language models (MLLMs) as evaluators incurs prohibitive computational and financial costs. We present UniEditBench, a unified benchmark for image and video editing that supports reconstruction-based and instruction-driven methods under a shared protocol. UniEditBench includes a structured taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, Reorder) and eight video operations, with coverage of challenging compositional tasks such as counting and spatial reordering. To enable scalable evaluation, we distill a high-capacity MLLM judge (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators that provide multi-dimensional scoring over structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency (for videos). Experiments show that the distilled evaluators maintain strong agreement with human judgments and substantially reduce deployment cost relative to the teacher model. UniEditBench provides a practical and reproducible protocol for benchmarking modern visual editing methods. Our benchmark and the associated reward models are publicly available at https://github.com/wesar1/UniEditBench.
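Agreement between an evaluator and human raters is commonly quantified with a rank correlation. The sketch below computes Spearman's rank correlation in plain Python; the choice of Spearman is an assumption for illustration, since the abstract does not name the agreement metric it uses, and the function names are hypothetical.

```python
import math
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))
```

In use, `x` would hold one evaluator's scores and `y` the corresponding human ratings for the same edited samples, computed per scoring dimension.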
Problem

Research questions and friction points this paper is trying to address.

visual editing evaluation
unified benchmark
video editing
automatic metrics
human preference alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified benchmark
distilled MLLMs
visual editing evaluation
cost-effective evaluation
multimodal knowledge distillation