Understanding Generative AI Capabilities in Everyday Image Editing Tasks

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically investigates the practical capability boundaries of generative AI for everyday image editing. Method: leveraging 83k real-world user requests and 305k professional human edits, we conduct a comprehensive evaluation that combines Reddit community analysis, multi-model benchmarking of AI editors (GPT-4o, Gemini-2.0-Flash, SeedEdit), VLM-as-judge evaluation (o1), human annotation, and qualitative case studies. Contribution/Results: we find that current AI editors reliably fulfill only about 33% of everyday editing requests; that they perform significantly worse on precise, low-creativity tasks (especially those requiring identity preservation) than on creative, open-ended ones; and that VLM judges are systematically misaligned with human preferences. Key failure modes include identity distortion and unintended, non-requested modifications. To support reproducible research, we open-source the PSR dataset and an end-to-end evaluation framework, providing empirically grounded design principles and a standardized benchmark for AI-powered photo editing systems.
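For concreteness, here is a minimal sketch (not the paper's released framework) of how the per-model fulfillment rate behind the 33% figure could be computed from human ratings. The ratings.csv file and its model/fulfilled columns are illustrative assumptions.

```python
# Hypothetical sketch: per-model fulfillment rate from human ratings.
# Assumes one CSV row per (request, model) pair with a binary "fulfilled"
# verdict; this schema is assumed, not the paper's actual data format.
import csv
from collections import defaultdict

def fulfillment_rates(path: str) -> dict[str, float]:
    """Return the fraction of requests each AI editor fulfilled."""
    counts = defaultdict(lambda: [0, 0])  # model -> [fulfilled, total]
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            stats = counts[row["model"]]
            stats[0] += row["fulfilled"] == "1"
            stats[1] += 1
    return {model: done / total for model, (done, total) in counts.items()}

if __name__ == "__main__":
    for model, rate in sorted(fulfillment_rates("ratings.csv").items()):
        print(f"{model}: {rate:.1%} of requests fulfilled")
```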

📝 Abstract
Generative AI (GenAI) holds significant promise for automating everyday image editing tasks, especially following the release of GPT-4o's native image generation on March 25, 2025. However, what subjects do people most often want edited? What kinds of editing actions do they want to perform (e.g., removing or stylizing the subject)? Do people prefer precise edits with predictable outcomes or highly creative ones? By understanding the characteristics of real-world requests and the corresponding edits made by freelance photo-editing wizards, can we draw lessons for improving AI-based editors and determine which types of requests can currently be handled successfully by AI editors? In this paper, we present a unique study addressing these questions by analyzing 83k requests from the past 12 years (2013-2025) on the Reddit r/PhotoshopRequest (PSR) community, which collected 305k wizard edits. According to human ratings, only approximately 33% of requests can be fulfilled by the best AI editors (including GPT-4o, Gemini-2.0-Flash, and SeedEdit). Interestingly, AI editors perform worse on low-creativity requests that require precise editing than on more open-ended tasks. They often struggle to preserve the identity of people and animals, and frequently make non-requested touch-ups. On the judging side, VLM judges (e.g., o1) behave differently from human judges and may prefer AI edits over human edits. Code and qualitative examples are available at: https://psrdataset.github.io
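To make the judge-misalignment finding concrete, below is a minimal sketch, assuming each evaluation record stores which edit (AI or human wizard) the human raters and a VLM judge each preferred; the record layout and field names are illustrative assumptions, not the paper's evaluation code.

```python
# Hypothetical sketch: quantify how often a VLM judge agrees with human
# raters, and how often each side prefers the AI edit over the wizard edit.
def judge_alignment(records: list[dict]) -> tuple[float, float, float]:
    """Return (agreement rate, human prefers-AI rate, VLM prefers-AI rate)."""
    agree = human_ai = vlm_ai = 0
    for r in records:  # r = {"human_pick": "ai" | "wizard", "vlm_pick": ...}
        agree += r["human_pick"] == r["vlm_pick"]
        human_ai += r["human_pick"] == "ai"
        vlm_ai += r["vlm_pick"] == "ai"
    n = len(records)
    return agree / n, human_ai / n, vlm_ai / n

# Toy example: the VLM judge favors AI edits more often than the humans do.
demo = [
    {"human_pick": "wizard", "vlm_pick": "ai"},
    {"human_pick": "wizard", "vlm_pick": "wizard"},
    {"human_pick": "ai", "vlm_pick": "ai"},
]
agreement, human_ai, vlm_ai = judge_alignment(demo)
print(f"agreement={agreement:.0%}, human->AI={human_ai:.0%}, VLM->AI={vlm_ai:.0%}")
```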
Problem

Research questions and friction points this paper is trying to address.

Identify common subjects and editing actions in everyday image requests
Compare AI editor performance on precise versus creative editing tasks
Evaluate discrepancies between human and AI judge preferences in edits
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing 83k image editing requests from Reddit (see the sketch after this list)
Comparing AI and human editing performance
Evaluating AI editors with human and VLM judges
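As a toy illustration of the request-categorization step above, the sketch below tallies the most common subjects and editing actions from an annotated request file; psr_requests.csv and its column names are hypothetical, not the released PSR schema.

```python
# Hypothetical sketch: count the most-requested subjects and edit actions.
import csv
from collections import Counter

subjects, actions = Counter(), Counter()
with open("psr_requests.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        subjects[row["subject"]] += 1  # e.g., "person", "pet", "background"
        actions[row["action"]] += 1    # e.g., "remove", "restore", "stylize"

print("Top subjects:", subjects.most_common(5))
print("Top actions:", actions.most_common(5))
```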