AesFormer: Transform Everyday Photos into Beautiful Memories

📅 2026-05-21

🤖 AI Summary

This work targets photos whose aesthetics are impaired by structural flaws such as poor composition, camera viewpoint, or pose, which existing retouching and portrait enhancement methods cannot fix. We formulate the Aesthetic Photo Reconstruction (APR) task: improving a photo's aesthetic quality through structural reconstruction while preserving subject identity and scene semantics. Our method, AesFormer, is a two-stage framework that decouples aesthetic planning from image editing. Stage 1 (AesThinker) analyzes the input along seven progressive photographic dimensions and outputs executable editing actions, with GRPO-A applied beyond SFT to encourage exploration of diverse action plans. Stage 2 (AesEditor) is an action-conditioned editor that performs the structural edits guided by those actions. We further contribute VCMP, a video-based corpus-mining pipeline, and AesRecon, a benchmark of 9,071 strictly aligned (poor, good) image pairs. Experiments show that AesFormer substantially improves APR performance and is competitive with Nano Banana Pro.

📝 Abstract

In everyday photography, aesthetically appealing moments are often captured with structural flaws (e.g., composition, camera viewpoint, or pose) that existing retouching and portrait enhancement methods cannot fix. We formulate Aesthetic Photo Reconstruction (APR) as improving a photo's aesthetic quality via structural reconstruction while preserving subject identity and scene semantics. Although recent advances in image editing models make APR feasible, they often lack aesthetic understanding, yielding edits that are semantically plausible yet aesthetically weak. To address this, we propose AesFormer, a two-stage framework that decouples aesthetic planning from image editing. In Stage 1, an aesthetic action model (AesThinker) analyzes the input along seven progressive photographic dimensions and outputs executable editing actions; we further apply GRPO-A to encourage broad exploration over diverse action plans beyond SFT. In Stage 2, an action-conditioned editor (AesEditor) performs structural edits guided by these actions. To support APR, we build a video-based corpus-mining pipeline (VCMP) and construct AesRecon, a benchmark of 9,071 strictly aligned (poor, good) image pairs. Experiments show that AesFormer substantially improves APR performance and is competitive with Nano Banana Pro. Code is available at https://github.com/PKU-ICST-MIPL/AesFormer_ICML2026.
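
The plan-then-edit split described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `EditAction`, `reconstruct`, `stub_planner`, and `stub_editor` are hypothetical, the photo is represented as raw bytes for simplicity, and the dimension labels are illustrative only (the abstract does not list the seven dimensions). The stubs stand in for AesThinker and AesEditor so that the control flow is runnable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class EditAction:
    """One executable editing action (hypothetical structure)."""
    dimension: str    # illustrative label, e.g. "composition"
    instruction: str  # natural-language instruction for the editor


# Stage 1 maps a photo to an action plan; Stage 2 applies the plan.
Planner = Callable[[bytes], List[EditAction]]
Editor = Callable[[bytes, List[EditAction]], bytes]


def reconstruct(
    photo: bytes, plan_actions: Planner, apply_actions: Editor
) -> Tuple[bytes, List[EditAction]]:
    """Run planning first, then action-conditioned editing."""
    actions = plan_actions(photo)
    if not actions:
        # Nothing to fix: return the input unchanged.
        return photo, actions
    return apply_actions(photo, actions), actions


def stub_planner(photo: bytes) -> List[EditAction]:
    # Placeholder for an aesthetic action model; returns a fixed plan.
    return [
        EditAction("composition", "move the subject off-center"),
        EditAction("pose", "straighten the subject's shoulders"),
    ]


def stub_editor(photo: bytes, actions: List[EditAction]) -> bytes:
    # Placeholder for an image editor; tags the bytes instead of editing pixels.
    for action in actions:
        photo += b"|" + action.dimension.encode()
    return photo


if __name__ == "__main__":
    edited, plan = reconstruct(b"img", stub_planner, stub_editor)
    print([a.dimension for a in plan], edited)
```

Keeping the planner and the editor behind separate callables mirrors the decoupling the abstract describes: either stage could be swapped or retrained without touching the other, as long as the action format stays fixed.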

Problem

Research questions and friction points this paper is trying to address.

Aesthetic Photo Reconstruction

structural flaws

aesthetic quality

image editing

photographic aesthetics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Aesthetic Photo Reconstruction

Two-stage Framework

Aesthetic Action Planning