🤖 AI Summary
Existing image simplification methods often rely on non-photorealistic rendering, which reduces visual complexity at the cost of photographic realism. This work proposes a progressive semantic image simplification framework that iteratively reduces scene complexity through a Select-Remove-Verify loop while keeping every intermediate image a plausible natural photograph. A vision-language model identifies and prioritizes the elements to remove; generative editing removes and inpaints them; and a learned verifier checks photorealism and coherence at each step. To improve efficiency, the pipeline is distilled into an image-to-video generation model that predicts a coherent simplification sequence directly from a single input image. The approach supports applications such as content-aware decluttering, semantic layer decomposition, and interactive editing.
📝 Abstract
Existing image simplification techniques often rely on Non-Photorealistic Rendering (NPR), transforming photographs into stylized sketches, cartoons, or paintings. While effective at reducing visual complexity, such approaches typically sacrifice photographic realism. In this work, we explore a complementary direction: simplifying images while preserving their photorealistic appearance. We introduce progressive semantic image simplification, a framework that iteratively reduces scene complexity by removing and inpainting elements in a controlled manner. At each step, the resulting image remains a plausible natural photograph. Our method combines semantic understanding with generative editing, leveraging Vision-Language Models (VLMs) to identify and prioritize elements for removal, and a learned verifier to ensure photorealism and coherence throughout the process. This is implemented via an iterative Select-Remove-Verify pipeline that produces high-quality simplification trajectories. To improve efficiency, we further distill this process into an image-to-video generation model that directly predicts coherent simplification sequences from a single input image. Beyond generating cleaner and more focused compositions, our approach enables applications such as content-aware decluttering, semantic layer decomposition, and interactive editing. More broadly, our work suggests that simplification through structured content removal can serve as a practical mechanism for guiding visual interpretation within the photorealistic domain, complementing traditional abstraction methods.
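The iterative Select-Remove-Verify loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function `simplify_progressively`, its parameters `max_steps` and `max_candidates`, and the toy `Scene` representation (a dictionary of element names and importance scores) are all assumptions made for illustration. In a real system, `select` would presumably wrap a VLM, `remove` a generative inpainting model, and `verify` the learned verifier; here they are replaced by trivial stand-ins so the control flow can be run as is.

```python
from typing import Callable, Dict, List

# Toy stand-in for an image: element name -> importance score (hypothetical).
Scene = Dict[str, float]


def simplify_progressively(
    image: Scene,
    select: Callable[[Scene], List[str]],
    remove: Callable[[Scene, str], Scene],
    verify: Callable[[Scene, Scene], bool],
    max_steps: int = 10,
    max_candidates: int = 2,
) -> List[Scene]:
    """Return the trajectory [input, step 1, step 2, ...] of accepted edits."""
    trajectory = [image]
    current = image
    for _ in range(max_steps):
        accepted = None
        # Select: ranked removal candidates, least important first.
        for element in select(current)[:max_candidates]:
            # Remove: propose an edited image without this element.
            proposal = remove(current, element)
            # Verify: keep the proposal only if it passes the check.
            if verify(current, proposal):
                accepted = proposal
                break
        if accepted is None:
            break  # no candidate passed verification; stop simplifying
        current = accepted
        trajectory.append(current)
    return trajectory


# Trivial stand-ins (assumptions) for the VLM, the editor, and the verifier.
def toy_select(scene: Scene) -> List[str]:
    return sorted(scene, key=scene.get)


def toy_remove(scene: Scene, element: str) -> Scene:
    return {name: score for name, score in scene.items() if name != element}


def toy_verify(before: Scene, after: Scene) -> bool:
    # Pretend an edit is only acceptable while the main subject survives.
    return "person" in after


scene = {"person": 0.9, "tree": 0.5, "sign": 0.2, "trash can": 0.1}
trajectory = simplify_progressively(scene, toy_select, toy_remove, toy_verify)
for step, state in enumerate(trajectory):
    print(step, sorted(state))
```

In this toy run the loop removes the trash can, the sign, and the tree in that order, then stops because removing the person fails verification. Keeping the three stages as injected callables is a design choice of the sketch: it lets each stage be swapped for a real model without changing the loop.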