🤖 AI Summary
To address the challenge casual users face in composing aesthetically pleasing photographs, this paper proposes PhotoFramer, a multimodal composition instruction framework that organizes composition guidance into a hierarchy of sub-tasks: shift, zoom-in, and view-change. To support training, the authors curate a large-scale dataset: shift and zoom-in data are sampled from existing cropping datasets, while view-change data come from a two-stage pipeline that first trains a degradation model on multi-view image pairs to turn well-composed photos into poorly composed ones, then applies it to expert-taken photos to synthesize training pairs. The framework is built upon a multimodal large language model and fine-tuned end-to-end to jointly produce actionable textual instructions and well-composed example images. Experiments demonstrate that joint guidance from text instructions and example images consistently outperforms example-only baselines, yielding a +12.3% improvement in composition quality and effectively enhancing users' compositional capability. To foster reproducibility and further research, the model, code, and dataset are fully open-sourced.
📝 Abstract
Composition matters during the photo-taking process, yet many casual users struggle to frame well-composed images. To provide composition guidance, we introduce PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates a well-composed example image. To train such a model, we curate a large-scale dataset. Inspired by how humans take photos, we organize composition guidance into a hierarchy of sub-tasks: shift, zoom-in, and view-change. Shift and zoom-in data are sampled from existing cropping datasets, while view-change data are obtained via a two-stage pipeline. First, we sample pairs with varying viewpoints from multi-view datasets and train a degradation model that transforms well-composed photos into poorly composed ones. Second, we apply this degradation model to expert-taken photos, synthesizing poorly composed counterparts that form training pairs. Using this dataset, we finetune a model that jointly processes and generates both text and images, enabling actionable textual guidance with illustrative examples. Extensive experiments demonstrate that textual instructions effectively steer image composition, and coupling them with exemplars yields consistent improvements over exemplar-only baselines. PhotoFramer offers a practical step toward composition assistants that make expert photographic priors accessible to everyday users. Code, model weights, and datasets are released at https://zhiyuanyou.github.io/photoframer.