Cropper: Vision-Language Model for Image Cropping through In-Context Learning

📅 2024-08-14

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Conventional image cropping methods rely on dataset-specific training and generalize poorly, while existing large vision-language models (VLMs) lack effective adaptation mechanisms for fine-grained visual tasks such as cropping. Method: This paper proposes a fine-tuning-free, VLM-based in-context learning framework for cropping. It uses task-aware prompt retrieval to select relevant in-context examples automatically, and an iterative refinement strategy that accommodates subject awareness, aspect-ratio constraints, and free-form cropping requirements. Contribution/Results: Without any parameter updates, the framework outperforms state-of-the-art methods on several benchmarks. A user study confirms its advantage in both visual appeal and practical utility over prior methods.

📝 Abstract

The goal of image cropping is to identify visually appealing crops within an image. Conventional methods rely on specialized architectures trained on specific datasets, which are difficult to adapt to new requirements. Recent breakthroughs in large vision-language models (VLMs) have enabled visual in-context learning without explicit training. However, effective strategies for downstream vision tasks with VLMs remain largely unclear and underexplored. In this paper, we propose an effective approach to leverage VLMs for better image cropping. First, we propose an efficient prompt retrieval mechanism for image cropping to automate the selection of in-context examples. Second, we introduce an iterative refinement strategy to progressively enhance the predicted crops. The proposed framework, named Cropper, is applicable to a wide range of cropping tasks, including free-form cropping, subject-aware cropping, and aspect ratio-aware cropping. Extensive experiments and a user study demonstrate that Cropper significantly outperforms state-of-the-art methods across several benchmarks.

Problem

Research questions and friction points this paper is trying to address.

Adapting VLMs for image cropping without explicit training

Automating in-context example selection for cropping tasks

Improving crop predictions through an iterative refinement strategy
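
To make the idea of automated in-context example selection concrete, here is a minimal sketch of one plausible way to do it: rank a pool of annotated examples by embedding similarity to the query image and keep the top k. This summary does not state which similarity measure or image encoder the paper uses, so the cosine similarity, the toy two-dimensional embeddings, and the names `cosine_similarity` and `retrieve_examples` are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_examples(query_embedding, pool, k=2):
    """Return the k pool entries most similar to the query.

    `pool` is a list of (embedding, crop_box) pairs. In a real system the
    embeddings would come from an image encoder; that step is assumed here.
    """
    ranked = sorted(
        pool,
        key=lambda item: cosine_similarity(query_embedding, item[0]),
        reverse=True,
    )
    return ranked[:k]


# Toy pool: hypothetical embeddings paired with normalized crop boxes (x1, y1, x2, y2).
pool = [
    ([1.0, 0.0], (0.1, 0.1, 0.9, 0.9)),
    ([0.0, 1.0], (0.0, 0.2, 1.0, 0.8)),
    ([0.9, 0.1], (0.2, 0.0, 0.8, 1.0)),
]
selected = retrieve_examples([1.0, 0.0], pool, k=2)
selected_boxes = [box for _, box in selected]
```

The selected examples and their crop boxes would then be placed in the VLM prompt ahead of the query image, so the model sees cropping demonstrations that resemble the input.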

Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient prompt retrieval for in-context examples

Iterative refinement strategy for crop enhancement

Leveraging vision-language models for diverse cropping tasks
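
The iterative refinement idea can be sketched as a propose-and-score loop: score the current candidate crops, feed the ranked history back to a proposer, and repeat. The sketch below is an assumption-laden illustration, not the paper's implementation. The functions `refine_crop`, `toy_score`, and `toy_propose` are hypothetical; in practice the proposer would be a VLM call and the scorer some crop-quality measure, neither of which is specified in this summary.

```python
def refine_crop(initial_candidates, propose, score, iterations=3):
    """Iteratively grow a scored history of crops and return the best one.

    `propose` receives the history sorted best-first and returns new candidate
    crops; `score` maps a crop to a number (higher is better). Both are
    stand-ins supplied by the caller.
    """
    scored = [(score(c), c) for c in initial_candidates]
    for _ in range(iterations):
        scored.sort(key=lambda pair: pair[0], reverse=True)
        new_candidates = propose(scored)
        scored.extend((score(c), c) for c in new_candidates)
    return max(scored, key=lambda pair: pair[0])[1]


# Toy stand-ins so the loop can run without a model (assumptions, not the paper's components).
TARGET = (10, 10, 90, 90)  # hypothetical "ideal" crop in pixels: (x1, y1, x2, y2)


def toy_score(crop):
    """Negative L1 distance to the hypothetical ideal crop."""
    return -sum(abs(c - t) for c, t in zip(crop, TARGET))


def toy_propose(scored, step=5):
    """Perturb each coordinate of the current best crop by +/- step pixels."""
    best = scored[0][1]
    proposals = []
    for i in range(4):
        for delta in (-step, step):
            candidate = list(best)
            candidate[i] += delta
            proposals.append(tuple(candidate))
    return proposals


best = refine_crop([(0, 0, 100, 100)], toy_propose, toy_score, iterations=8)
```

Passing the scored history back to the proposer is the key design choice here: each round can build on the best crops found so far rather than starting from scratch. Task-specific requirements such as a fixed aspect ratio could be enforced by filtering or adjusting proposals before scoring.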

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs