🤖 AI Summary
This paper addresses the challenge of reconstructing geometrically consistent, photorealistic gigapixel images from a minimal set of handheld smartphone photos (specifically, 1–3 globally low-resolution images augmented with local close-ups) captured under uncontrolled outdoor conditions. Methodologically, it introduces (i) an instance-level paired data construction strategy and (ii) a lightweight cross-scale image registration technique that enables robust scale estimation and degradation alignment across diverse materials and complex degradations. It further proposes an end-to-end framework built on an adapter-augmented pretrained generative model, integrating sliding-window inference, explicit degradation modeling, and multi-scale consistency constraints. Experiments demonstrate substantial improvements in the geometric fidelity and texture realism of super-resolved outputs, enabling seamless zooming and interactive gigapixel browsing. The approach establishes a novel, cost-effective paradigm for ultra-high-resolution imaging.
📝 Abstract
We present UltraZoom, a system for generating gigapixel-resolution images of objects from casually captured inputs, such as handheld phone photos. Given a full-shot image (global, low-detail) and one or more close-ups (local, high-detail), UltraZoom upscales the full image to match the fine detail and scale of the close-up examples. To achieve this, we construct a per-instance paired dataset from the close-ups and adapt a pretrained generative model to learn object-specific low-to-high resolution mappings. At inference, we apply the model in a sliding-window fashion over the full image. Constructing these pairs is non-trivial: it requires registering the close-ups within the full image for scale estimation and degradation alignment. We introduce a simple, robust method for registering close-ups on arbitrary materials in casual, in-the-wild captures. Together, these components form a system that enables seamless pan and zoom across the entire object, producing consistent, photorealistic gigapixel imagery from minimal input.
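The sliding-window inference step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`tile_starts`, `sliding_window_upscale`, `nearest_neighbor_upscale`), the tile size, the overlap, and the uniform averaging of overlapping regions are all assumptions made for illustration, and a nearest-neighbor upscaler stands in for the adapted generative model.

```python
import numpy as np


def tile_starts(length, tile, stride):
    """Start offsets so that tiles of size `tile` cover [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # final tile is flush with the border
    return starts


def sliding_window_upscale(image, upscale_tile, scale, tile=64, overlap=16):
    """Upscale `image` (H, W, C) tile by tile and average the overlaps.

    `upscale_tile` is a hypothetical callable that maps a patch of shape
    (h, w, C) to (h * scale, w * scale, C); in a real system it would be
    the per-instance adapted model.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    h, w, c = image.shape
    out = np.zeros((h * scale, w * scale, c), dtype=np.float64)
    weight = np.zeros((h * scale, w * scale, 1), dtype=np.float64)
    stride = tile - overlap
    for y in tile_starts(h, tile, stride):
        for x in tile_starts(w, tile, stride):
            up = upscale_tile(image[y:y + tile, x:x + tile])
            ys, xs = y * scale, x * scale
            out[ys:ys + up.shape[0], xs:xs + up.shape[1]] += up
            weight[ys:ys + up.shape[0], xs:xs + up.shape[1]] += 1.0
    # Every output pixel is covered by at least one tile, so weight >= 1.
    return out / weight


def nearest_neighbor_upscale(patch, scale):
    """Stand-in for the model: repeat each pixel `scale` times per axis."""
    return np.repeat(np.repeat(patch, scale, axis=0), scale, axis=1)


if __name__ == "__main__":
    full_shot = np.random.default_rng(0).random((100, 70, 3))
    result = sliding_window_upscale(
        full_shot, lambda p: nearest_neighbor_upscale(p, 2), scale=2
    )
    print(result.shape)
```

Uniform averaging is the simplest way to merge overlapping tiles; a feathered or Gaussian weight map is a common alternative when tile seams are visible, though the abstract does not say which blending scheme UltraZoom uses.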