🤖 AI Summary
This work addresses the problem of translating high-level human language instructions into professional-grade camera composition. Methodologically, it introduces PhotoBot, a natural-language-driven automated photography system built on a reference-image-guided language–vision co-reasoning paradigm: given a textual query, the system retrieves semantically relevant reference images from a curated gallery (a vision-language model and an object detector describe the gallery images as text, and a large language model performs retrieval via text-based reasoning); establishes correspondences between the chosen reference and the live scene using pretrained vision transformer (ViT) features; and computes camera pose adjustments by solving a perspective-n-point (PnP) problem, executed by a robotic arm with an RGB-D wrist camera. A key contribution is a cross-modal semantic composition transfer mechanism that generalizes to reference sources beyond photographs (e.g., paintings) without manual annotations or fine-tuning. User studies show that photographs taken by PhotoBot are often rated as more aesthetically pleasing than those taken by the users themselves, and experiments on a physical manipulator demonstrate the system's effectiveness in real-world deployment.
📝 Abstract
We introduce PhotoBot, a framework for fully automated photo acquisition based on an interplay between high-level human language guidance and a robot photographer. We propose to communicate photography suggestions to the user via reference images that are selected from a curated gallery. We leverage a vision-language model (VLM) and an object detector to characterize the reference images via textual descriptions, and then use a large language model (LLM) to retrieve relevant reference images based on a user's language query through text-based reasoning. To establish correspondences between the reference image and the observed scene, we exploit pretrained features from a vision transformer capable of capturing semantic similarity across marked appearance variations. Using these features, we compute suggested pose adjustments for an RGB-D camera by solving a perspective-n-point (PnP) problem. We demonstrate our approach using a manipulator equipped with a wrist camera. Our user studies show that photos taken by PhotoBot are often more aesthetically pleasing than those taken by users themselves, as measured by human feedback. We also show that PhotoBot can generalize to other reference sources such as paintings.