🤖 AI Summary
This work addresses the challenge of translating high-level linguistic instructions into camera control for embodied agents in a manner that satisfies both spatial plausibility and aesthetic quality. The authors propose a novel approach that integrates chain-of-thought reasoning with a differentiable internal world model. Specifically, a large multimodal language model translates subjective aesthetic goals into geometric constraints, which are then handled by an analytical solver and refined by a vision-based reflex mechanism built upon 3D Gaussian Splatting (3DGS), enabling iterative optimization without physical trial-and-error. By coupling mental simulation with chain-of-thought reasoning, the method achieves state-of-the-art performance in both spatial understanding and image aesthetics, while demonstrating rapid convergence and high-fidelity visual output.
📝 Abstract
Embodied agents for creative tasks like photography must bridge the semantic gap between high-level language commands and geometric control. We introduce PhotoAgent, an agent that achieves this by integrating Large Multimodal Model (LMM) reasoning with a novel control paradigm. PhotoAgent first translates subjective aesthetic goals into solvable geometric constraints via LMM-driven chain-of-thought (CoT) reasoning, allowing an analytical solver to compute a high-quality initial viewpoint. This initial pose is then iteratively refined through visual reflection within a photorealistic internal world model built with 3D Gaussian Splatting (3DGS). This "mental simulation" replaces costly and slow physical trial-and-error, enabling rapid convergence to aesthetically superior results. Evaluations confirm that PhotoAgent excels in spatial reasoning and achieves superior final image quality.
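To make the control paradigm concrete, the abstract's propose-then-refine loop can be sketched in miniature. This is a toy illustration only: the real PhotoAgent uses an LMM to derive geometric constraints, an analytical solver for the initial viewpoint, and a 3DGS renderer for visual reflection. Here those components are replaced with hypothetical stand-ins (`solve_initial_pose`, `render_and_score`, `refine`), and the "mental simulation" is reduced to scoring candidate poses against a toy target via hill climbing.

```python
import random

# Toy "constraint" standing in for the LMM's geometric output:
# an ideal camera position in 3D.
TARGET = (1.0, 2.0, 0.5)

def solve_initial_pose(constraints):
    """Stand-in for the analytical solver: returns a coarse initial
    pose near, but deliberately offset from, the constraint."""
    x, y, z = constraints
    return (x + 0.8, y - 0.6, z + 0.4)

def render_and_score(pose):
    """Stand-in for rendering in the internal world model and
    critiquing the shot; higher is better (negative squared
    distance to the ideal pose)."""
    return -sum((p - t) ** 2 for p, t in zip(pose, TARGET))

def refine(pose, steps=200, step_size=0.1, seed=0):
    """Iterative visual reflection as simple hill climbing: propose
    a small camera perturbation, keep it only if the simulated
    shot scores better. No physical trial-and-error is needed."""
    rng = random.Random(seed)
    best, best_score = pose, render_and_score(pose)
    for _ in range(steps):
        cand = tuple(p + rng.uniform(-step_size, step_size) for p in best)
        cand_score = render_and_score(cand)
        if cand_score > best_score:
            best, best_score = cand, cand_score
    return best, best_score

pose0 = solve_initial_pose(TARGET)
pose, score = refine(pose0)
print(score > render_and_score(pose0))  # refinement improves the simulated shot
```

The design mirrors the paper's two-stage structure: a cheap analytical guess gives a strong starting point, so the (in reality, expensive) rendering-and-critique loop only needs a few iterations to converge.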