🤖 AI Summary
This work proposes the first modular multimodal large language model (LLM) agent framework tailored for urban change analysis. Existing approaches rely on single-modality inputs and rigid pipelines, and therefore struggle to integrate heterogeneous multisource data. The framework introduces a modality controller that performs dynamic intra- and cross-modal alignment, flexibly incorporating remote sensing imagery, nighttime light data, and textual information while mitigating LLM hallucinations. Evaluated on real-world urban case studies, the approach achieves a 46.7% improvement in task success rate over the strongest baseline, significantly enhancing semantic understanding, reasoning capabilities, and policy relevance. The case studies uncover complex urban dynamics, including green space transformation in New York, water pollution diffusion in Hong Kong, and landfill evolution in Shenzhen.