🤖 AI Summary
In architectural design, conventional text-based retrieval fails to capture visual semantics and complex spatial relationships, resulting in inefficient and inaccurate case retrieval. This paper introduces the first fine-grained vision-language cross-modal retrieval framework tailored for architecture, integrating a multi-scale vision-language model (an enhanced CLIP variant), cross-modal embedding alignment, query refinement, and interactive feedback learning. The framework supports dual-modality queries (text or image) and delivers interpretable design inspiration recommendations. Its key innovations include modeling design intent within cross-modal alignment and enabling user-driven iterative optimization. Evaluated with professional architects, the method reduces average retrieval time by 62% and achieves an 89.3% Top-5 retrieval accuracy—significantly improving both efficiency and relevance in architectural case acquisition.
📝 Abstract
Efficiently searching for relevant case studies is critical in architectural design, as designers rely on precedent examples to guide or inspire their ongoing projects. However, traditional text-based search tools struggle to capture the inherently visual and complex nature of architectural knowledge, often leading to time-consuming and imprecise exploration. This paper introduces ArchSeek, an innovative case study search system with recommendation capability, tailored for architecture design professionals. Powered by the visual understanding capabilities from vision-language models and cross-modal embeddings, it enables text and image queries with fine-grained control, and interaction-based design case recommendations. It offers architects a more efficient, personalized way to discover design inspirations, with potential applications across other visually driven design fields. The source code is available at https://github.com/danruili/ArchSeek.