🤖 AI Summary
Traditional travel search engines rely on structured inputs and struggle to support hotel retrieval via natural language queries.
Method: This paper proposes a domain-specific multimodal dense retrieval model for travel. It introduces (1) a novel three-task joint optimization framework tailored to the travel domain; (2) an asymmetric cross-modal retrieval architecture that synergizes a small language model (for online query encoding) with a large language model (for offline hotel representation); and (3) full-gallery-level image feature aggregation and alignment.
Results: Evaluated on four benchmark datasets, the model achieves Recall@10 of 0.681 on the primary query type, a 12.9% relative improvement over the strongest baseline, MARVEL (0.603), and it also significantly surpasses other baselines such as VISTA.
📝 Abstract
We present HotelMatch-LLM, a multimodal dense retrieval model for the travel domain that enables natural language property search, addressing the limitations of traditional travel search engines, which require users to start with a destination and then edit search parameters. HotelMatch-LLM features three key innovations: (1) Domain-specific multi-task optimization with three novel retrieval, visual, and language modeling objectives; (2) Asymmetrical dense retrieval architecture combining a small language model (SLM) for efficient online query processing and a large language model (LLM) for embedding hotel data; and (3) Extensive image processing to handle all property image galleries. Experiments on four diverse test sets show HotelMatch-LLM significantly outperforms state-of-the-art models, including VISTA and MARVEL. Specifically, on the main query type of the test set, HotelMatch-LLM achieves 0.681 compared to 0.603 for the most effective baseline, MARVEL. Our analysis highlights the impact of our multi-task optimization, the generalizability of HotelMatch-LLM across LLM architectures, and its scalability for processing large image galleries.
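The asymmetric design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words `_embed` function is a hypothetical stand-in for both neural encoders, the `VOCAB` list, hotel names, and image tags are invented for illustration, and mean-pooling the per-image vectors is an assumed aggregation. The sketch only shows the control flow: hotel embeddings (text plus the full image gallery) are computed once offline, while each query is encoded online and scored by dot product in a shared space.

```python
import math

# Hypothetical toy vocabulary defining the shared embedding space.
VOCAB = ["beach", "pool", "ski", "mountain", "spa", "family", "quiet", "city"]


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def _embed(tokens):
    """Toy stand-in for a neural encoder: normalized bag-of-words counts."""
    return _normalize([float(tokens.count(word)) for word in VOCAB])


def encode_query(text):
    """Online path; stands in for the small language model (SLM)."""
    return _embed(text.lower().split())


def encode_hotel(description, image_tags):
    """Offline path; stands in for the LLM plus gallery image features.

    image_tags holds one tag list per gallery image. Per-image vectors are
    mean-pooled (an assumption) and added to the text vector.
    """
    text_vec = _embed(description.lower().split())
    image_vecs = [_embed(tags) for tags in image_tags]
    if image_vecs:
        pooled = [sum(col) / len(image_vecs) for col in zip(*image_vecs)]
    else:
        pooled = [0.0] * len(VOCAB)
    return _normalize([t + p for t, p in zip(text_vec, pooled)])


def build_index(hotels):
    """Precompute every hotel embedding once, before any query arrives."""
    return {name: encode_hotel(desc, imgs) for name, (desc, imgs) in hotels.items()}


def search(query, index, k=2):
    """Encode the query online and rank hotels by dot-product similarity."""
    q = encode_query(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), name) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Invented example data: (description, one tag list per gallery image).
HOTELS = {
    "Seaside Inn": ("quiet beach hotel with pool", [["beach"], ["pool", "spa"]]),
    "Alpine Lodge": ("mountain lodge near ski lifts", [["ski", "mountain"]]),
    "Metro Stay": ("city hotel", [["city"]]),
}
INDEX = build_index(HOTELS)
```

Calling `search("family beach pool", INDEX, k=1)` touches only the lightweight query encoder at request time; the more expensive hotel side is never recomputed, which is the motivation for pairing a small online model with a large offline one.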