VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora

📅 2026-05-27

🤖 AI Summary

Current evaluations of travel-planning agents are largely confined to structured API interactions and fail to capture critical challenges of open-web environments, such as information noise, conflicting facts across sources, and the interplay between multimodal perception and logical reasoning. This work proposes the first verifiable evaluation benchmark tailored to unstructured, multimodal web corpora. It introduces a Multimodal Retrieval Base (MRB) and a Verifiable Knowledge Base (VKB), along with a cell-wise (unit-level) fact-checking mechanism that distinguishes systematic reasoning errors from model hallucinations and exposes the cognitive trade-off between retrieval and reasoning. Evaluations on leading multimodal large language models (MLLMs) show that autonomous retrieval significantly impairs instruction following, highlighting the insufficient robustness and reliability of current agents in open-world settings.

📝 Abstract

Existing benchmarks have laid the foundation for travel planning agents by establishing API-centric paradigms. However, as the capabilities of autonomous agents continue to advance, their evaluation must evolve beyond simple tool execution toward handling the inherent complexities of the open web. Current benchmarks bypass core cognitive hurdles: they fail to account for information noise, ignore multi-source factual contradictions, and overlook the necessity of grounding visual perception into logical planning. We introduce VeriTrip, a verifiable benchmark designed to meet the increasing demands for agent robustness and reliability. VeriTrip shifts the evaluation focus to evidence-grounded reasoning over unstructured multimodal web corpora. It establishes a Multimodal Retrieval Base (MRB) derived from real-world sources, forcing agents to autonomously orchestrate queries across heterogeneous data. A synchronized Verifiable Knowledge Base (VKB) enables a cell-wise verification protocol that precisely quantifies factual reliability, distinguishing systematic reasoning failures from parametric hallucinations. Our evaluations across leading MLLMs reveal a critical retrieval-reasoning trade-off: the cognitive load of autonomous retrieval significantly erodes instruction retention. VeriTrip provides the rigorous foundation necessary for the next generation of planning agents capable of operating in unconstrained, multimodal environments.
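
The cell-wise verification idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not specify the VKB schema, the entity names, or the rule that separates the two error types. Here a plan cell is treated as a hallucination when its entity does not appear in the VKB at all, and as a reasoning error when the entity is known but the stated value is not the verified one.

```python
from collections import Counter

# Hypothetical VKB: (entity, attribute) -> verified value.
# The schema and all entries are invented for this sketch.
VKB = {
    ("Hotel Aurora", "price_per_night"): 120,
    ("Hotel Aurora", "city"): "Lisbon",
    ("Museu Azul", "opening_hour"): "10:00",
}

LABELS = ("verified", "reasoning_error", "hallucination")


def classify_cell(entity, attribute, value, vkb):
    """Label one plan cell against the knowledge base.

    Assumed rule (not taken from the paper):
      - entity unknown to the VKB        -> "hallucination"
      - entity known, value matches      -> "verified"
      - entity known, value not verified -> "reasoning_error"
    """
    known_entities = {e for e, _ in vkb}
    if entity not in known_entities:
        return "hallucination"
    if (entity, attribute) in vkb and vkb[(entity, attribute)] == value:
        return "verified"
    return "reasoning_error"


def verify_plan(cells, vkb):
    """Return the fraction of cells falling under each label."""
    if not cells:
        return {label: 0.0 for label in LABELS}
    counts = Counter(classify_cell(e, a, v, vkb) for e, a, v in cells)
    return {label: counts[label] / len(cells) for label in LABELS}


# A toy plan flattened into (entity, attribute, value) cells.
plan = [
    ("Hotel Aurora", "price_per_night", 120),    # matches the VKB
    ("Hotel Aurora", "city", "Porto"),           # known entity, wrong value
    ("Cafe Imaginario", "opening_hour", "08:00"),  # entity absent from VKB
    ("Museu Azul", "opening_hour", "10:00"),     # matches the VKB
]
report = verify_plan(plan, VKB)
print(report)
```

Scoring each cell independently, rather than the plan as a whole, is what lets a report of this kind separate how often an agent misuses evidence it had from how often it states facts with no support in the corpus.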

Problem

Research questions and friction points this paper is trying to address.

travel planning agents

unstructured web corpora

multimodal reasoning

factual reliability

autonomous retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

verifiable benchmark

multimodal retrieval

autonomous agents