🤖 AI Summary
E-commerce query rewriting (QR) is hard to evaluate: judging whether a rewrite preserves user intent is subjective, and reliable automatic metrics are lacking. To address this, the authors propose a dynamic evolutionary optimization framework driven by multiple LLM agents. A multi-agent system that simulates realistic shopping behavior produces interactive, fine-grained feedback, which replaces a static scoring model as the reward signal, and a genetic algorithm closes the loop between iterative query generation and evaluation. On 1,000 real-world e-commerce queries, the method improves over the original queries by an average of 21.98% and over a Best-of-N baseline by 3.36%. The core contribution is the combination of collaborative multi-agent feedback with evolutionary search, which enables end-to-end, intent-driven QR optimization.
📝 Abstract
Deploying capable and user-aligned LLM-based systems necessitates reliable evaluation. While LLMs excel in verifiable tasks like coding and mathematics, where gold-standard solutions are available, adoption remains challenging for subjective tasks that lack a single correct answer. E-commerce Query Rewriting (QR) is one such problem: it is extremely difficult to determine algorithmically whether a rewritten query properly captures the user's intent. In this work, we introduce OptAgent, a novel framework that combines multi-agent simulations with genetic algorithms to verify and optimize queries for QR. Instead of relying on a static reward model or a single LLM judge, our approach uses multiple LLM-based agents, each acting as a simulated shopping customer, as a dynamic reward signal. The average of these agent-derived scores serves as the fitness function for an evolutionary algorithm that iteratively refines the user's initial query. We evaluate OptAgent on a dataset of 1,000 real-world e-commerce queries across five categories and observe an average improvement of 21.98% over the original user query and 3.36% over a Best-of-N LLM rewriting baseline.
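The loop described in the abstract, agent scores averaged into a fitness function that drives an evolutionary search over rewrites, can be illustrated with a minimal sketch. This is not the OptAgent implementation. Everything here is an assumption made for illustration: the LLM shopper agents are replaced by a hypothetical keyword-coverage scorer (`make_agent`), the mutation and crossover operators act on whitespace tokens rather than calling an LLM, and the names `fitness`, `mutate`, `crossover`, and `evolve` as well as all parameter values are invented here.

```python
import random
from statistics import mean
from typing import Callable, List, Set

Agent = Callable[[str], float]


def make_agent(preferred_terms: Set[str]) -> Agent:
    """Hypothetical stand-in for one LLM shopper agent (assumption, not the paper's agent)."""
    def score(query: str) -> float:
        tokens = set(query.split())
        if not tokens:
            return 0.0
        # Share of this shopper's preferred terms that the query covers.
        coverage = len(tokens & preferred_terms) / len(preferred_terms)
        # Light penalty for queries longer than five distinct tokens.
        return coverage / (1 + 0.05 * max(0, len(tokens) - 5))
    return score


def fitness(query: str, agents: List[Agent]) -> float:
    """Fitness is the average of the agent-derived scores."""
    return mean([agent(query) for agent in agents])


def mutate(query: str, vocab: List[str], rng: random.Random) -> str:
    """Toy mutation: drop a random token or insert a random vocabulary term."""
    tokens = query.split()
    if tokens and rng.random() < 0.5:
        tokens.pop(rng.randrange(len(tokens)))
    else:
        tokens.insert(rng.randint(0, len(tokens)), rng.choice(vocab))
    return " ".join(tokens)


def crossover(a: str, b: str, rng: random.Random) -> str:
    """Toy crossover: prefix of one parent plus suffix of the other, de-duplicated."""
    ta, tb = a.split(), b.split()
    child = ta[: rng.randint(0, len(ta))] + tb[rng.randint(0, len(tb)):]
    return " ".join(dict.fromkeys(child))


def evolve(original: str, agents: List[Agent], vocab: List[str],
           generations: int = 10, pop_size: int = 8, seed: int = 0) -> str:
    """Iteratively refine the original query; the top half survives each generation."""
    rng = random.Random(seed)
    population = [original] + [mutate(original, vocab, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda q: fitness(q, agents), reverse=True)
        parents = ranked[: pop_size // 2]  # elitism: the best candidate is never lost
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), vocab, rng))
        population = parents + children
    return max(population, key=lambda q: fitness(q, agents))


# Example run with two simulated shoppers (all terms are made up).
agents = [make_agent({"waterproof", "hiking", "boots"}), make_agent({"boots", "mens"})]
vocab = ["waterproof", "hiking", "boots", "mens", "cheap", "red"]
best = evolve("boots", agents, vocab)
print(best, round(fitness(best, agents), 3))
```

Because the top half of each generation is carried over unchanged, the best fitness never decreases, so the returned rewrite scores at least as well as the original query under the simulated agents. In a real system the scorer and the variation operators would presumably be LLM calls, which this sketch does not attempt to reproduce.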