🤖 AI Summary
This work proposes an AI-augmented scientific framework for learning-to-rank research, addressing the field’s heavy reliance on manual effort and lack of automated discovery pipelines. To our knowledge, it is the first in the ranking community to combine single-LLM agents for routine tasks with a multi-LLM consensus mechanism (GPT-5.2, Gemini Pro 3, and Claude Opus 4.5) for creative ideation and result interpretation. Coupled with cloud-based GPU training orchestration, our system automates the pipeline from hypothesis generation through code implementation to model training, with an expert in the loop. The framework discovers a novel sequential feature processing architecture that yields substantial gains over baseline methods on offline metrics, reaching performance comparable to human expert designs while substantially reducing repetitive research labor.
📝 Abstract
Recent advances in AI agents for software engineering and scientific discovery have demonstrated remarkable capabilities, yet their application to developing novel ranking models in commercial search engines remains unexplored. In this paper, we present an AI Co-Scientist framework that automates the full search ranking research pipeline, from idea generation through code implementation to GPU training job scheduling, with an expert in the loop. Our approach uses single-LLM agents for routine tasks and reserves multi-LLM consensus agents (GPT-5.2, Gemini Pro 3, and Claude Opus 4.5) for challenging phases such as results analysis and idea generation. To our knowledge, this is the first study in the ranking community to use an AI Co-Scientist framework for algorithmic research. We demonstrate that this framework discovered a novel technique for handling sequence features, with all model enhancements produced automatically, yielding substantial offline performance improvements. Our findings suggest that AI systems can discover ranking architectures comparable to those developed by human experts while significantly reducing routine research workloads.
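The split between single-LLM agents and multi-LLM consensus agents described above can be illustrated with a minimal sketch. The abstract does not specify how consensus is reached, so the majority vote used here is an assumption, as are the phase labels and the names `Model`, `consensus`, and `run_phase`. Models are represented as plain callables rather than real API clients.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical type: a "model" is any callable mapping a prompt to a text answer.
Model = Callable[[str], str]

# The abstract names idea generation and results analysis as the challenging
# phases; the string labels themselves are illustrative.
CONSENSUS_PHASES = {"idea_generation", "results_analysis"}


def consensus(models: Sequence[Model], prompt: str) -> str:
    """Assumed consensus rule: majority vote over the models' answers.

    Ties are broken in favor of the answer that appears first in model order.
    """
    answers = [model(prompt) for model in models]
    counts = Counter(answers)
    best = max(counts.values())
    return next(a for a in answers if counts[a] == best)


def run_phase(phase: str, prompt: str, single: Model, panel: Sequence[Model]) -> str:
    """Send challenging phases to the panel and routine phases to one model."""
    if phase in CONSENSUS_PHASES:
        return consensus(panel, prompt)
    return single(prompt)
```

In this sketch, a call such as `run_phase("idea_generation", prompt, single, panel)` queries every model in `panel`, while any other phase label falls through to the single model. A real system would likely compare free-form proposals by critique or ranking rather than exact string match.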