🤖 AI Summary
PayPal's Commerce Agent has a critical performance bottleneck: the retrieval component accounts for more than 50% of end-to-end response latency. To address it, this work builds a production-grade multi-agent system for e-commerce and uses the NVIDIA NeMo Framework to fine-tune a small language model (SLM) for the Search and Discovery agent. The model, llama3.1-nemotron-nano-8B-v1, is adapted with LoRA through hyperparameter sweeps over learning rates, optimizers (Adam, AdamW), cosine annealing schedules, and LoRA ranks. The authors report significant improvements in latency and cost while maintaining or improving agent quality. Key contributions include: (1) the first application of the NeMo Framework to commerce-specific agent optimization; (2) a fine-tuning strategy for retrieval-focused commerce tasks; and (3) a scalable framework for optimizing multi-agent systems in production e-commerce environments.
📝 Abstract
We present the development and optimization of PayPal's Commerce Agent, powered by NEMO-4-PAYPAL, a multi-agent system designed to revolutionize agentic commerce on the PayPal platform. Through our strategic partnership with NVIDIA, we used the NeMo Framework for LLM fine-tuning to enhance agent performance. Specifically, we optimized the Search and Discovery agent by replacing our base model with a fine-tuned Nemotron small language model (SLM).
We conducted comprehensive experiments using the llama3.1-nemotron-nano-8B-v1 architecture, training LoRA-based models through systematic hyperparameter sweeps across learning rates, optimizers (Adam, AdamW), cosine annealing schedules, and LoRA ranks. Our contributions include: (1) the first application of NVIDIA's NeMo Framework to commerce-specific agent optimization, (2) an LLM-powered fine-tuning strategy for retrieval-focused commerce tasks, (3) a demonstration of significant improvements in latency and cost while maintaining agent quality, and (4) a scalable framework for multi-agent system optimization in production e-commerce environments. Our results demonstrate that the fine-tuned Nemotron SLM effectively resolves the key performance issue in the retrieval component, which accounts for over 50% of total agent response time, while maintaining or enhancing overall system performance.
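As a rough illustration of what a sweep of this kind involves, the sketch below enumerates a grid over learning rates, optimizers, and LoRA ranks, and computes a standard cosine-annealed learning rate. It is a minimal sketch under stated assumptions, not the authors' configuration: the grid values, the function names `build_sweep` and `cosine_annealing_lr`, and the step count are all hypothetical, and the actual NeMo Framework training calls are not shown.

```python
import itertools
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Standard cosine annealing (no warmup): decays from lr_max to lr_min."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def build_sweep(learning_rates, optimizers, lora_ranks):
    """Return one config dict per combination in the hyperparameter grid."""
    return [
        {"lr": lr, "optimizer": opt, "lora_rank": rank}
        for lr, opt, rank in itertools.product(learning_rates, optimizers, lora_ranks)
    ]


# Hypothetical grid values, chosen only for illustration.
LEARNING_RATES = [1e-5, 5e-5, 1e-4]
OPTIMIZERS = ["adam", "adamw"]  # the two optimizers named in the abstract
LORA_RANKS = [8, 16, 32]
TOTAL_STEPS = 1000  # hypothetical training length

configs = build_sweep(LEARNING_RATES, OPTIMIZERS, LORA_RANKS)

for cfg in configs:
    # Peak, midpoint, and final learning rate for this run's schedule.
    schedule = [
        cosine_annealing_lr(s, TOTAL_STEPS, cfg["lr"])
        for s in (0, TOTAL_STEPS // 2, TOTAL_STEPS)
    ]
    print(cfg, [f"{lr:.2e}" for lr in schedule])
```

Each config would be handed to a separate fine-tuning run, and the resulting models compared on retrieval quality, latency, and cost. A full grid grows multiplicatively with each added dimension, so in practice the ranges are usually kept small.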