🤖 AI Summary
This study investigates the pronounced performance degradation of large language models (LLMs) when processing multi-instance inputs, a phenomenon that intensifies as the instance count grows—termed “performance collapse.” Through systematic evaluation of mainstream LLMs under controlled variations of context length and instance count, the work identifies instance count as the primary driver of performance decline, exerting a stronger effect than context length alone. Empirical results show that performance degrades mildly at roughly 20–100 instances and then deteriorates sharply at larger scales. These findings provide empirical grounding for understanding multi-instance reasoning and for guiding future model optimisation strategies.
📝 Abstract
Users often rely on Large Language Models (LLMs) to process multiple documents or to perform analysis over many instances at once. For example, analysing the overall sentiment of a set of movie reviews requires an LLM to assess the sentiment of each review individually before producing a final aggregated answer. While LLM performance on such individual tasks is generally high, there has been little research on how LLMs perform on multi-instance inputs. In this paper, we present a comprehensive evaluation of the multi-instance processing (MIP) ability of LLMs on tasks at which they excel individually. The results show that all LLMs follow the same pattern: slight performance degradation at small instance counts (approximately 20–100), followed by a performance collapse at larger counts. Crucially, our analysis shows that while context length is associated with this degradation, the number of instances has a stronger effect on the final results. This finding suggests that when optimising LLM performance for MIP, attention should be paid to both context length and, in particular, instance count.
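The aggregated-sentiment example can be made concrete. The sketch below is a minimal illustration, not the paper's actual evaluation code: the helper name `build_mip_prompt` and the prompt template are hypothetical, since the abstract does not specify the exact prompt format. It shows how packing N instances into a single prompt entangles instance count with context length, which is precisely the confound the study controls for.

```python
def build_mip_prompt(reviews: list[str]) -> str:
    """Pack N review instances into one multi-instance prompt (hypothetical format)."""
    instruction = (
        "Classify the sentiment of each review as positive or negative, "
        "then report the overall majority sentiment."
    )
    lines = [instruction, ""]
    # Number each instance so the model can reference them individually.
    for i, review in enumerate(reviews, start=1):
        lines.append(f"Review {i}: {review}")
    lines.append("")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    small = build_mip_prompt(["Great film!"] * 20)
    large = build_mip_prompt(["Great film!"] * 200)
    # Context length grows roughly linearly with instance count,
    # so the two variables must be varied independently to separate their effects.
    print(len(small), len(large))
```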