Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks

📅 2024-04-06
🏛️ International Conference on Learning Representations
📈 Citations: 7
✨ Influential: 1
🤖 AI Summary
This work investigates evolutionary relationships among large language models (LLMs) and their performance predictability. To address this, we propose PhyloLMโ€”a novel framework that adapts phylogenetic inference from population genetics to LLM analysis. Without requiring access to training data, architectures, or optimization details, PhyloLM computes semantic similarity between model outputs on standardized prompts and constructs phylogenetic trees using neighbor-joining or UPGMA algorithms. Applied to 156 open- and closed-source LLMs, it successfully reconstructs evolutionary lineages for 111 open-source and 45 closed-source models, accurately recovering known derivation hierarchies. Crucially, phylogenetic distances exhibit strong correlation with benchmark performance (e.g., MMLU, BBH), achieving an average Pearson correlation coefficient of 0.72โ€”enabling low-cost cross-model capability estimation. Our core contribution is the introduction of phylogenetic modeling as a new paradigm for LLM analysis, enabling training-agnostic relationship inference and performance prediction.
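The pipeline described above (pairwise output similarity on shared prompts, followed by agglomerative tree building) can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the model names and outputs are made up, and token-set Jaccard distance stands in for PhyloLM's actual similarity metric over model completions; the UPGMA step, however, follows the standard size-weighted averaging rule the summary names.

```python
from itertools import combinations

# Hypothetical outputs from four models on the same three prompts
# (toy stand-ins; PhyloLM's real metric compares completions sampled
# on standardized prompts).
outputs = {
    "llama-base": ["the cat sat", "paris is the capital", "water boils at 100"],
    "llama-chat": ["the cat sat down", "paris is the capital", "water boils at 100 c"],
    "mistral":    ["a cat rested", "the capital is paris", "boiling point is 100"],
    "gpt-like":   ["feline seated", "capital city paris france", "100 degrees boiling"],
}

def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over token sets (a simple similarity stand-in)."""
    sa, sb = set(a.split()), set(b.split())
    return 1.0 - len(sa & sb) / len(sa | sb)

def model_distance(m1: str, m2: str) -> float:
    """Mean per-prompt output dissimilarity between two models."""
    pairs = zip(outputs[m1], outputs[m2])
    return sum(jaccard_distance(x, y) for x, y in pairs) / len(outputs[m1])

def upgma(names: list[str]) -> list[tuple[str, str, float]]:
    """UPGMA: repeatedly merge the two closest clusters, updating
    distances as size-weighted averages; returns the merge order."""
    sizes = {n: 1 for n in names}
    dist = {frozenset(p): model_distance(*p) for p in combinations(names, 2)}
    merges = []
    while len(sizes) > 1:
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        merges.append((a, b, dist.pop(pair)))
        na, nb = sizes.pop(a), sizes.pop(b)
        new = f"({a},{b})"
        for c in list(sizes):
            da = dist.pop(frozenset({a, c}))
            db = dist.pop(frozenset({b, c}))
            dist[frozenset({new, c})] = (na * da + nb * db) / (na + nb)
        sizes[new] = na + nb
    return merges

merges = upgma(list(outputs))
# The two "llama" variants, whose outputs differ least, merge first,
# recovering the (fictional) fine-tuning lineage.
```

The same distance matrix could instead be fed to a neighbor-joining implementation; the summary notes PhyloLM supports both tree-building algorithms.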

๐Ÿ“ Abstract
This paper introduces PhyloLM, a method adapting phylogenetic algorithms to Large Language Models (LLMs) to explore whether and how they relate to each other and to predict their performance characteristics. Our method calculates a phylogenetic distance metric based on the similarity of LLMs' outputs. The resulting metric is then used to construct dendrograms, which satisfactorily capture known relationships across a set of 111 open-source and 45 closed-source models. Furthermore, our phylogenetic distance predicts performance in standard benchmarks, thus demonstrating its functional validity and paving the way for a time- and cost-effective estimation of LLM capabilities. To sum up, by translating population genetic concepts to machine learning, we propose and validate a tool to evaluate LLM development, relationships and capabilities, even in the absence of transparent training information.
Problem

Research questions and friction points this paper is trying to address.

Inferring evolutionary relationships among Large Language Models
Predicting LLM performance across standard benchmarks
Developing phylogenetic methods for LLM capability evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts phylogenetic algorithms to analyze LLMs
Calculates distance metric from model output similarity
Predicts benchmark performance using phylogenetic relationships
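The last point above, predicting benchmark scores from phylogenetic relationships, can be illustrated with a minimal sketch. The paper reports a strong correlation (average Pearson r of 0.72) between phylogenetic distance and benchmark performance; how scores are actually estimated from distances is not specified here, so the inverse-distance-weighted average below, along with all model names and numbers, is a hypothetical stand-in for the idea that close relatives on the tree are most informative.

```python
# Hypothetical phylogenetic distances from a new, unevaluated model to
# three already-benchmarked models, with made-up benchmark scores.
distances = {"model-a": 0.1, "model-b": 0.3, "model-c": 0.8}
scores    = {"model-a": 70.0, "model-b": 64.0, "model-c": 40.0}

def predict_score(distances: dict, scores: dict, eps: float = 1e-6) -> float:
    """Inverse-distance-weighted average: close relatives on the
    phylogenetic tree contribute most to the estimate."""
    weights = {m: 1.0 / (d + eps) for m, d in distances.items()}
    total = sum(weights.values())
    return sum(weights[m] * scores[m] for m in weights) / total

estimate = predict_score(distances, scores)
# The estimate lands nearest the score of the closest relative, model-a.
```

This captures the cost argument from the abstract: once distances are computed from cheap output comparisons, no new benchmark runs are needed to get a first capability estimate.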