🤖 AI Summary
The rapid proliferation of large language models (LLMs) has outpaced systematic tracking of their evolutionary relationships (e.g., via fine-tuning, distillation, or adaptation), resulting in unrecorded, non-reproducible lineages. Existing analysis methods suffer from task specificity, reliance on fixed model sets, or assumptions about architecture or tokenization. Method: We propose LLM DNA, a low-dimensional, bi-Lipschitz representation of functional behavior that defines the first heritable, genetically deterministic behavioral fingerprint for LLMs, agnostic to architecture and tokenizer. Our training-free DNA extraction pipeline, combined with phylogenetic inference, constructs the first comprehensive LLM phylogeny, spanning 305 models. Contribution/Results: Experiments demonstrate that LLM DNA accurately recovers known lineage relationships, discovers previously undocumented evolutionary paths, and quantitatively captures architectural shifts, temporal trends, and divergent family-level evolutionary rates.
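The summary does not spell out how a training-free behavioral fingerprint might be extracted, so here is a minimal, purely illustrative sketch: a model's "behavior" is taken to be its softmax outputs on a fixed probe set, and the "DNA" is a shared random linear projection of that behavior (a bi-Lipschitz map on a bounded set, in the spirit of Johnson-Lindenstrauss). The toy models, probe construction, and function names (`behavior`, `extract_dna`) are all assumptions, not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def behavior(weights, probes):
    """Concatenate a toy linear model's output distributions on fixed probes."""
    return np.concatenate([softmax(p @ weights) for p in probes])

def extract_dna(behaviors, dim=8):
    """Project behavior vectors through one shared random map, so pairwise
    distances between models are approximately preserved in low dimension."""
    proj = rng.normal(size=(behaviors.shape[1], dim)) / np.sqrt(dim)
    return behaviors @ proj

probes = [rng.normal(size=16) for _ in range(10)]
base = rng.normal(size=(16, 50))                      # a toy "base model"
finetune = base + 0.05 * rng.normal(size=base.shape)  # lightly perturbed child
unrelated = rng.normal(size=(16, 50))                 # an independent model

B = np.stack([behavior(w, probes) for w in (base, finetune, unrelated)])
dna = extract_dna(B)
d_child = np.linalg.norm(dna[0] - dna[1])  # base vs. its fine-tune: small
d_other = np.linalg.norm(dna[0] - dna[2])  # base vs. unrelated model: large
```

The key property this toy preserves is the one the summary claims: a fine-tuned descendant stays measurably closer to its parent in DNA space than an unrelated model does, and the extraction requires no training, only forward passes on shared probes.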
📝 Abstract
The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented or unclear, complicating LLM management. Existing methods are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining LLM DNA as a low-dimensional, bi-Lipschitz representation of functional behavior. We prove that LLM DNA satisfies inheritance and genetic determinism properties and establish the existence of such DNA. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA aligns with prior studies on limited subsets and achieves superior or competitive performance on specific tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct the evolutionary tree of LLMs using phylogenetic algorithms; the resulting tree aligns with the shift from encoder-decoder to decoder-only architectures, reflects temporal progression, and reveals distinct evolutionary speeds across LLM families.
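The abstract mentions "phylogenetic algorithms" without naming one, so the sketch below uses average-linkage (UPGMA-style) agglomeration over pairwise DNA distances as a stand-in. The DNA vectors are fabricated to encode a known toy lineage (ancestor → child-a → grandchild, plus a sibling child-b); nothing here reflects the paper's actual data or algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

def upgma_merges(X, names):
    """Greedily merge the closest pair of clusters (average linkage) and
    return the merge order -- a crude proxy for a phylogenetic tree."""
    clusters = {n: [i] for i, n in enumerate(names)}
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best, pair = np.inf, None
        for a in range(len(keys)):
            for b in range(a + 1, len(keys)):
                d = D[np.ix_(clusters[keys[a]], clusters[keys[b]])].mean()
                if d < best:
                    best, pair = d, (keys[a], keys[b])
        merges.append((*pair, best))
        clusters["+".join(pair)] = clusters.pop(pair[0]) + clusters.pop(pair[1])
    return merges

# Fabricated DNA vectors encoding a known lineage for illustration.
ancestor = rng.normal(size=8)
child_a = ancestor + 0.1 * rng.normal(size=8)
child_b = ancestor + 0.1 * rng.normal(size=8)
grandchild = child_a + 0.02 * rng.normal(size=8)

names = ["ancestor", "child-a", "child-b", "grandchild"]
X = np.stack([ancestor, child_a, child_b, grandchild])
merges = upgma_merges(X, names)  # earliest merges join the closest relatives
```

Because the perturbation from child-a to grandchild is the smallest, the first merge joins that pair, mirroring how a tree built from DNA distances can recover the closest lineage relationships first.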