Benchmarking Mythos-Linked Bug Rediscovery

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

career value

178K/year

🤖 AI Summary

This study systematically evaluates whether large language models can rediscover six real-world, system-level vulnerabilities disclosed by the Anthropic Mythos project using only read-only source code, with critical metadata such as CVE identifiers explicitly withheld. Experiments were conducted with GPT-5.5 xhigh, Claude Opus 4.7, and Kimi K2 under a unified matching scoring protocol and replicated trial design, targeting file-level vulnerability rediscovery. Results show that GPT-5.5 xhigh successfully reproduced five instances (spanning two distinct tasks), Claude Opus 4.7 succeeded once, and Kimi K2 achieved no successes. The primary cause of failure across models was premature fixation on plausible but incorrect code locations. This work represents the first systematic assessment of large language models’ precise vulnerability localization capabilities in complex, real-world scenarios without reliance on key prompting cues.

📝 Abstract

Anthropic's April 2026 Mythos materials combine benchmark claims with concrete bug-finding stories across OpenBSD, FreeBSD, Linux, FFmpeg, and browsers. This paper reports a controlled target-file rediscovery experiment on six public or high-confidence Mythos-linked systems tasks. Each model receives the same target file or files, read-only source tools, three repeats per task, and one manual target-matching rubric; prompts omit CVE identifiers, patch hashes, advisory text, author names, disclosure dates, and answer key root cause language. The experiment contains 54 counted model-task attempts: three models, six tasks, and three repeats, giving 18 attempts per model. GPT-5.5 xhigh achieves 5/18 target rediscoveries, covering 2/6 tasks; counting one wrong-target mpegts.c finding separately gives 3/6 distinct core bugs. Claude Opus 4.7 achieves 1/18 target rediscoveries, covering 1/6 tasks. Kimi K2 records 0/18 target rediscoveries. The dominant failure mode is early commitment to plausible alternate candidates within the assigned file: models often submit source-grounded hypotheses while missing the specific invariant corrected by public Mythos patch evidence. These results do not refute Anthropic's undisclosed workflow, but show that under this favorable target-file scaffold, systems-specific prompting yields only six target matches across 54 counted attempts.

Problem

Research questions and friction points this paper is trying to address.

bug rediscovery

Mythos-linked systems

large language models

vulnerability detection

controlled benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

bug rediscovery

controlled benchmarking

large language models