s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

πŸ“… 2026-03-15
πŸ€– AI Summary
Current large language models’ performance on competition mathematics poorly reflects their capacity to formally verify real-world, industrial-grade low-level code, such as cryptographic assembly. To address this gap, this work introduces the first machine-checkable proof synthesis benchmark targeting industrial-scale low-level code, built upon AWS’s s2n-bignum cryptographic assembly library, which has already been formally verified in HOL Light. The benchmark requires models to generate proof scripts that adhere to formal specifications and are accepted by HOL Light within a fixed time budget. By focusing specifically on cryptographic assembly from a deployed system, this benchmark fills a critical void in evaluating LLMs on practical formal verification tasks, offering a reproducible, challenging, and realistically grounded evaluation platform.

πŸ“ Abstract
Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS to provide fast assembly routines for cryptography, and its correctness is established by formal verification, a significant achievement of the Automated Reasoning Group. The verification comprised two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition holds. In the case of s2n-bignum, both tasks were carried out by human experts. In \textit{s2n-bignum-bench}, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, \textit{s2n-bignum-bench} is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available here: \href{https://github.com/kings-crown/s2n-bignum-bench}{s2n-bignum-bench}.
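The evaluation protocol described above — a candidate proof script is accepted only if the checker exits successfully within a fixed time budget — can be sketched as follows. This is an illustrative Python sketch, not the benchmark's actual harness; the function name and the three-way verdict are assumptions for exposition.

```python
# Hedged sketch of the benchmark's check loop: run a proof checker as a
# subprocess and classify the outcome under a fixed timeout (names illustrative).
import subprocess

def check_proof(cmd, timeout_s):
    """Run a proof-checker command `cmd` (a list of argv strings).

    Returns "accepted" if the checker exits with code 0 within the
    budget, "rejected" on a nonzero exit, and "timeout" if the fixed
    proof-check time budget is exceeded.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "accepted" if result.returncode == 0 else "rejected"
```

A real invocation would point `cmd` at a HOL Light session loading the generated proof script (the exact command line depends on the benchmark setup in the repository); the timeout corresponds to the paper's fixed proof-check budget.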
Problem

Research questions and friction points this paper is trying to address.

formal verification
low-level code
LLM reasoning
cryptographic assembly
proof synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

s2n-bignum-bench
large language models
formal verification
HOL Light
low-level code reasoning