Lean proved this program was correct; then I found a bug.
lean formal_verification security fuzzing
AI agents are getting very good at finding vulnerabilities in large-scale software systems.
Anthropic was apparently so spooked by the vulnerability-discovery capabilities of Mythos that they decided not to release it, deeming it "too dangerous" (lol). Whether you believe the hype about these latest models or not, it seems undeniable that the writing is on the wall:
The cost of discovering security bugs is collapsing, and the vast majority of software running today was never built to withstand that kind of scrutiny. We are facing a looming software crisis.
In the face of this oncoming tsunami, there has recently been growing interest in formal verification as a solution. If we state and prove properties about our code using a mechanical tool, can we build robust, secure, and stable software that can withstand this barrage of attacks?
One recent development in the Lean ecosystem has taken a step towards answering this question: 10 agents autonomously built and proved an implementation of zlib, lean-zip, an impressive landmark result. Quoting Leo de Moura, the chief architect of the Lean FRO (here), with apologies for the AI slop (Leo has a penchant for it, it seems): the key result is that lean-zip is not just another implementation of zlib. It is an implementation that has been verified as correct end to end, guaranteed by Lean to be entirely free of implementation bugs.
What does "verified as correct" actually look like? Here is one of the main theorems (github):
theorem zlib_decompressSingle_compress (data : ByteArray) (level : UInt8)
    (hsize : data.size < 1024 * 1024 * 1024) :
    ZlibDecode.decompressSingle (ZlibEncode.compress data level) = .ok data
For any byte array smaller than 1 GiB, calling
ZlibDecode.decompressSingle on the output of ZlibEncode.compress
produces the original data. Decompression exactly inverts
compression. This pair of functions is entirely correct.
Or is it?
I pointed a Claude agent at lean-zip over a weekend, armed with AFL++,
AddressSanitizer, Valgrind, and UBSan. Across 105 million fuzzing
executions, it found:
- Zero memory vulnerabilities in the verified Lean application code.
- A heap buffer overflow in the Lean 4 runtime (lean_alloc_sarray), affecting every version of Lean to date (bug report, pending fix).
- A denial-of-service in lean-zip's archive parser, which was never verified.
The setup
The setup for the experiment was quite simple: I took the lean-zip codebase, produced a stripped-down version, and pointed Claude at it.
In particular, as part of the setup: (1) I dropped all theorems and
specifications, (2) removed all markdown documentation, and (3)
stripped out lean-zip's C FFI bindings to zlib, which the project
provides as an alternative to its native implementation. What remained
was purely the verified code: the native Lean definitions for DEFLATE,
gzip, ZIP archive handling, and tar. Any bug found in this code would
correspond to an error in the verified implementation.
The idea behind dropping the theorems and documentation was to avoid biasing the Claude agent by revealing that the code was actually verified: I figured that if it knew the code "had no bugs" it might pre-emptively give up, whereas operating blind might let it work through the software without preconceptions.
With the Lean implementation accessible through a CLI, I then spun up a server for the fuzzing experiments, pointed Claude at it, and let it go wild.
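For a flavour of what the fuzzers were driving, here is a minimal sketch of the kind of file-in, verdict-out entry point AFL++ can target. Only the decoder's name is taken from the theorem above; the argument handling and exit codes are my own illustrative scaffolding, not lean-zip's actual CLI:
def main (args : List String) : IO UInt32 := do
  match args with
  | [path] =>
    -- feed an untrusted input file to the decoder
    let data ← IO.FS.readBinFile path
    match ZlibDecode.decompressSingle data with
    | .ok out =>
      IO.println s!"ok: {out.size} bytes"
      return 0
    | .error _ =>
      IO.eprintln "decode error"
      return 1
  | _ =>
    IO.eprintln "usage: fuzz <file>"
    return 2
A crash or sanitizer report from a harness like this implicates either the Lean-generated code or the runtime beneath it.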
Results
Over the course of a night, Claude launched 16 parallel fuzzers across the 6 attack surfaces of the library: ZIP extract, gzip decompress, raw DEFLATE inflate, tar extract, tar.gz, and compression. It built separate binaries with AddressSanitizer and UndefinedBehaviorSanitizer, ran Valgrind memcheck, used cppcheck and flawfinder for static analysis, and crafted 48 hand-written exploit files targeting known zlib CVE patterns.
Overall, this amounted to 105,823,818 fuzzing executions from 359 seed files, with 16 fuzzers running for approximately 19 hours, uncovering 4 crashing inputs and 1 memory vulnerability in the code.
Bug 1: Heap buffer overflow in the Lean runtime
The most substantial finding was a heap buffer overflow: not in
lean-zip's code, but in the Lean runtime itself.
The vulnerable function is lean_alloc_sarray, which allocates all
scalar arrays (ByteArray, FloatArray, etc.) in Lean 4:
lean_obj_res lean_alloc_sarray(
    unsigned elem_size, size_t size, size_t capacity
) {
    lean_sarray_object * o =
        (lean_sarray_object*)lean_alloc_object(
            // header size + payload; the multiplication and addition
            // are unchecked and can wrap around size_t
            sizeof(lean_sarray_object) + elem_size * capacity
        );
    // ...
}
For a ByteArray of capacity n, the requested allocation size is 24 + n
bytes. When n is close to SIZE_MAX (2^64 - 1 on 64-bit systems), the
addition wraps around to a small number: the runtime allocates a tiny
buffer (23 bytes when n = SIZE_MAX), but the caller proceeds to read n
bytes into it.
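The wraparound is easy to demonstrate with Lean's own USize, whose arithmetic wraps exactly like C's size_t (a sketch; 24 stands in for sizeof(lean_sarray_object) on 64-bit):
-- 24 + (2^64 - 1) modulo 2^64 wraps around to 23
#eval (24 : USize) + ((0 : USize) - 1) -- 23
-- the allocation is safe only when
-- elem_size * capacity <= SIZE_MAX - sizeof(lean_sarray_object)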
The overflow can be triggered through lean_io_prim_handle_read, the C
function backing IO.FS.Handle.read:
obj_res lean_io_prim_handle_read(
    b_obj_arg h, usize nbytes
) {
    FILE * fp = io_get_handle(h);
    obj_res res =
        lean_alloc_sarray(1, 0, nbytes); // overflows here
    // ...
    usize n = std::fread(
        lean_sarray_cptr(res), 1, nbytes, fp
    //  ^^^ 23-byte buffer       ^^^ SIZE_MAX count
    );
    // ...
}
A 156-byte crafted ZIP file with a ZIP64 compressedSize of
0xFFFFFFFFFFFFFFFF is sufficient to trigger it. The same pattern
exists in lean_io_get_random_bytes. The bug affects every version of
Lean 4 up to and including the latest nightly
(v4.31.0-nightly-2026-04-11). The minimal reproducer is 5 simple lines:
def main : IO Unit := do
  IO.FS.writeFile "test.bin" "hello"
  let h ← IO.FS.Handle.mk "test.bin" .read
  let n : USize := (0 : USize) - (1 : USize) -- SIZE_MAX
  let _ ← h.read n -- overflows in lean_alloc_sarray
Edit: there is a pending PR to Lean to fix this.
Bug 2: Denial-of-service in the archive parser
AFL also found a denial-of-service in lean-zip's own code. The
readExact function in Archive.lean passes the compressedSize
field from the ZIP central directory straight to h.read without
validating it against the actual file size:
def readExact (h : IO.FS.Handle) (n : Nat) ... := do
  -- ...
  while buf.size < n do
    let remaining := n - buf.size
    -- n comes straight from the ZIP header
    let chunk ← h.read remaining.toUSize
    -- ...
A 156-byte ZIP claiming a compressedSize of several exabytes causes
the process to panic with INTERNAL PANIC: out of memory, as h.read
tries to allocate more memory than is available. This is indeed a bug:
the system unzip handles the same file gracefully, validating header
sizes against the file before allocating, while lean-zip does not and
crashes with an OOM.
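The hardening is conceptually simple. Here is a hedged sketch of one possible fix, with hypothetical names and signature rather than lean-zip's actual API: check the claimed size against the file's real length before any allocation happens.
def readExactChecked (h : IO.FS.Handle) (path : System.FilePath)
    (n : Nat) : IO ByteArray := do
  -- reject a header-supplied size that exceeds the file's true length
  let meta ← path.metadata
  if n > meta.byteSize.toNat then
    throw <| IO.userError s!"header claims {n} bytes, file has {meta.byteSize}"
  let mut buf := ByteArray.empty
  while buf.size < n do
    let chunk ← h.read (n - buf.size).toUSize
    if chunk.isEmpty then -- EOF before n bytes: the header lied
      throw <| IO.userError "unexpected end of file"
    buf := buf ++ chunk
  return buf
This mirrors what the system unzip does: validate header sizes against the file before allocating.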
Why verification didn't catch these bugs
The OOM denial-of-service is straightforward: the archive parser was
never verified. lean-zip's proofs cover the compression and
decompression pipeline (DEFLATE, Huffman, CRC32, roundtrip
correctness), but Archive.lean, the module that reads ZIP headers and
extracts files, has zero theorems even in the original unstripped
codebase. The compressedSize field is read from an untrusted header
and passed directly to an allocation without validation. The situation
is reminiscent of Yang et al.'s Csmith work (PLDI 2011), which found
zero bugs in CompCert's verified optimisation passes while finding
bugs in its unverified front-end. Verification works where it is
applied, and the archive parser is precisely where lean-zip was not
verified.
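What would the missing specification even look like? A hedged sketch, with hypothetical names and a deliberately simplified parser type, of the kind of property Archive.lean could state: any size the parser accepts is bounded by the archive it was read from, and is therefore safe to hand to an allocator.
-- Hypothetical well-behavedness property for an archive parser:
-- sizes parsed out of an archive never exceed the archive itself.
def ParsedSizesBounded (parse : ByteArray → Except String (List Nat)) : Prop :=
  ∀ archive sizes, parse archive = .ok sizes →
    ∀ n ∈ sizes, n ≤ archive.size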
The heap buffer overflow is more fundamental. lean_alloc_sarray is a
C++ function in the Lean runtime, part of the trusted computing
base. Every Lean proof assumes the runtime is correct. A bug here
does not just affect lean-zip. It affects every Lean 4 program that
allocates a ByteArray.
Conclusion
The positive result here is actually the remarkable one. Across 105 million executions, the application code (that is, excluding the runtime) exhibited zero heap buffer overflows, zero use-after-frees, zero stack buffer overflows, zero undefined behaviour (UBSan clean), and zero out-of-bounds array reads in the Lean-generated C code. To quote Claude's own assessment of the codebase (made without knowing it was verified):
This is genuinely one of the most memory-safe codebases I've analyzed. The Lean type system with dependent types and well-founded recursion has eliminated entire classes of bugs that plague C/C++ zip implementations. The CVE classes that have plagued zlib for decades are structurally impossible in this codebase.
The two bugs that were found both sat outside the boundary of what the proofs cover. The denial-of-service was a missing specification. The heap overflow was a deeper issue in the trusted computing base: the C++ runtime that the entire proof edifice assumes is correct (and which now has a pending PR addressing it).
Overall, verification resulted in a remarkably robust and rigorous codebase; AFL and Claude had a genuinely hard time finding errors. But they still found some. Verification is only as strong as the questions you think to ask and the foundations you choose to trust.
Quis custodiet ipsos custodes?