ctrl-F-ing around: how glaurung autonomously discovered a heap overflow in notepad.exe

note: this is a teaching write-up of a workflow, not a vulnerability disclosure. the bug here is a self-inflicted crash, not a security hole — more on why at the end.

i’ve spent the better part of six months building glaurung, a binary-analysis toolkit, around a single bet: that a language model, kept honest by ground truth, could collapse the kind of reverse-engineering that used to take an expert a focused week into something you could do casually — almost distractedly.

this morning i tested that bet, more or less by accident, in about an hour, in between rounds of fortnite squads with my kids, the game on one screen and glaurung on another. in that hour glaurung lifted all 523 functions of notepad.exe; an llm pass and two off-the-shelf scanners surfaced a candidate on their own; the disassembly confirmed it was real; and i reproduced it live on a currently-shipping windows binary: a 32-bit integer overflow in Replace All that wraps a buffer size and smashes the heap.

then i spent the last ten minutes establishing that microsoft neither will nor should fix it. that part matters just as much.

the gap between those two numbers — six months of building, sixty distracted minutes of using — is the actual story. the bug is just what fell out. so this is partly about a notepad crash, but mostly about what the six months bought, and the question that drove them: what does it take to point an llm at a binary without fooling yourself?

the cheap part: lifting the whole binary

the first thing worth internalizing is that decompiling an entire binary is now effectively free.

glaurung’s decompiler runs the usual pipeline — cfg discovery, per-function lifting, ssa, structural analysis, ast lowering, expression reconstruction — and when a pdb is available it folds in symbol names, win32 api prototypes, and a first-cut type-recovery pass. one call lifts everything:

import glaurung as g
res = g.ir.decompile_all("notepad.exe", pdb_cache="...")
# 523 functions in 1.9s (4 ms/function), 466 of them pdb-named

two seconds. and what comes out is structured pseudo-c, not a register dump. here is the encoding-detection routine exactly as decompile_all emits it — no llm involved yet:

// PDB: ?IsTextUTF8@@YAHPEBDH@Z
fn ?IsTextUTF8@@YAHPEBDH@Z {
    push(var0);
    rsp = (rsp - 48);
    var1 = 0;
    MultiByteToWideChar(0xfde9, (var0 + 8), arg0, arg1); // proto: int32_t MultiByteToWideChar(uint32_t CodePage, ... dwFlags, PSTR, int32_t, PWSTR, int32_t)
    t0 = ret;
    if ((t0 != 0)) { var1 = 1; ret = var1; rsp = (rsp + 48); pop(var0); return; }
    GetLastError(); // proto: WIN32_ERROR GetLastError(void)
    if ((ret == 1113)) { ret = var1; rsp = (rsp + 48); pop(var0); return; }
    var1 = 1; ret = var1; rsp = (rsp + 48); pop(var0); return;
}

it still wears its lifted-ir scar tissue: var0/var1 locals, an explicit rsp = rsp - 48, ret used as a return-value temp, the push/pop, and — note this — the dwFlags argument mis-rendered as (var0 + 8), which is just wrong. the lifter is not magic.

but look at what the pdb and the lifter already recovered, for free: the real function name, the win32 prototypes inlined, and the magic constants left intact. 0xfde9 is CP_UTF8; 1113 is ERROR_NO_UNICODE_TRANSLATION. even raw, the intent is legible — try to read the bytes as strict utf-8, treat the specific invalid-sequence error as “not utf-8.” that’s a navigation aid you can actually navigate with.
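to make that intent concrete, here is a minimal sketch of the same idea in python. it uses python's own strict utf-8 decoder as a stand-in for MultiByteToWideChar, so it is an illustration of the check, not a claim that the two accept and reject exactly the same inputs. the name is_text_utf8 is just a label for this sketch.

```python
def is_text_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False on an invalid sequence.

    Stand-in for the lifted routine: Python's decoder plays the role of
    MultiByteToWideChar(CP_UTF8, ...), and UnicodeDecodeError plays the role
    of ERROR_NO_UNICODE_TRANSLATION (1113).
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_text_utf8("h\u00e9llo".encode("utf-8")))  # True
print(is_text_utf8(b"\xff\xfe bad"))               # False
```

note the lifted routine returns 1 for any failure other than error 1113; the sketch does not model those other failure paths.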

if the static lift is two seconds, the move is to never think about whether to do it. lift everything, cache it, and spend the expensive budget — human attention, or an llm’s — selectively.
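one way to make "lift everything, cache it" automatic is to key the cache on the binary's hash, so the question of whether to re-lift never comes up. the sketch below is an assumption about how that could look, not glaurung's actual caching: cached_lift and the lift callable are hypothetical names, and the lift result is assumed to be json-serializable.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_lift(binary: Path, cache_dir: Path, lift: Callable[[Path], dict]) -> dict:
    """Return the lift result for `binary`, running `lift` only on a cache miss.

    The cache key is the SHA-256 of the file contents, so a changed binary
    never reuses a stale result.
    """
    digest = hashlib.sha256(binary.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = lift(binary)  # the expensive step, assumed to return a plain dict
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

in use, lift would be a thin wrapper around whatever produces the lifted functions; everything downstream reads from the cache.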

the part that fools people: the llm pass

the next tier is to hand each lifted function to an llm and ask for a cleaned-up version with meaningful names, a one-line summary, and any security concerns. for notepad: 523 functions, ~90 seconds, about 791k tokens on a small model. pennies.

the same IsTextUTF8 comes back as something you’d accept in review:

int IsTextUTF8(const char *input, int length) {
    int converted = MultiByteToWideChar(0xFDE9, (DWORD)(uintptr_t)(input + 8), input, length);
    if (converted != 0) return 1;
    if (GetLastError() == 1113) return 0;
    return 1;
}

scars gone, control flow collapsed, readable. but look closely: it carried the lifter’s bogus (input + 8) flags argument straight through. to its credit the model’s notes flagged that argument as “likely a misinterpreted parameter” — and then left it in the code anyway.

the results read beautifully. and that is exactly the trap.

the llm-cleaned c is a paraphrase of a paraphrase. the decompiler already guesses; the llm then guesses about the guess.

i ran two off-the-shelf source scanners (semgrep and a dangerous-c-functions pass) over the enhanced c. they parsed it, they ran, and they flagged 14 memcpy sites and a pile of “no visible bounds check” notes — 361 of 523 functions got some security caveat. taken at face value, that number is pure noise. it is the same sub-1%-true-positive regime you get from pointing an llm at source and saying “find bugs.”

what makes the difference is a single rule, applied without exception:

a hit on the lifted c is a candidate. ground-truth disassembly is the verdict.

glaurung keeps the capstone disassembly next to the lifted c precisely so you can confirm a claim at the instruction level before you believe it. in practice this rule cuts both ways, which is the part worth seeing.

the scanners (and the llm) flagged a wide-string helper, make_unique_string_nothrow, for a char_count * 2 allocation that looked unguarded in the paraphrase. the disassembly refused the claim:

cmp rsi, -1            ; reject SIZE_MAX
lea r15, [rsi*2 + 2]   ; size = count*2 + 2, in 64-bit
call <alloc>

the math is 64-bit and there is a sentinel check. it cannot overflow for any realistic count. the paraphrase invented a bug; the disassembly killed it. if you ship verdicts from the lifted c, you ship that false positive with confidence.
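a small model of those two instructions shows why the claim dies. this is a sketch of the arithmetic only: it assumes the cmp against -1 leads to a reject path (as the comment in the listing says) and does not model the allocator. wide_alloc_size_64 is a name made up for this sketch.

```python
MASK64 = (1 << 64) - 1


def wide_alloc_size_64(count: int):
    """Model of `cmp rsi, -1` / `lea r15, [rsi*2 + 2]` in 64-bit registers.

    Returns None when the sentinel check rejects the count, otherwise the
    size as the 64-bit register would hold it.
    """
    count &= MASK64
    if count == MASK64:               # cmp rsi, -1: the SIZE_MAX sentinel
        return None
    return (count * 2 + 2) & MASK64   # lea r15, [rsi*2 + 2]


# A count that would wreck 32-bit math is harmless here.
print(wide_alloc_size_64(2**31))      # 4294967298, the true size
# The first count whose size wraps is 2**63 - 1, far beyond any real string.
print(wide_alloc_size_64(2**63 - 2))  # 18446744073709551614, still exact
print(wide_alloc_size_64(2**63 - 1))  # 0, wrapped
```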

the part that was real

a sibling pattern did survive. notepad’s Replace All (?HeapBufferReplaceAll@@) sizes its new buffer like this, at ground truth:

imul eax, ebp          ; eax = replacement_len * match_count  (32-bit)
lea  eax, [rax*2 + 2]  ; *2 + 2 for wide chars
test eax, eax
jle  <bail>            ; the ONLY guard: rejects <= 0, not a positive wrap
call <GlobalAlloc>      ; uBytes = the (possibly wrapped) value

the replacement length is capped at 128 (wcsnlen(buf, 0x80)). the match count comes from the document and is accumulated into a 32-bit register with no cap. the product is a 32-bit imul. the only sanity check is jle, which catches a value that wrapped to negative — but not one that wrapped to a small positive number. and the copy loop afterward runs match_count times with no check of the write cursor against the allocation.

so: push replacement_len * match_count to 2^31 or beyond and the doubled size can wrap past 32 bits to a small positive number. you get an undersized allocation followed by a copy that writes the full logical size into it. classic cwe-190 into cwe-787.

before touching a vm, i confirmed the mechanism by emulating the real function bytes (unicorn, no windows): with 16,777,216 matches and a 128-character replacement, the function asks GlobalAlloc for 2 bytes to hold what logically needs 4.29 gb, then the copy faults two bytes in. that is the bug, demonstrated on the actual instructions rather than the paraphrase.
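the arithmetic is easy to check by hand. the sketch below models only the three sizing instructions (imul, lea, jle) with 32-bit masking; it is not the emulation harness, and the function names are invented for illustration.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def replace_all_alloc_size(replacement_len: int, match_count: int):
    """Return the byte count passed to the allocator, or None if jle bails."""
    product = (replacement_len * match_count) & MASK32  # imul eax, ebp
    size = (product * 2 + 2) & MASK32                   # lea eax, [rax*2 + 2]
    if to_signed32(size) <= 0:                          # test eax, eax / jle
        return None
    return size


def logical_size(replacement_len: int, match_count: int) -> int:
    """Bytes the output actually needs, computed without truncation."""
    return replacement_len * match_count * 2 + 2


print(replace_all_alloc_size(128, 16_777_216))  # 2
print(logical_size(128, 16_777_216))            # 4294967298
print(replace_all_alloc_size(128, 16_777_215))  # None: wraps negative, guard fires
print(replace_all_alloc_size(128, 1_000))       # 256002: no wrap, size is correct
```

one match fewer and the size lands negative, so the jle guard catches it; exactly 2^24 matches is where the wrapped value first comes out positive with a 128-character replacement.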

proving it on the real thing

emulation proves the function; it does not prove the product. so i booted the matching windows image in qemu — same notepad.exe, sha-for-sha, version 10.0.26100 (currently shipping) — and drove the real Replace dialog programmatically.

(an aside that cost me far more time than it should have: a gui app launched over ssh lands in a non-interactive window station and never gets a visible window. the fix is to run the driver as an interactive scheduled task in the logged-on console session. file that away for any windows gui automation.)
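the post does not say how the scheduled task was created. one plausible way, shown here as an assumption, is windows' built-in schtasks with the /IT switch, which makes the task run interactively when the named user is logged on. the helper only builds the command lines; the task name, command, and user in the example are hypothetical.

```python
def interactive_task_commands(task_name: str, command: str, user: str,
                              start_time: str = "23:59"):
    """Build schtasks argument lists for an interactive one-shot task.

    Returns (create, run): `create` registers the task, `run` starts it
    immediately. Nothing is executed here.
    """
    create = [
        "schtasks", "/Create", "/F",   # /F overwrites an existing task
        "/TN", task_name,
        "/TR", command,
        "/SC", "ONCE", "/ST", start_time,
        "/RU", user,
        "/IT",                         # interactive: needs the user logged on
    ]
    run = ["schtasks", "/Run", "/TN", task_name]
    return create, run


create, run = interactive_task_commands(
    "drive_notepad", r"C:\tools\drive_notepad.cmd", "tester")
print(" ".join(create))
print(" ".join(run))
```

over ssh you would execute the create line once, then the run line each time you want the driver to appear on the console desktop.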

the result, and the part that makes it evidence rather than anecdote:

a 16,777,216-character file in which every character is a match + a 128-character replacement + Replace All → notepad dies instantly. reproduced repeatedly.
the identical operation on a 1,000-match file → no crash. notepad finishes the replace and carries on.

same code path, same keystrokes; only the match count differs. that differential is what ties the crash to the integer overflow rather than to Replace All in general. find, prove, reproduce — all the way to a live binary.

the twist: it is not a vulnerability

here is where most write-ups would cue the disclosure timeline. instead, the honest conclusion is that this does not belong in a security report at all, and walking through why is more useful than the bug.

under microsoft’s security servicing criteria, a flaw gets serviced as a security issue when it crosses a security boundary and an attacker can actually drive it. this clears neither bar:

no boundary is crossed. the corruption happens in your own notepad, running as you. the best case is code execution at the privilege you already have; the likely case is just a crash. nothing is gained that you could not already do.
the trigger is not attacker-controlled. the overflow needs replacement_len * match_count ≥ 2^31. the attacker controls the file (the match count); the replacement string is whatever the victim types into the Replace box. a one-character replacement would need about 2.1 billion matches, which at two bytes per wide character is a ~4 gb document you can’t open. there is no path where opening someone’s .txt triggers this: it requires the victim to choose, by hand, to replace-all with a long string.
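the dependence on the victim's replacement string can be put in numbers. this sketch computes the smallest match count that reaches the 2^31 threshold for a given replacement length, plus a lower bound on the in-memory text, assuming each match is at least one two-byte wide character and matches do not overlap. reaching the threshold is necessary rather than sufficient, since the wrapped size also has to come out positive.

```python
def min_matches_to_wrap(replacement_len: int) -> int:
    """Smallest match_count with replacement_len * match_count >= 2**31."""
    return -(-(1 << 31) // replacement_len)  # ceiling division


for replacement_len in (1, 16, 128):
    matches = min_matches_to_wrap(replacement_len)
    # Lower bound: one wide character (2 bytes) per match, nothing else in the file.
    min_text_gib = matches * 2 / 2**30
    print(f"len={replacement_len:3d}  matches>={matches:,}  text>={min_text_gib:g} GiB")
```

at length 128 the bound is 16,777,216 matches in about 32 mib of text; at length 1 it is 2,147,483,648 matches in 4 gib.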

contrast the real notepad rce from earlier this year, cve-2026-20841: that one is the modern store notepad, where the attacker controls the entire markdown payload and a single click launches code. attacker-driven, recognized vector, boundary crossed. ours is the classic notepad.exe, half attacker-controlled, no boundary. different universe.

so: a genuine memory-corruption bug, root-caused, emulated, and reproduced live on shipping software — and correctly a reliability defect, not a security one. the right move is to not file it.

the lesson i’m keeping is about ordering. “can i reproduce it” is seductive and i chased it through emulation and a vm before asking the cheaper, more important question: “does this cross a boundary, and can an attacker actually drive it?” that test costs thirty seconds and it should come first. reproduction is necessary for disclosure; it is nowhere near sufficient.

what this is actually about

the notepad bug is a vehicle. the thing i care about is the shape of the workflow, because i think it generalizes:

1. lift everything: it’s two seconds, so it’s never a decision.
2. let cheap tools rank: llms, semgrep, grep, whatever. they are candidate generators, and they are noisy, and that’s fine.
3. confirm on ground truth: the disassembly is the authority. it confirms the real ones and, just as importantly, refutes the plausible-sounding fakes.
4. reproduce: emulate the bytes, then drive the live binary, with a differential control so the result means something.
5. apply the boundary test before you celebrate, ideally before step 4.
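the ordering can be written down as a tiny triage function. this is a sketch of the decision logic only, with the boundary test deliberately placed ahead of reproduction; the Finding fields and verdict strings are invented for illustration and are not part of glaurung.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    name: str
    confirmed_in_disassembly: bool   # ground truth agrees with the candidate
    crosses_boundary: bool = False   # does it cross a security boundary?
    attacker_driven: bool = False    # can an attacker trigger it unaided?
    reproduced: bool = False         # emulated and driven on the live binary


def verdict(finding: Finding) -> str:
    """Classify a candidate, asking the cheap questions first."""
    if not finding.confirmed_in_disassembly:
        return "false positive"
    if not (finding.crosses_boundary and finding.attacker_driven):
        return "reliability defect"
    if not finding.reproduced:
        return "security candidate, needs reproduction"
    return "security vulnerability"


helper = Finding("make_unique_string_nothrow", confirmed_in_disassembly=False)
replace_all = Finding("HeapBufferReplaceAll", confirmed_in_disassembly=True,
                      reproduced=True)
print(verdict(helper))       # false positive
print(verdict(replace_all))  # reliability defect
```

the two findings from this write-up land where the prose put them: the wide-string helper is refuted at the disassembly, and Replace All is real but fails the boundary test.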

llms are very good at the noisy, generative middle and very bad at being trusted. glaurung’s job is to keep them honest — to make the ground truth cheap enough to check every claim against. the six months went into making each of those five steps boring and fast; the payoff is that a saturday-morning hour, run at maybe sixty percent attention, gets you from a shipping binary to a confirmed, reproduced finding and a defensible verdict on whether it matters.

that last clause is the one that’s hard to automate, and the one that counts.

ps: yes, your notepad will survive 16 million find-and-replaces. it just won’t enjoy the 16-million-and-first.