tokenizing raw executables for malware analysis with bbpe
| | |
|---|---|
| pretrained models | mjbommar/binary-tokenizer-001-{4k,8k,16k,32k,64k} on HuggingFace |
| what | bpe tokenizers for raw binary executables — ELF, PE, Mach-O, APK |
| compression | 2.89 bytes/token (64K vocab), 3x context vs raw byte tokenization |
| use cases | malware classification, binary similarity, function identification |
| license | Apache-2.0 (code), CC-BY-4.0 (dataset) |
overview
bbpe applies byte pair encoding to raw binary executables — ELF, PE, Mach-O, and APK files. the “binary” in the name refers to compiled binaries, firmware, and malware samples, not binary classification.
the pretrained tokenizers are available on HuggingFace in five vocabulary sizes (4K through 64K) and can be loaded directly with the tokenizers python library. the 64K tokenizer achieves 2.89 bytes/token compression, fitting roughly 3x more binary content into a given context window than raw byte-level tokenization.
the tokenizer family was trained on 30 GB of cross-platform binaries spanning 10 architectures. all tokenizers form a nested prefix hierarchy — the 4K vocabulary is a strict prefix of the 8K, which is a prefix of the 16K, and so on — enabling seamless embedding transfer when scaling models.
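the prefix property is easy to state concretely. the sketch below uses toy vocabularies (not the real ones) to show what it means for one vocabulary to be a strict prefix of another: every shared ID maps to the same token, so embedding rows learned for a small-vocab model carry over unchanged to a larger one.

```python
# toy illustration of the nested prefix property (hypothetical vocabularies,
# not the real 4K/8K/... tokenizer contents)

def is_prefix_vocab(small: dict[int, str], large: dict[int, str]) -> bool:
    """Return True if every ID in `small` maps to the same token in `large`."""
    return len(small) <= len(large) and all(
        large.get(i) == tok for i, tok in small.items()
    )

# hypothetical 4-entry and 6-entry vocabularies
vocab_small = {0: "\x00", 1: "\xff", 2: "\x00\x00", 3: "ELF"}
vocab_large = {**vocab_small, 4: "\x90\x90", 5: "PE\x00\x00"}

print(is_prefix_vocab(vocab_small, vocab_large))  # True
print(is_prefix_vocab({0: "\x01"}, vocab_large))  # False: ID 0 maps differently
```

because embedding row i means the same byte sequence in every family member, a model trained with the 4K tokenizer can be widened to the 32K one by copying its first 4K embedding rows and initializing only the new ones.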
when to use
- tokenizing executables for transformer-based malware classification
- binary similarity search across platforms and architectures
- function boundary or purpose identification in stripped binaries
- any ML pipeline that needs fixed-vocabulary representations of compiled code
using the pretrained tokenizers
install
```shell
uv pip install tokenizers
```

load from huggingface
```python
from tokenizers import Tokenizer

# load the 64K production tokenizer (recommended)
tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-64k")

# or any other size in the family
tokenizer_32k = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-32k")
tokenizer_16k = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-16k")
tokenizer_8k = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-8k")
tokenizer_4k = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-4k")
```

tokenize a binary
the tokenizer operates on latin-1 decoded strings, which preserves a 1:1 byte mapping:
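as a quick standalone check of that claim (no tokenizer needed), all 256 byte values survive a latin-1 round trip:

```python
# latin-1 assigns every byte value 0-255 its own character, so decoding and
# re-encoding arbitrary binary data is lossless
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes
print("lossless for all 256 byte values")
```

utf-8, by contrast, would reject bytes such as 0xFF outright, which is why latin-1 is the conventional choice for shuttling raw bytes through string-only APIs.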
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-64k")

# read a binary file and decode as latin-1 (preserves all byte values)
with open("/usr/bin/ls", "rb") as f:
    binary_data = f.read()
text = binary_data.decode("latin-1")

encoding = tokenizer.encode(text)
print(f"Original size: {len(binary_data):,} bytes")
print(f"Token count: {len(encoding.ids):,} tokens")
print(f"Compression: {len(binary_data) / len(encoding.ids):.2f} bytes/token")
print(f"First 10 token IDs: {encoding.ids[:10]}")
```

decode back to bytes
```python
# decode token ids back to latin-1 string, then to bytes
decoded_text = tokenizer.decode(encoding.ids)
decoded_bytes = decoded_text.encode("latin-1")
assert decoded_bytes == binary_data  # perfect reconstruction
```

inspect individual tokens
```python
# look up what each token represents
for i, token_id in enumerate(encoding.ids[:5]):
    token_str = tokenizer.id_to_token(token_id)
    token_bytes = token_str.encode("latin-1")
    print(f"Token {i}: ID={token_id}, hex={token_bytes.hex()}, len={len(token_bytes)}")
```

compare vocabulary sizes on a file
```python
from tokenizers import Tokenizer

sizes = ["4k", "8k", "16k", "32k", "64k"]
binary_data = open("/usr/bin/grep", "rb").read()
text = binary_data.decode("latin-1")

print(f"File size: {len(binary_data):,} bytes\n")
for size in sizes:
    tok = Tokenizer.from_pretrained(f"mjbommar/binary-tokenizer-001-{size}")
    enc = tok.encode(text)
    bpt = len(binary_data) / len(enc.ids)
    print(f"  {size:>3s}: {bpt:.3f} bytes/token ({len(enc.ids):,} tokens)")
```

batch processing
```python
from tokenizers import Tokenizer
from pathlib import Path

tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-64k")

results = []
for path in Path("/corpus").glob("**/*"):
    if not path.is_file():
        continue
    data = path.read_bytes()
    if not data:
        continue  # skip empty files to avoid division by zero
    tokens = tokenizer.encode(data.decode("latin-1"))
    results.append({
        "file": path.name,
        "bytes": len(data),
        "tokens": len(tokens.ids),
        "bytes_per_token": len(data) / len(tokens.ids),
    })

# sort by compression ratio (highest bytes/token, i.e. best compression, first)
for r in sorted(results, key=lambda x: x["bytes_per_token"], reverse=True):
    print(f"{r['file']:30s} {r['bytes_per_token']:.3f} bytes/token")
```

special tokens
the tokenizers include standard transformer special tokens for downstream model training:
| token | ID (64K) | purpose |
|---|---|---|
| <\|start\|> | 65529 | sequence start |
| <\|end\|> | 65530 | sequence end |
| <\|pad\|> | 65531 | padding |
| <\|unk\|> | 65532 | unknown token |
| <\|cls\|> | 65533 | classification |
| <\|sep\|> | 65534 | separator |
| <\|mask\|> | 65535 | masking |
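a minimal sketch of how these IDs might be used when assembling fixed-length model inputs; the helper below is hypothetical, only the ID values come from the table above:

```python
# documented special-token IDs for the 64K tokenizer
START, END, PAD = 65529, 65530, 65531

def prepare_input(ids: list[int], max_len: int) -> list[int]:
    """Truncate, wrap with start/end markers, and right-pad to max_len
    (hypothetical helper for batched model input)."""
    body = ids[: max_len - 2]  # leave room for the start/end markers
    seq = [START] + body + [END]
    return seq + [PAD] * (max_len - len(seq))

example = prepare_input([17, 42, 99], max_len=8)
print(example)  # [65529, 17, 42, 99, 65530, 65531, 65531, 65531]
```

each smaller family member uses its own top-of-vocabulary IDs for the same seven roles, so this pattern transfers across sizes with different constants.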
pretrained model details
tokenizer family
all five tokenizers form a nested prefix hierarchy trained on 30 GB of cross-platform binaries:
| vocab size | bytes/token | when to use |
|---|---|---|
| 4K | 2.01 | very constrained context windows |
| 8K | 2.21 | limited compute resources |
| 16K | 2.41 | moderate compression |
| 32K | 2.64 | good balance of compression and vocab size |
| 64K | 2.89 | production (recommended) |
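the practical effect of the bytes/token column is easy to quantify: for a fixed token budget, the number of binary bytes visible to the model scales linearly with compression. a quick back-of-the-envelope, assuming a hypothetical 8192-token context window:

```python
# bytes of binary visible per context window = tokens x bytes/token
context_tokens = 8192  # hypothetical model context length
for name, bpt in [("raw bytes", 1.00), ("4K", 2.01), ("64K", 2.89)]:
    print(f"{name:>9s}: {context_tokens * bpt:>9,.0f} bytes per window")
```

so the 64K tokenizer lets the same window cover roughly 23 KB of executable instead of 8 KB.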
compression by platform
measured on the binary-30k dataset:
| platform | 4K | 8K | 16K | 32K | 64K |
|---|---|---|---|---|---|
| linux | 1.625 | 1.769 | 1.937 | 2.139 | 2.385 |
| windows | 2.119 | 2.341 | 2.595 | 2.905 | 3.274 |
| macos | 2.183 | 2.333 | 2.542 | 2.806 | 3.058 |
| android | 1.074 | 1.102 | 1.152 | 1.242 | 1.411 |
| overall | 1.766 | 1.925 | 2.112 | 2.341 | 2.613 |
windows binaries compress more efficiently due to the regularity of PE format structures. android APKs compress less because they contain compressed data (dex, zip) that resists BPE merges.
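the mechanism behind that last row can be seen directly: BPE grows its vocabulary by merging the most frequent adjacent byte pair, and compressed streams are engineered to have no frequent pairs. a small illustration, using zlib output as a stand-in for the dex/zip payloads inside an APK:

```python
import zlib
from collections import Counter

def top_pair_share(data: bytes) -> float:
    """Fraction of adjacent byte pairs taken by the single most common pair,
    i.e. the raw material a BPE merge step feeds on."""
    pairs = Counter(zip(data, data[1:]))
    return max(pairs.values()) / (len(data) - 1)

# structured, repetitive data (stand-in for PE/ELF headers and padding)
structured = (b"\x00" * 64 + b"MZ\x90\x00") * 256
compressed = zlib.compress(structured, level=9)

print(f"structured: top pair = {top_pair_share(structured):.1%}")
print(f"compressed: top pair = {top_pair_share(compressed):.1%}")
```

the structured buffer is dominated by one pair (zero runs), while its compressed form has a nearly flat pair distribution, leaving BPE nothing worth merging.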
what the tokenizer learns
the 64K vocabulary breaks down roughly as:
- 27.5% instruction patterns (x86/ARM opcodes and common instruction sequences)
- 12.5% readable strings (ASCII text from symbol tables, error messages)
- 7.5% high-byte patterns (non-ASCII data structures)
- 52% mixed format structures (headers, metadata, padding)
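a rough way to reproduce this kind of breakdown is to bucket each token's bytes by content. the heuristic below is illustrative only; the actual categorization, especially for instruction patterns, requires disassembly and is more involved:

```python
def categorize(token_bytes: bytes) -> str:
    """Bucket a token by byte content (illustrative heuristic, not the
    methodology behind the percentages above)."""
    if token_bytes and len(token_bytes) >= 3 and all(0x20 <= b < 0x7F for b in token_bytes):
        return "readable string"
    if token_bytes and all(b >= 0x80 for b in token_bytes):
        return "high-byte pattern"
    return "mixed/structure"

print(categorize(b"error: "))       # readable string
print(categorize(b"\xff\xfe\xfd"))  # high-byte pattern
print(categorize(b"\x7fELF\x01"))   # mixed/structure
```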
binary-30k dataset
mjbommar/binary-30k is a companion dataset of 29,793 cross-platform binaries (10.8 GB, CC-BY-4.0):
- 73% benign, 27% malware
- windows 57%, linux 28%, macos 2%, android 1%, other 12%
- 10 architectures: x86-64, x86, ARM64, ARMv7, MIPS64, MIPS32, PowerPC64, RISC-V64, s390x
- 70/15/15 train/validation/test split with 4-dimensional stratification
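the stratified split can be sketched as grouping on the stratification keys and splitting 70/15/15 inside each group, so every combination appears in all three sets. the dimensions and records below are hypothetical stand-ins, not the dataset's actual metadata schema:

```python
import random

def stratified_split(items, key, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split 70/15/15 within each stratum defined by key(item)."""
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for members in groups.values():
        rng.shuffle(members)
        a = round(len(members) * fractions[0])
        b = a + round(len(members) * fractions[1])
        train += members[:a]
        val += members[a:b]
        test += members[b:]  # remainder goes to test
    return train, val, test

# hypothetical records carrying four stratification dimensions
corpus = [
    {"platform": p, "arch": a, "label": l, "size_bucket": s}
    for (p, a, l, s) in [
        ("linux", "x86-64", "benign", "small"),
        ("windows", "x86", "malware", "large"),
    ]
    for _ in range(20)
]
train, val, test = stratified_split(
    corpus, key=lambda r: (r["platform"], r["arch"], r["label"], r["size_bucket"])
)
print(len(train), len(val), len(test))  # 28 6 6
```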
training your own tokenizer
if you need to train on a custom corpus (domain-specific firmware, IoT devices, etc.), the bbpe rust crate provides a CLI:
```shell
cargo install bbpe
```

train on binary files
```shell
bbpe train firmware.bin --vocab-size 4096 --min-frequency 4 \
  --preprocessor null-delimited -o tokenizer.json
```

train on JSONL corpora
```shell
bbpe train --jsonl corpus.jsonl:text --vocab-size 8192 \
  --preprocessor ascii-whitespace --preprocessor-probability 0.75 \
  -o tokenizer.json
```

key training features
- entropy filtering: --min-entropy/--max-entropy to skip compressed or encrypted regions
- hierarchical families: smaller vocabs are strict prefixes of larger ones
- streaming trainer: bounded memory for large corpora
- JSONL support: dot-notation field paths with gzip decompression
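the entropy filter is worth seeing concretely. below is a sketch of the idea, assuming (as an illustration, not the crate's documented semantics) that fixed-size windows are kept when their shannon entropy falls between the two thresholds:

```python
import math

def shannon_entropy(window: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for constant data, near 8.0 for random."""
    counts = [0] * 256
    for b in window:
        counts[b] += 1
    n = len(window)
    return 0.0 - sum(c / n * math.log2(c / n) for c in counts if c)

def keep_windows(data: bytes, size=256, min_entropy=0.5, max_entropy=7.5):
    """Yield (offset, window) pairs inside the entropy band, dropping
    zero padding (too low) and compressed/encrypted runs (too high)."""
    for off in range(0, len(data) - size + 1, size):
        window = data[off : off + size]
        if min_entropy <= shannon_entropy(window) <= max_entropy:
            yield off, window

padding = b"\x00" * 256           # constant bytes: entropy 0.0, filtered out
code_like = bytes(range(64)) * 4  # 64 distinct values: entropy exactly 6.0, kept
kept = list(keep_windows(padding + code_like))
print([off for off, _ in kept])  # [256]
```

filtering like this keeps the trainer's merge statistics focused on learnable structure instead of incompressible payloads.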
related projects
- glaurung — binary analysis framework with first-class AI integration
- binary-tokenizer-paper — paper reproduction code, analysis scripts, and sample binaries
- ensemble-bpe-paper — ensemble tokenization experiments
references
- Bommarito, M.J. “Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis.” arXiv:2511.17573, November 2025. https://arxiv.org/abs/2511.17573
- Bommarito, M.J. “Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection.” arXiv:2511.22095, November 2025. https://arxiv.org/abs/2511.22095
- bbpe crate on crates.io: https://crates.io/crates/bbpe
- Pretrained tokenizers on HuggingFace: https://huggingface.co/mjbommar