Binary BPE Tokenizers
Cross-platform Byte Pair Encoding tokenizers for binary executables, enabling 2-3× more efficient transformer-based binary analysis across ELF, PE, Mach-O, and APK formats
The first cross-platform Byte Pair Encoding (BPE) tokenizer family designed specifically for binary executables, trained on 24GB of diverse binaries spanning multiple platforms, architectures, and operating systems.
Research Impact
Binary BPE addresses a fundamental bottleneck in transformer-based binary analysis: raw byte-level tokenization wastes precious context window capacity. The tokenizer family achieves:
- 2-3× compression on typical uncompressed binaries (ELF/PE/Mach-O)
- 2.89 bytes/token average compression ratio with the 64K vocabulary
- 67.2% vocabulary utilization (43,945 of 65,536 tokens actively used)
- Cross-platform pattern learning without supervision
Publication
“Learning to Tokenize Binaries: Cross-Platform BPE for Transformer-Based Binary Analysis”
- Author: Michael J. Bommarito II
- Published: November 2025
- Availability: arXiv submission pending
Tokenizer Family
The Binary BPE family includes five vocabulary sizes with a perfect nested hierarchy (each smaller vocabulary is a strict prefix of larger ones):
Vocabulary Sizes
- binary-tokenizer-001-4k: 4,096 tokens - Resource-constrained edge devices
- binary-tokenizer-001-8k: 8,192 tokens - Embedded malware scanners
- binary-tokenizer-001-16k: 16,384 tokens - Balanced research prototypes
- binary-tokenizer-001-32k: 32,768 tokens - Cloud-based analysis
- binary-tokenizer-001-64k: 65,536 tokens - Maximum compression for datacenters
All tokenizers are available on Hugging Face.
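Assuming each Hub repository hosts a standard tokenizer.json (the repository name below follows the mjbommar/binary-tokenizer-001-* pattern noted under Availability, and is otherwise an assumption), a family member can be loaded directly:

```python
from tokenizers import Tokenizer

# Assumed repository name following the mjbommar/binary-tokenizer-001-* pattern
tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-64k")
print(tokenizer.get_vocab_size())  # expected: 65536
```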
Training Data
Trained on the Binary-30K dataset:
- 30,000 unique binaries totaling 24GB
- Platform coverage: Linux (Alpine, Debian, Ubuntu), Windows (8/10/11), macOS, Android APKs
- Architecture coverage: x86-64, x86-32, ARM64, ARM32, MIPS, RISC-V
- File formats: ELF, PE, Mach-O, APK
- Includes malware samples from SOREL-20M and Malware Bazaar
Key Features
Unsupervised Pattern Discovery
Without explicit supervision, the tokenizer learns (see the probe sketch after this list):
- File format structures: ELF magic numbers (\x7fELF), PE headers (MZ), Mach-O signatures
- Architecture-specific patterns: x86-64 REX prefixes, instruction encodings (27.5% of vocabulary)
- Cross-platform sequences: Null padding, ASCII library paths, dynamic linking strings
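A quick way to probe this, assuming the 64K tokenizer JSON is available locally and following the hex-encoding convention used in the usage example below: a well-learned magic number should compress to very few tokens.

```python
from tokenizers import Tokenizer

# Assumed local file name; see the usage example below for the same convention
tokenizer = Tokenizer.from_file("tokenizer-65536.json")

# File-format magic numbers the vocabulary is reported to capture
magics = {
    "ELF": b"\x7fELF",
    "PE": b"MZ",
    "Mach-O (64-bit)": b"\xcf\xfa\xed\xfe",
}

for name, magic in magics.items():
    # Hex-encode the bytes to match the string-based encode() API
    ids = tokenizer.encode(magic.hex()).ids
    print(f"{name}: {len(magic)} bytes -> {len(ids)} token(s)")
```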
Platform Performance
| Platform | Bytes/Token | Vocab Coverage | Median Tokens per Binary |
|---|---|---|---|
| Linux | 2.95 | 71.3% | 52,341 |
| Windows | 2.87 | 68.9% | 48,127 |
| macOS | 2.67 | 65.4% | 41,892 |
| Android | 1.85 | 52.1% | 127,563 |
Vocabulary Size Scaling
| Vocab Size | Bytes/Token | Improvement vs. 4K | Tokens Used (Utilization) |
|---|---|---|---|
| 4K (2¹²) | 2.01 | baseline | 3,892 (95%) |
| 8K (2¹³) | 2.21 | +10.0% | 7,234 (88%) |
| 16K (2¹⁴) | 2.41 | +19.9% | 13,567 (83%) |
| 32K (2¹⁵) | 2.64 | +31.3% | 24,891 (76%) |
| 64K (2¹⁶) | 2.89 | +43.8% | 43,945 (67%) |
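The Improvement column follows arithmetically from the bytes/token column; a quick check:

```python
# Relative improvement in bytes/token over the 4K baseline (2.01)
baseline = 2.01
for vocab, bpt in [("8K", 2.21), ("16K", 2.41), ("32K", 2.64), ("64K", 2.89)]:
    print(f"{vocab}: +{(bpt / baseline - 1) * 100:.1f}%")
# 8K: +10.0%  16K: +19.9%  32K: +31.3%  64K: +43.8%
```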
Technical Advantages
Nested Hierarchy
The tokenizer family exhibits perfect nesting: token IDs 0-4095 in the 4K tokenizer map to identical byte sequences in all larger vocabularies (a verification sketch follows this list). This enables:
- Embedding transfer when scaling model capacity
- Progressive development from prototypes to production
- Cost-effective experimentation across resource constraints
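A minimal way to check the nesting property, assuming the 4K and 64K tokenizer JSON files have been downloaded locally (the file names are illustrative):

```python
from tokenizers import Tokenizer

# Illustrative local file names for two members of the family
small = Tokenizer.from_file("tokenizer-4096.json")
large = Tokenizer.from_file("tokenizer-65536.json")

# Strict-prefix nesting means every ID in the small vocabulary
# maps to the same token string in the larger vocabulary
mismatches = sum(
    small.id_to_token(i) != large.id_to_token(i)
    for i in range(small.get_vocab_size())
)
print(f"Mismatched IDs: {mismatches}")  # expected: 0
```

Because the IDs agree, embeddings trained against the 4K vocabulary can directly initialize the first 4,096 rows of a larger model's embedding table.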
Context Window Efficiency
At an average of roughly 2.6 bytes/token (the arithmetic is sketched after this list):
- 8,192-token context → ~21KB of binary content
- 32,768-token context → ~84KB of binary content
- 128,000-token context → ~330KB of binary content
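These figures follow from multiplying context length by the average compression ratio (small differences from the list above reflect rounding):

```python
# Approximate binary payload per context window at ~2.6 bytes/token
for ctx in (8_192, 32_768, 128_000):
    print(f"{ctx:>7} tokens -> ~{ctx * 2.6 / 1000:.0f} KB")
# prints roughly 21, 85, and 333 KB respectively
```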
Format and Architecture Support
- File formats: ELF (Linux), PE (Windows), Mach-O (macOS), APK (Android)
- Architectures: x86, x86-64, ARM, ARM64, MIPS, RISC-V
- Handles: Stripped binaries, obfuscated code, mixed code/data sections
Applications
The Binary BPE family enables efficient transformer-based analysis for:
- Malware detection and classification
- Vulnerability discovery in proprietary software
- Binary similarity detection and plagiarism analysis
- Function-purpose identification in stripped binaries
- Reverse engineering assistance in tools like Ghidra and IDA Pro
- Binary optimization and dead code detection
Implementation
Rust Training Implementation
The bbpe Rust package provides:
- Chunk-based processing with entropy filtering
- 8KB chunk size with a 7.0 bits/byte maximum entropy threshold (sketched in Python below)
- HuggingFace-compatible tokenizer JSON output
- Open-source training pipeline for custom vocabularies
Available at: github.com/mjbommar/binary-bpe
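The entropy filter drops chunks that look compressed or encrypted, since near-random data yields no reusable byte patterns for BPE to merge. A minimal Python sketch of the idea, using the 8KB chunk size and 7.0 bits/byte threshold reported above; the actual bbpe pipeline is implemented in Rust and its exact filtering rule may differ:

```python
import math
from collections import Counter

CHUNK_SIZE = 8 * 1024  # 8KB chunks, as in the bbpe pipeline
MAX_ENTROPY = 7.0      # bits/byte; higher suggests packed/encrypted data

def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte."""
    counts = Counter(chunk)
    total = len(chunk)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def training_chunks(data: bytes):
    """Yield low-entropy chunks suitable for BPE training."""
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        if chunk and shannon_entropy(chunk) <= MAX_ENTROPY:
            yield chunk
```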
Usage Example
```python
from tokenizers import Tokenizer

# Load the production tokenizer (64K vocabulary)
tokenizer = Tokenizer.from_file("tokenizer-65536.json")

# Read the raw bytes of the target binary
with open("binary_file", "rb") as f:
    binary_data = f.read()

# Hex-encode the bytes for the string-based encode() API, then tokenize
tokens = tokenizer.encode(binary_data.hex())

print(f"Compressed {len(binary_data)} bytes → {len(tokens.ids)} tokens")
print(f"Compression ratio: {len(binary_data) / len(tokens.ids):.2f} bytes/token")
```
Availability
All resources are freely available:
- HuggingFace Models: mjbommar/binary-tokenizer-001-*
- GitHub Repository: Complete source code and training scripts
- Paper Repository: binary-tokenizer-paper
- Dataset: Binary-30K
Related Projects
- Binary-30K Dataset: Companion dataset with 30,000 pre-tokenized binaries
- bbpe: Rust implementation for training custom binary tokenizers
- Future work includes transformer baselines for malware classification and binary similarity
Impact on Binary Analysis
Binary BPE establishes that:
- Learned tokenization significantly outperforms raw bytes for sequence models
- Cross-platform patterns can be discovered without explicit programming
- Vocabulary scaling provides flexible deployment from edge to datacenter
- Open-source tokenizers enable reproducible research in binary analysis
The tokenizer family provides a drop-in foundation for binary-focused language models, enabling up to 3× more efficient transformer-based analysis compared to raw byte processing.