Binary BPE Tokenizers
Cross-platform Byte Pair Encoding tokenizers for binary executables, enabling 2-3× more efficient transformer-based binary analysis across ELF, PE, Mach-O, and APK formats
The first cross-platform Byte Pair Encoding (BPE) tokenizer family designed specifically for binary executables, trained on 24GB of diverse binaries spanning multiple platforms, architectures, and operating systems.
Research Impact
Binary BPE addresses a fundamental bottleneck in transformer-based binary analysis: raw byte-level tokenization wastes precious context window capacity. The tokenizer family achieves:
- 2-3× compression on typical uncompressed binaries (ELF/PE/Mach-O)
- 2.89 bytes/token average compression ratio with the 64K vocabulary
- 67.2% vocabulary utilization (43,945 of 65,536 tokens actively used)
- Cross-platform pattern learning without supervision
Publication
“Learning to Tokenize Binaries: Cross-Platform BPE for Transformer-Based Binary Analysis”
- Author: Michael J. Bommarito II
- Published: November 2025
- Availability: arXiv submission pending
Tokenizer Family
The Binary BPE family includes five vocabulary sizes with a perfect nested hierarchy (each smaller vocabulary is a strict prefix of larger ones):
Vocabulary Sizes
- binary-tokenizer-001-4k: 4,096 tokens - Resource-constrained edge devices
- binary-tokenizer-001-8k: 8,192 tokens - Embedded malware scanners
- binary-tokenizer-001-16k: 16,384 tokens - Balanced research prototypes
- binary-tokenizer-001-32k: 32,768 tokens - Cloud-based analysis
- binary-tokenizer-001-64k: 65,536 tokens - Maximum compression for datacenters
All tokenizers are available on Hugging Face.
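Assuming each Hub repository hosts a standard tokenizer.json (the repository name below follows the mjbommar/binary-tokenizer-001-* pattern noted under Availability, and is otherwise an assumption), a family member can be loaded directly:

```python
from tokenizers import Tokenizer

# Assumed repository name following the mjbommar/binary-tokenizer-001-* pattern
tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-64k")
print(tokenizer.get_vocab_size())  # expected: 65536
```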
Training Data
Trained on the Binary-30K dataset:
- 30,000 unique binaries totaling 24GB
- Platform coverage: Linux (Alpine, Debian, Ubuntu), Windows (8/10/11), macOS, Android APKs
- Architecture coverage: x86-64, x86-32, ARM64, ARM32, MIPS, RISC-V
- File formats: ELF, PE, Mach-O, APK
- Includes malware samples from SOREL-20M and Malware Bazaar
Key Features
Unsupervised Pattern Discovery
Without explicit supervision, the tokenizer learns (see the probe sketch after this list):
- File format structures: ELF magic numbers (\x7fELF), PE headers (MZ), Mach-O signatures
- Architecture-specific patterns: x86-64 REX prefixes, instruction encodings (27.5% of vocabulary)
- Cross-platform sequences: Null padding, ASCII library paths, dynamic linking strings
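A quick way to probe this, assuming the 64K tokenizer JSON is available locally and following the hex-encoding convention used in the usage example below: a well-learned magic number should compress to very few tokens.

```python
from tokenizers import Tokenizer

# Assumed local file name; see the usage example below for the same convention
tokenizer = Tokenizer.from_file("tokenizer-65536.json")

# File-format magic numbers the vocabulary is reported to capture
magics = {
    "ELF": b"\x7fELF",
    "PE": b"MZ",
    "Mach-O (64-bit)": b"\xcf\xfa\xed\xfe",
}

for name, magic in magics.items():
    # Hex-encode the bytes to match the string-based encode() API
    ids = tokenizer.encode(magic.hex()).ids
    print(f"{name}: {len(magic)} bytes -> {len(ids)} token(s)")
```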
Platform Performance
| Platform | Bytes/Token | Vocab Coverage | Median Tokens per Binary |
|---|---|---|---|
| Linux | 2.95 | 71.3% | 52,341 |
| Windows | 2.87 | 68.9% | 48,127 |
| macOS | 2.67 | 65.4% | 41,892 |
| Android | 1.85 | 52.1% | 127,563 |
Vocabulary Size Scaling
| Vocab Size | Bytes/Token | Improvement vs. 4K | Tokens Used (Utilization) |
|---|---|---|---|
| 4K (2¹²) | 2.01 | baseline | 3,892 (95%) |
| 8K (2¹³) | 2.21 | +10.0% | 7,234 (88%) |
| 16K (2¹⁴) | 2.41 | +19.9% | 13,567 (83%) |
| 32K (2¹⁵) | 2.64 | +31.3% | 24,891 (76%) |
| 64K (2¹⁶) | 2.89 | +43.8% | 43,945 (67%) |
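The Improvement column follows arithmetically from the bytes/token column; a quick check:

```python
# Relative improvement in bytes/token over the 4K baseline (2.01)
baseline = 2.01
for vocab, bpt in [("8K", 2.21), ("16K", 2.41), ("32K", 2.64), ("64K", 2.89)]:
    print(f"{vocab}: +{(bpt / baseline - 1) * 100:.1f}%")
# 8K: +10.0%  16K: +19.9%  32K: +31.3%  64K: +43.8%
```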
Technical Advantages
Nested Hierarchy
The tokenizer family exhibits perfect nesting: token IDs 0-4095 in the 4K tokenizer map to identical byte sequences in all larger vocabularies (a verification sketch follows this list). This enables:
- Embedding transfer when scaling model capacity
- Progressive development from prototypes to production
- Cost-effective experimentation across resource constraints
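A minimal way to check the nesting property, assuming the 4K and 64K tokenizer JSON files have been downloaded locally (the file names are illustrative):

```python
from tokenizers import Tokenizer

# Illustrative local file names for two members of the family
small = Tokenizer.from_file("tokenizer-4096.json")
large = Tokenizer.from_file("tokenizer-65536.json")

# Strict-prefix nesting means every ID in the small vocabulary
# maps to the same token string in the larger vocabulary
mismatches = sum(
    small.id_to_token(i) != large.id_to_token(i)
    for i in range(small.get_vocab_size())
)
print(f"Mismatched IDs: {mismatches}")  # expected: 0
```

Because the IDs agree, embeddings trained against the 4K vocabulary can directly initialize the first 4,096 rows of a larger model's embedding table.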
Context Window Efficiency
At an average of roughly 2.6 bytes/token (the arithmetic is sketched after this list):
- 8,192-token context → ~21KB of binary content
- 32,768-token context → ~84KB of binary content
- 128,000-token context → ~330KB of binary content
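These figures follow from multiplying context length by the average compression ratio (small differences from the list above reflect rounding):

```python
# Approximate binary payload per context window at ~2.6 bytes/token
for ctx in (8_192, 32_768, 128_000):
    print(f"{ctx:>7} tokens -> ~{ctx * 2.6 / 1000:.0f} KB")
# prints roughly 21, 85, and 333 KB respectively
```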
Format and Architecture Support
- File formats: ELF (Linux), PE (Windows), Mach-O (macOS), APK (Android)
- Architectures: x86, x86-64, ARM, ARM64, MIPS, RISC-V
- Handles: Stripped binaries, obfuscated code, mixed code/data sections
Applications
The Binary BPE family enables efficient transformer-based analysis for:
- Malware detection and classification
- Vulnerability discovery in proprietary software
- Binary similarity detection and plagiarism analysis
- Function-purpose identification in stripped binaries
- Reverse engineering assistance in tools like Ghidra and IDA Pro
- Binary optimization and dead code detection
Implementation
Rust Training Implementation
The bbpe Rust package provides:
- Chunk-based processing with entropy filtering
- 8KB chunk size with a 7.0 bits/byte maximum entropy threshold (sketched in Python below)
- HuggingFace-compatible tokenizer JSON output
- Open-source training pipeline for custom vocabularies
Available at: github.com/mjbommar/binary-bpe
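The entropy filter drops chunks that look compressed or encrypted, since near-random data yields no reusable byte patterns for BPE to merge. A minimal Python sketch of the idea, using the 8KB chunk size and 7.0 bits/byte threshold reported above; the actual bbpe pipeline is implemented in Rust and its exact filtering rule may differ:

```python
import math
from collections import Counter

CHUNK_SIZE = 8 * 1024  # 8KB chunks, as in the bbpe pipeline
MAX_ENTROPY = 7.0      # bits/byte; higher suggests packed/encrypted data

def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte."""
    counts = Counter(chunk)
    total = len(chunk)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def training_chunks(data: bytes):
    """Yield low-entropy chunks suitable for BPE training."""
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        if chunk and shannon_entropy(chunk) <= MAX_ENTROPY:
            yield chunk
```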
Usage Example
```python
from tokenizers import Tokenizer

# Load the production tokenizer (64K vocabulary)
tokenizer = Tokenizer.from_file("tokenizer-65536.json")

# Read the raw bytes of the target binary
with open("binary_file", "rb") as f:
    binary_data = f.read()

# Hex-encode the bytes for the string-based encode() API, then tokenize
tokens = tokenizer.encode(binary_data.hex())

print(f"Compressed {len(binary_data)} bytes → {len(tokens.ids)} tokens")
print(f"Compression ratio: {len(binary_data) / len(tokens.ids):.2f} bytes/token")
```
Availability
All resources are freely available:
- HuggingFace Models: mjbommar/binary-tokenizer-001-*
- GitHub Repository: Complete source code and training scripts
- Paper Repository: binary-tokenizer-paper
- Dataset: Binary-30K
Related Projects
- Binary-30K Dataset: Companion dataset with 30,000 pre-tokenized binaries
- bbpe: Rust implementation for training custom binary tokenizers
- Future work includes transformer baselines for malware classification and binary similarity
Impact on Binary Analysis
Binary BPE establishes that:
- Learned tokenization significantly outperforms raw bytes for sequence models
- Cross-platform patterns can be discovered without explicit programming
- Vocabulary scaling provides flexible deployment from edge to datacenter
- Open-source tokenizers enable reproducible research in binary analysis
The tokenizer family provides a drop-in foundation for binary-focused language models, enabling up to 3× more efficient transformer-based analysis compared to raw byte processing.