
Binary BPE Tokenizers

software

Cross-platform Byte Pair Encoding tokenizers for binary executables, enabling 2-3× more efficient transformer-based binary analysis across ELF, PE, Mach-O, and APK formats

period: 2025-present
tech:
Machine Learning · Binary Analysis · Natural Language Processing · Malware Detection

The first cross-platform Byte Pair Encoding (BPE) tokenizer family designed specifically for binary executables, trained on 24GB of diverse binaries spanning multiple platforms, architectures, and operating systems.

Research Impact

Binary BPE addresses a fundamental bottleneck in transformer-based binary analysis: raw byte-level tokenization wastes precious context window capacity. The tokenizer family achieves:

  • 2-3× compression on typical uncompressed binaries (ELF/PE/Mach-O)
  • 2.89 bytes/token average compression ratio
  • 67.2% vocabulary utilization (43,945 of 65,536 tokens actively used)
  • Cross-platform pattern learning without supervision

Publication

“Learning to Tokenize Binaries: Cross-Platform BPE for Transformer-Based Binary Analysis”

  • Author: Michael J. Bommarito II
  • Published: November 2025
  • Preprint: arXiv submission pending

Tokenizer Family

The Binary BPE family includes five vocabulary sizes with a perfect nested hierarchy (each smaller vocabulary is a strict prefix of larger ones):

Vocabulary Sizes

  • binary-tokenizer-001-4k: 4,096 tokens - Resource-constrained edge devices
  • binary-tokenizer-001-8k: 8,192 tokens - Embedded malware scanners
  • binary-tokenizer-001-16k: 16,384 tokens - Balanced research prototypes
  • binary-tokenizer-001-32k: 32,768 tokens - Cloud-based analysis
  • binary-tokenizer-001-64k: 65,536 tokens - Maximum compression for datacenters

All tokenizers are available on Hugging Face.

Training Data

Trained on the Binary-30K dataset:

  • 30,000 unique binaries totaling 24GB
  • Platform coverage: Linux (Alpine, Debian, Ubuntu), Windows (8/10/11), macOS, Android APKs
  • Architecture coverage: x86-64, x86-32, ARM64, ARM32, MIPS, RISC-V
  • File formats: ELF, PE, Mach-O, APK
  • Includes malware samples from SOREL-20M and Malware Bazaar

Key Features

Unsupervised Pattern Discovery

Without explicit supervision, the tokenizer learns:

  • File format structures: ELF magic numbers (\x7fELF), PE headers (MZ), Mach-O signatures
  • Architecture-specific patterns: x86-64 REX prefixes, instruction encodings (27.5% of vocabulary)
  • Cross-platform sequences: Null padding, ASCII library paths, dynamic linking strings
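The file-format signatures the tokenizer discovers are the same magic bytes a rule-based detector would check. As a point of comparison, a minimal Python sketch of such a detector (not part of Binary BPE; the Mach-O variants and function name are illustrative):

```python
def detect_format(data: bytes) -> str:
    """Identify an executable's container format from its leading magic bytes."""
    if data.startswith(b"\x7fELF"):
        return "ELF"     # Linux executables and shared objects
    if data.startswith(b"MZ"):
        return "PE"      # Windows DOS/PE stub header
    if data[:4] in (b"\xfe\xed\xfa\xce", b"\xfe\xed\xfa\xcf",
                    b"\xce\xfa\xed\xfe", b"\xcf\xfa\xed\xfe"):
        return "Mach-O"  # 32/64-bit, both byte orders
    if data.startswith(b"PK\x03\x04"):
        return "APK"     # APKs are ZIP archives
    return "unknown"
```

The point of the unsupervised result is that BPE recovers these same prefixes purely from merge statistics, with no such table written by hand.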

Platform Performance

Platform   Bytes/Token   Vocab Coverage   Median Tokens
Linux      2.95          71.3%            52,341
Windows    2.87          68.9%            48,127
macOS      2.67          65.4%            41,892
Android    1.85          52.1%            127,563

Vocabulary Size Scaling

Vocab Size   Bytes/Token   Improvement   Tokens Used
4K (2¹²)     2.01          baseline      3,892 (95%)
8K (2¹³)     2.21          +10.0%        7,234 (88%)
16K (2¹⁴)    2.41          +19.9%        13,567 (83%)
32K (2¹⁵)    2.64          +31.3%        24,891 (76%)
64K (2¹⁶)    2.89          +43.8%        43,945 (67%)

Technical Advantages

Nested Hierarchy

The tokenizer family exhibits perfect nesting: token IDs 0-4095 in the 4K tokenizer map to identical byte sequences in all larger vocabularies. This enables:

  • Embedding transfer when scaling model capacity
  • Progressive development from prototypes to production
  • Cost-effective experimentation across resource constraints
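Embedding transfer follows directly from the nesting property: rows 0-4095 of a larger model's embedding table can be copied from a model trained with the 4K tokenizer. A minimal sketch, assuming NumPy arrays and a hypothetical `grow_embeddings` helper (the initialization scale is an assumption, not from the source):

```python
import numpy as np

def grow_embeddings(small_emb: np.ndarray, new_vocab: int,
                    rng: np.random.Generator) -> np.ndarray:
    """Seed a larger embedding table from one trained on a nested sub-vocabulary.

    Because the small vocabulary is a strict prefix of the larger one, token IDs
    0..old_vocab-1 denote the same byte sequences in both, so their trained rows
    transfer verbatim; only the new rows need fresh initialization.
    """
    old_vocab, dim = small_emb.shape
    large_emb = rng.normal(0.0, 0.02, size=(new_vocab, dim))  # init new rows
    large_emb[:old_vocab] = small_emb  # copy trained rows unchanged
    return large_emb

rng = np.random.default_rng(0)
emb_4k = rng.normal(size=(4096, 256))          # stand-in for trained embeddings
emb_64k = grow_embeddings(emb_4k, 65536, rng)  # scale to the 64K vocabulary
```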

Context Window Efficiency

At an average of ~2.6 bytes/token:

  • 8,192-token context → ~21KB of binary content
  • 32,768-token context → ~84KB of binary content
  • 128,000-token context → ~330KB of binary content
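These capacities follow directly from the average ratio (here using 1 KB = 1024 bytes; the rounded figures above differ by at most a kilobyte or so depending on the exact per-file ratio):

```python
BYTES_PER_TOKEN = 2.6  # family-wide average reported above

def context_capacity_kb(context_tokens: int) -> float:
    """Approximate binary payload (in KB) that fits in a context window."""
    return context_tokens * BYTES_PER_TOKEN / 1024

for window in (8_192, 32_768, 128_000):
    print(f"{window:>7} tokens → ~{context_capacity_kb(window):.0f} KB")
```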

Format and Architecture Support

  • File formats: ELF (Linux), PE (Windows), Mach-O (macOS), APK (Android)
  • Architectures: x86, x86-64, ARM, ARM64, MIPS, RISC-V
  • Handles: Stripped binaries, obfuscated code, mixed code/data sections

Applications

The Binary BPE family enables efficient transformer-based analysis for:

  • Malware detection and classification
  • Vulnerability discovery in proprietary software
  • Binary similarity detection and plagiarism analysis
  • Function-purpose identification in stripped binaries
  • Reverse engineering assistance in tools like Ghidra and IDA Pro
  • Binary optimization and dead code detection

Implementation

Rust Training Implementation

The bbpe Rust package provides:

  • Chunk-based processing with entropy filtering
  • 8KB chunk size with 7.0 bits/byte maximum entropy threshold
  • HuggingFace-compatible tokenizer JSON output
  • Open-source training pipeline for custom vocabularies
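The entropy filter described above can be sketched in a few lines of Python (the actual bbpe implementation is Rust; function names here are illustrative, but the 8KB chunk size and 7.0 bits/byte cutoff are the documented defaults):

```python
import math
from collections import Counter

CHUNK_SIZE = 8 * 1024  # 8KB chunks, per the bbpe defaults
MAX_ENTROPY = 7.0      # bits/byte; higher suggests packed or encrypted data

def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte chunk, in bits per byte (0.0 to 8.0)."""
    n = len(chunk)
    counts = Counter(chunk)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def trainable_chunks(data: bytes):
    """Yield fixed-size chunks whose entropy is low enough to train on.

    Near-random chunks (compressed or encrypted sections) carry no learnable
    byte-level structure, so they are filtered out before BPE training.
    """
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i : i + CHUNK_SIZE]
        if chunk and shannon_entropy(chunk) <= MAX_ENTROPY:
            yield chunk
```

Filtering at 7.0 bits/byte keeps structured code and data sections while discarding chunks that are statistically indistinguishable from noise.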

Available at: github.com/mjbommar/binary-bpe

Usage Example

from tokenizers import Tokenizer

# Load production tokenizer (64K vocabulary)
tokenizer = Tokenizer.from_file("tokenizer-65536.json")

# Read the raw binary
with open("binary_file", "rb") as f:
    binary_data = f.read()

# Tokenize the byte stream (the string fed to encode() must use the same
# byte-to-text representation the tokenizer was trained with)
tokens = tokenizer.encode(binary_data.hex())
print(f"Compressed {len(binary_data)} bytes → {len(tokens.ids)} tokens")
print(f"Compression ratio: {len(binary_data) / len(tokens.ids):.2f} bytes/token")

Availability

All resources are freely available:

  • Binary-30K Dataset: Companion dataset with 30,000 pre-tokenized binaries
  • bbpe: Rust implementation for training custom binary tokenizers
  • Future work includes transformer baselines for malware classification and binary similarity

Impact on Binary Analysis

Binary BPE establishes that:

  1. Learned tokenization significantly outperforms raw bytes for sequence models
  2. Cross-platform patterns can be discovered without explicit programming
  3. Vocabulary scaling provides flexible deployment from edge to datacenter
  4. Open-source tokenizers enable reproducible research in binary analysis

The tokenizer family provides a drop-in foundation for binary-focused language models, enabling 3× more efficient transformer-based analysis compared to raw byte processing.
