mimelens: small encoders for content-type detection on any byte window

if you index, triage, scan, or classify file contents and you do not always have the whole file, mimelens is for you. it is built for workloads such as carving forensic images, classifying packet payloads, mime-typing a streaming http body before the upload finishes, tagging a random 4 kb slice of a 50 gb database, or guessing what a header-stripped malware sample is.

draft paper (pdf): /papers/mimelens-2026-draft.pdf — full account with the cube, the magika comparison, the adversarial sweep, and the real-network appendix.

what it is

mimelens is a family of small (3.15–37.8 m parameter) bert-style encoders pretrained mlm-only on 33 gb of heterogeneous binary content with 1024-token windows sampled uniformly at random across files and 64 kb fragments — no privileged “head of file” position. a single checkpoint therefore classifies any 4 kb byte window: a streaming http body before upload completes, a forensic-carved fragment with no recoverable header, a random seek into a multi-gigabyte container, or a packet payload inspected mid-stream.
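a minimal sketch of what "position-arbitrary" means at inference time: pick a uniformly random offset and read one 4 kb window from there. the name random_window and the seeded random.Random are illustrative assumptions, not part of the mimelens release, and this is not the pretraining sampler described above.

```python
import os
import random

WINDOW_BYTES = 4096  # the 4 kb window size quoted above


def random_window(path: str, rng: random.Random, size: int = WINDOW_BYTES) -> bytes:
    """Return up to `size` bytes starting at a uniformly random offset in the file."""
    file_len = os.path.getsize(path)
    # Files shorter than one window have exactly one valid offset: 0.
    max_offset = max(file_len - size, 0)
    offset = rng.randint(0, max_offset)  # randint is inclusive on both ends
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read(size)
```

a window obtained this way would then be tokenized and classified exactly like a head-of-file window.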

to our knowledge no published learned file-type classifier targets this position-arbitrary regime at libmagic’s 125-class mime resolution. whole-file detectors (magika, libmagic) and signature-based forensic tools (trid, siegfried/droid) require either a known file boundary or pre-built signatures.

the cube

the paper runs a pre-registered 3×4×2 factorial:

  • size: tiny (3.15 m) / small (10.8 m) / medium (37.8 m)
  • vocabulary: raw bytes (256) or binary-tokenizer-001 bpe at 4k / 16k / 64k
  • seed: n=3 at medium, n=2 at tiny/small (28 checkpoints total)

all cells hold model architecture, parameter count, optimizer, schedule, and total-bytes-seen constant. differences in tokens seen are expected; differences in bytes seen would be a bug.

headline numbers

mime-125 classification on a 4 kb head

on a 4,096-file held-out test set (mime-125, derived from the magic-bpe corpus), mimelens-medium-bpe-16k reaches top-1 0.833 and macro-f1 0.731 (n=2 seed mean). magika v1.1 on the same files scores 0.641 strict; a curated 21-class equivalence map (text/x-python → text/x-script.python, etc.) lifts magika to 0.717 aligned.
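the difference between strict and aligned scoring can be sketched as follows. this is an illustration only: the map below holds the single example pair mentioned above (the released map covers 21 classes), the function name accuracy is hypothetical, and the sample labels are made up for the demonstration.

```python
from typing import Dict, Optional, Sequence

# One illustrative entry; the released equivalence map has 21 classes.
EQUIVALENT: Dict[str, str] = {"text/x-python": "text/x-script.python"}


def accuracy(pred: Sequence[str], gold: Sequence[str],
             equiv: Optional[Dict[str, str]] = None) -> float:
    """Top-1 accuracy; with `equiv`, predictions are renamed before comparison."""
    equiv = equiv or {}
    hits = sum(equiv.get(p, p) == g for p, g in zip(pred, gold))
    return hits / len(gold)


# Toy labels (not from the paper) showing how alignment can only raise the score.
pred = ["text/x-python", "image/png", "text/plain"]
gold = ["text/x-script.python", "image/png", "application/json"]
strict_score = accuracy(pred, gold)               # exact string match
aligned_score = accuracy(pred, gold, EQUIVALENT)  # after applying the map
```

strict scoring penalizes a prediction that names the right type under a different taxonomy; aligned scoring removes that penalty and leaves only genuine misclassifications.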

the 19.2-pp strict gap decomposes into 7.5 pp of label-taxonomy mismatch and an 11.6-pp capability gap at fine granularity. magika is 0.4 pp ahead at top-level (text / image / application). we cross-checked 500 random labels against a fresh file(1) (libmagic 5.46) at 0.974 agreement, bounding labeling-pipeline drift.
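the decomposition is plain subtraction over the three accuracies quoted above. a small worked check, using only the rounded figures from this page: it gives 19.2 / 7.6 / 11.6 pp, so the 7.5 pp quoted for the taxonomy component presumably comes from unrounded values (an assumption on our part).

```python
mimelens_top1 = 0.833   # mimelens-medium-bpe-16k, top-1
magika_strict = 0.641   # magika, exact label match
magika_aligned = 0.717  # magika, after the 21-class equivalence map

# All gaps in percentage points.
strict_gap = (mimelens_top1 - magika_strict) * 100
taxonomy_gap = (magika_aligned - magika_strict) * 100     # recovered by relabeling
capability_gap = (mimelens_top1 - magika_aligned) * 100   # remains after relabeling

print(round(strict_gap, 1), round(taxonomy_gap, 1), round(capability_gap, 1))
```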

real packet captures (appendix d)

500 held-out magic-files transmitted as udp datagrams over loopback at 1448-byte payloads, captured with tcpdump, classified at cumulative packet thresholds k ∈ {1, 2, 3, 5, 10, all} from the raw pcap:

k (cumulative payload) | byte   | bpe-4k | bpe-16k | bpe-64k | magika | libmagic 5.46 | trid 2.24
1 (1.4 kb)             | 0.8554 | 0.8153 | 0.8092  | 0.7450  | 0.5884 | 0.7390        | 0.7129
3 (4.3 kb)             | 0.8554 | 0.8213 | 0.8213  | 0.7631  | 0.6185 | 0.7912        | 0.7289
all (entire stream)    | 0.8554 | 0.8213 | 0.8213  | 0.7631  | 0.6185 | 0.7912        | 0.7289

mimelens-medium-byte reaches its asymptote (85.5%) at a single 1.4 kb packet — exceeding magika’s entire-stream accuracy (61.9%) by 23.6 pp, libmagic 5.46 run on the same prefix at k=all (79.1%) by 6.4 pp, and trid 2.24’s self-consistent identification at k=all (72.9%) by 12.6 pp.
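the margins can be recomputed from the four-decimal accuracies in the table. the sketch below does that; it yields roughly 23.69, 6.42, and 12.65 pp, while the figures quoted above match differences of the one-decimal percentages (85.5 − 61.9, 85.5 − 79.1, 85.5 − 72.9). variable names are illustrative.

```python
# Accuracy of mimelens-medium-byte after a single packet (k=1).
byte_k1 = 0.8554

# Baseline accuracies on the entire stream (k=all), from the table.
baselines_all = {
    "magika": 0.6185,
    "libmagic 5.46": 0.7912,
    "trid 2.24": 0.7289,
}

# Margin in percentage points, kept to two decimals.
margins = {name: round((byte_k1 - acc) * 100, 2) for name, acc in baselines_all.items()}

for name, pp in margins.items():
    print(f"{name}: +{pp} pp")
```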

practically speaking, the byte cell is a 1022-byte classifier under the seq_len=1024 setting: it consumes the first 1022 tokens of the 4096-byte window, and at byte-vocabulary that means the first 1022 bytes. a single 1448-byte udp datagram already covers all of them. bpe cells climb from k=1 to k=3 as the 4 kb window fills with non-padding bytes, then plateau.
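the packet arithmetic behind this can be sketched directly: one 1448-byte payload already covers 1022 bytes, and three payloads (4344 bytes) are the first threshold that covers the full 4096-byte window. the helper names below are hypothetical, and cumulative_prefix is an assumed reading of "cumulative packet threshold", not the released harness.

```python
import math
from typing import Optional, Sequence

PAYLOAD_BYTES = 1448     # per-datagram payload in appendix d
WINDOW_BYTES = 4096      # classification window
BYTE_CELL_TOKENS = 1022  # tokens consumed under seq_len=1024 (1 token = 1 byte)


def packets_to_cover(n_bytes: int, payload: int = PAYLOAD_BYTES) -> int:
    """Smallest number of full payloads whose total length reaches n_bytes."""
    return math.ceil(n_bytes / payload)


def cumulative_prefix(payloads: Sequence[bytes], k: Optional[int],
                      window: int = WINDOW_BYTES) -> bytes:
    """Concatenate the first k payloads (all of them if k is None), cut to one window."""
    chosen = payloads if k is None else payloads[:k]
    return b"".join(chosen)[:window]


packets_for_byte_cell = packets_to_cover(BYTE_CELL_TOKENS)  # byte cell input
packets_for_window = packets_to_cover(WINDOW_BYTES)         # full 4 kb window
```

under these assumptions the byte cell has nothing more to gain after k=1, and the bpe cells have nothing more to gain after k=3, which matches the plateaus in the table.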

adversarial robustness inversion

under directed perturbations of the 4 kb head (zero first 4 / 16 / 64 bytes; random first 4 bytes), the clean-input ordering reverses:

  • bpe-64k drops only 2–7 pp
  • byte and bpe-16k drop 2–16 pp, with the gap widening as more head bytes are corrupted

the worst clean cell is the most robust under attack. for header-corruption-prone deployment — packed binaries, truncated samples, intentional obfuscation — use bpe-64k.
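the perturbations listed above are simple to reproduce on any window. a minimal sketch, assuming the perturbation replaces bytes in place and leaves the window length unchanged; the function names are illustrative, not taken from the released harness.

```python
import random


def zero_head(window: bytes, n: int) -> bytes:
    """Overwrite the first n bytes with zeros, keeping the length unchanged."""
    n = min(n, len(window))
    return bytes(n) + window[n:]


def randomize_head(window: bytes, n: int, rng: random.Random) -> bytes:
    """Overwrite the first n bytes with uniformly random byte values."""
    n = min(n, len(window))
    return bytes(rng.randrange(256) for _ in range(n)) + window[n:]


# Example: wipe the magic number of a PNG signature.
png_signature = b"\x89PNG\r\n\x1a\n"
wiped = zero_head(png_signature, 4)
scrambled = randomize_head(png_signature, 4, random.Random(0))
```

running a classifier on zero_head(window, n) for n in (4, 16, 64) and on randomize_head(window, 4, rng) gives the four conditions of the sweep.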

cost

onnx-int8 cpu latency is 547 ms/sample vs magika’s 1.58 ms (348× slower). for millisecond-scale whole-file triage on broad categories, magika remains the right tool; mimelens serves a different point on the deployment surface.

deployment regimes

need | tool
whole-file broad-category triage, millisecond-scale cpu latency | magika
fine-grained libmagic-taxonomy classification from a 4 kb chunk | mimelens-medium-bpe-16k or mimelens-medium-byte
streaming / packet-payload / forensic-fragment / random-seek inputs | mimelens-medium-byte (85.5% top-1 from a single 1.4 kb udp packet on a real tcpdump capture; see pcap how-to)
header-corrupted, packed, or truncated inputs | mimelens-medium-bpe-64k (most adversarially robust)
knn retrieval over a chunk store | mimelens-medium-byte (most robust within-cube finding under n=3 seeds)
latency-bounded large-scale indexing | mimelens-medium-bpe-64k (2.08× throughput over byte)

within-cube findings

five pre-registered hypotheses, one clean falsification:

  1. the byte-vs-best-bpe bits-per-byte gap shrinks with scale: 14 / 6.6 / 5.6 pp at tiny / small / medium. consistent with byte latent transformer findings replicated at sub-100 m parameters.
  2. the byte-vs-bpe classification winner is scale-dependent: bpe-64k wins at small, regresses at medium. a matched-tokens-seen ablation training bpe-64k for 2.087× more steps recovers only +0.78 pp, falsifying the rare-token-undertraining explanation.
  3. raw bytes win knn r@1 at every scale — the most robust within-cube finding under n=3.
  4. methodological pitfall verified across all 28 checkpoints: in mlm-only probe evaluation the cls_pool layer receives no gradient and remains byte-identical to initialization, with ‖W^trained − W^init‖_∞ = 0 cube-wide. probes that use the cls token here are reading random projections, not learned representations. use mean-pool over body tokens.
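the pitfall in item 4 and its workaround can be sketched without any framework. this is a dependency-free illustration on plain lists; the function names and the is_body mask are assumptions, and with torch tensors the first check would typically be written as (trained - init).abs().max().

```python
from typing import List, Sequence


def max_abs_diff(trained: Sequence[float], init: Sequence[float]) -> float:
    """Infinity norm of (trained - init) over flattened weights; 0.0 means untouched."""
    if len(trained) != len(init):
        raise ValueError("weight vectors must have the same length")
    return max((abs(a - b) for a, b in zip(trained, init)), default=0.0)


def mean_pool_body(hidden: Sequence[Sequence[float]],
                   is_body: Sequence[bool]) -> List[float]:
    """Average hidden states over positions flagged as body tokens only."""
    rows = [h for h, keep in zip(hidden, is_body) if keep]
    if not rows:
        raise ValueError("no body tokens to pool")
    dim = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]


# Toy sequence: position 0 stands in for cls, position 3 for padding.
hidden_states = [[9.0, 9.0], [1.0, 3.0], [3.0, 5.0], [0.0, 0.0]]
body_mask = [False, True, True, False]
pooled = mean_pool_body(hidden_states, body_mask)
```

if max_abs_diff returns 0.0 for a pooling layer, that layer never received a gradient and any probe reading it sees its initialization; mean_pool_body avoids the layer entirely.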

what’s released

  • all 28 checkpoints, in safetensors
  • onnx exports for each medium cell
  • the evaluation harness
  • the 21-class taxonomy-equivalence map used in the magika comparison
  • the pre-registration log
  • bootstrap-ci distribution over the probe ridge
  • adversarial outputs
  • cpu latency benchmark
  • per-stream packet-classification jsonl (appendix d)

quickstart

# shell: uv 0.11+ and python 3.12
uv sync
uv run python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

# python: load a checkpoint and tokenize one window
from binary_embedding.models.encoder import BinaryEncoder, medium_encoder_config
from safetensors.torch import load_file
from binary_embedding import _native

variant_vocab = 16391  # bpe-16k (16384 merges + 7 specials)
cfg = medium_encoder_config(vocab_size=variant_vocab, max_seq_len=1024)
model = BinaryEncoder(cfg)
# strict=False tolerates key mismatches between the checkpoint and the module
model.load_state_dict(load_file("mimelens-001-medium-bpe-16k-s1.safetensors"), strict=False)
model.eval()  # inference mode: disables dropout
tok = _native.BinaryTokenizer.from_file("binary-tokenizer-001-16k.bin")

# classify any 4 kb byte window: head, middle, tail, random offset, all the same
with open("path/to/some/file", "rb") as fh:
    window = fh.read(4096)
ids = list(tok.encode(window))[:1022]  # keep at most 1022 tokens (max_seq_len=1024)
# ... mean-pool body tokens, run through a probe / fine-tuning head

honest caveats

  • the seed count is small (n=3 at medium, n=2 at smaller sizes). the robust within-cube findings (byte r@1 wins at medium, bpe wins mlm bpb at every cell, bpe-64k regression at medium) survive the bootstrap; marginal orderings (byte vs bpe-16k tie at top of medium) are at the edge of detectability and reported as ties.
  • single corpus. all cells trained on one 33 gb stratified multi-source binary corpus. results may not transfer to corpora with substantially different content-type composition.
  • cpu latency is ~348× worse than magika. dynamic int8 quantization regressed vs pytorch fp32 on a cpu without avx-vnni; static quantization and modern hardware would help but are unlikely to close the gap to whole-file triage tools.
  • label cross-check vs fresh file(1) at 0.974 agreement bounds labeling-pipeline drift; an independent non-libmagic ground-truth source (siegfried, droid) is the next step.
  • loopback is not the internet. the pcap appendix runs on lo with deterministic in-order delivery, no jitter, no congestion. real wan conditions add packet loss, reordering, and retransmissions. encrypted transports (tls, quic) trivially defeat byte-level classification by design — mimelens is for cleartext-payload regimes.

sibling work

  • binary-bpe — the bbpe rust crate that trains and runs the bpe tokenizers used in three of the four cube cells.
  • binary-tokenizer-paper — the published methodology paper introducing the binary-tokenizer-001 family of bpe tokenizers.
  • binary-30k — the heterogeneous binary-file dataset that anchors most of the training corpus.

also see

  • pcap classification how-to — applied walkthrough: capture udp/tcp with tcpdump, parse with scapy, classify each cumulative prefix with mimelens-medium-byte.