MimeLens: Pretrained Encoders for Fine-Grained Content-Type Detection

paper
authors: Bommarito, M. J.
year: 2026
venue: Working paper (draft)
details: Draft manuscript. The study trains a pre-registered 3x4x2 factorial cube of small BERT-style encoders (3.15-37.8M parameters), pretrained with MLM only on 33 GB of heterogeneous binary content, using 1024-token windows sampled uniformly at random across files and 64 KB fragments. The target task is 125-class libmagic MIME classification on any 4 KB byte window: streaming HTTP bodies, forensic-carved fragments with no header, random seeks into multi-GB containers, and packet payloads inspected mid-stream. Headline result: MimeLens-medium-bpe-16k reaches top-1 0.833 and macro-F1 0.731, versus Magika v1.1 at 0.641 (strict) and 0.717 (aligned) on the same 4 KB head. On real UDP packet captures (500 magic-files over loopback, classified from a tcpdump pcap), MimeLens-medium-byte reaches 85.5% top-1 on a single 1.4 KB packet, exceeding Magika on the entire stream (61.9%), libmagic 5.46 on the same prefix (79.1%), and TrID 2.24 self-consistent (72.9%). Under directed head-byte corruption the clean-input ordering inverts, and bpe-64k is the most adversarially robust variant. CPU latency is 547 ms/sample versus 1.58 ms for Magika (348x slower), so the two tools serve different points on the deployment surface. All 28 checkpoints, ONNX exports, the evaluation harness, the taxonomy-equivalence map, the pre-registration log, and the per-stream packet-classification outputs are released.
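
To make the "any 4 KB byte window" input regime concrete, the following is a minimal sketch of drawing one such window from a file at a uniformly random offset. It is an illustration only, not the paper's sampling or evaluation code: the function name `sample_window`, the handling of files shorter than one window, and the use of Python's `random` module are all assumptions introduced here.

```python
import os
import random

WINDOW_SIZE = 4096  # 4 KB, matching the window size quoted in the abstract


def sample_window(path, size=WINDOW_SIZE, rng=None):
    """Return (offset, data): up to `size` bytes read from a random offset.

    Hypothetical helper for illustration; not part of the MimeLens release.
    """
    if rng is None:
        rng = random.Random()
    file_size = os.path.getsize(path)
    # Assumption of this sketch: a file shorter than one window is read
    # whole from offset 0 rather than padded or skipped.
    max_offset = max(file_size - size, 0)
    offset = rng.randint(0, max_offset)  # inclusive on both ends
    with open(path, "rb") as handle:
        handle.seek(offset)
        return offset, handle.read(size)
```

A window drawn this way generally starts past the file header, which is the situation (carved fragments, mid-stream payloads) that the abstract contrasts with classification from the file head.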

citation

Bommarito, M. J. (2026). MimeLens: Pretrained Encoders for Fine-Grained Content-Type Detection. Working paper (draft).