How to Detect Backdoored ML Models Without Labeled Examples

Problem statement: Pre-trained models from public registries can pass every accuracy benchmark while hiding backdoors that activate only on attacker-chosen trigger inputs. Static analysis tools miss these because the backdoor lives in learned weights, not code. In 150 detection runs across 6 methods, Local Outlier Factor on raw activations achieved 0.622 AUROC at detecting backdoored models with zero labeled examples: modest but above chance, and the best unsupervised result I measured. ...

March 19, 2026 · 9 min · Rex Coleman
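A minimal sketch of the unsupervised approach the post above describes: fingerprint each model by its activations on a fixed probe set, then score outliers with Local Outlier Factor. The names (get_activations, probes) and the random stand-in "models" and labels are assumptions for illustration, not the post's actual pipeline, and the printed AUROC will not reproduce the 0.622 figure.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def get_activations(model, probes):
    """Placeholder: return one flat activation vector per model.
    In practice this would run the probes through the model and
    concatenate hidden-layer activations."""
    return model(probes).ravel()

# Fixed probe inputs chosen by the defender (random vectors as stand-ins).
probes = rng.normal(size=(64, 32))

# Candidate models pulled from a registry; here each "model" is just a
# random linear map so the sketch runs end to end.
weights = [rng.normal(size=(32, 8)) for _ in range(150)]
models = [lambda x, W=W: x @ W for W in weights]

# One activation fingerprint per model.
X = np.stack([get_activations(m, probes) for m in models])

# LOF scores each fingerprint relative to its neighbors;
# negative_outlier_factor_ is more negative for stronger outliers.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
anomaly_score = -lof.negative_outlier_factor_  # higher = more anomalous

# Labels are used only to evaluate the unsupervised score (stand-ins here).
labels = rng.integers(0, 2, size=len(models))
print("AUROC:", roc_auc_score(labels, anomaly_score))
```

No labels are needed to produce the anomaly scores themselves; ground truth enters only at evaluation time, which is what makes the method usable against a model registry where you cannot know in advance which uploads are poisoned.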

Antivirus for AI Models: Behavioral Fingerprinting Detects What Static Analysis Misses

A model poisoned through training data, one that behaves normally on 99.9% of inputs and activates a backdoor only on a specific trigger, passes every static analysis check. I built a behavioral fingerprinting system that detects these models using unsupervised anomaly detection: zero labeled backdoor examples, no model retraining, and 0.62 AUROC on deliberately subtle synthetic backdoors. Static tools like ModelScan catch serialization exploits. Behavioral fingerprinting catches what static analysis misses, and the defender controls the probe inputs, inverting the usual attacker advantage. This is a model supply chain problem analogous to the agent skill supply chain: in both cases, third-party artifacts execute inside your system and static analysis misses behavioral threats. ...

March 16, 2026 · 6 min · Rex Coleman
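A hedged sketch of the fingerprint-and-scan idea from the post above: characterize each model by its outputs on defender-chosen probes, fit a novelty detector on fingerprints of trusted models, then flag a downloaded model that falls outside that distribution. The softmax fingerprint, the trusted_models population, and the linear stand-in models are illustrative assumptions, not the post's implementation.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fingerprint(model, probes):
    """Behavioral fingerprint: concatenated output distributions on the
    defender's probe set. Because the defender picks the probes, an
    attacker cannot tune the backdoor to evade inspection."""
    return softmax(model(probes)).ravel()

probes = rng.normal(size=(32, 16))  # defender-chosen probe inputs

# Population of models we already trust (random linear maps as stand-ins).
trusted_weights = [rng.normal(size=(16, 4)) for _ in range(50)]
trusted_models = [lambda x, W=W: x @ W for W in trusted_weights]

# Fit a novelty detector on the trusted fingerprints.
X_ref = np.stack([fingerprint(m, probes) for m in trusted_models])
detector = LocalOutlierFactor(n_neighbors=10, novelty=True).fit(X_ref)

# Scan a newly downloaded model: predict() returns -1 for outliers.
W_cand = rng.normal(size=(16, 4))
candidate = lambda x: x @ W_cand
verdict = detector.predict(fingerprint(candidate, probes).reshape(1, -1))
print("backdoor suspect" if verdict[0] == -1
      else "consistent with trusted models")
```

The design choice worth noting is that the scan never touches weights or serialization formats, so it complements rather than replaces static tools like ModelScan.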