How to Detect Backdoored ML Models Without Labeled Examples

Problem Statement

Pre-trained models from public registries can pass every accuracy benchmark while hiding backdoors that activate only on attacker-chosen trigger inputs. Static analysis tools miss these because the backdoor lives in learned weights, not code. In 150 detection runs across 6 methods, Local Outlier Factor on raw activations achieved 0.622 AUROC at detecting backdoored models with zero labeled examples: modest but above chance, and the best unsupervised result I measured. ...
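
A minimal sketch of that unsupervised setup, assuming each model's raw activations have already been flattened into one feature vector per model; the array names and the synthetic data below are illustrative, not from the post:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Placeholder stand-in for real data: 150 models x 512 activation features.
activations = rng.normal(size=(150, 512))
# Ground-truth labels (1 = backdoored) are used only to score the detector;
# the detector itself never sees them.
labels = rng.integers(0, 2, size=150)

# LOF with the default novelty=False fits and scores the same population.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(activations)
# negative_outlier_factor_ is more negative for stronger outliers, so flip
# the sign to get a score where higher = more likely backdoored.
scores = -lof.negative_outlier_factor_

print("AUROC:", roc_auc_score(labels, scores))  # ~0.5 on this random data
```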

March 19, 2026 · 9 min · Rex Coleman

ICA+GMM improves backdoor cluster detection by 163%

Combining Independent Component Analysis (ICA) with Gaussian Mixture Models (GMM) improved backdoor cluster detection by 163% compared to standard PCA+KMeans approaches in model behavioral fingerprinting experiments. The improvement was consistent across multiple trigger types and model architectures.

Why this matters

Backdoor detection in neural networks is an unsupervised problem: you don’t know which models are trojaned, and you don’t know what the trigger looks like. Most existing approaches use PCA for dimensionality reduction and KMeans for clustering, then look for outlier clusters. This works, but it misses subtle backdoors where the behavioral signature is non-Gaussian or where multiple backdoor variants coexist in the same model population. ...
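
A minimal sketch of the two pipelines being compared, assuming one behavioral fingerprint vector per model; the component and cluster counts are illustrative guesses, not the post's settings:

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
fingerprints = rng.normal(size=(200, 64))  # placeholder fingerprint vectors

# Baseline: PCA + KMeans, flagging the smallest cluster as suspicious.
pca_feats = PCA(n_components=10).fit_transform(fingerprints)
km_labels = KMeans(n_clusters=4, n_init=10).fit_predict(pca_feats)
suspicious_cluster = int(np.argmin(np.bincount(km_labels)))

# Alternative: ICA + GMM. ICA does not assume Gaussian components, and the
# fitted GMM yields a per-model log-likelihood, so low-likelihood models can
# be flagged directly instead of relying on cluster sizes.
ica_feats = FastICA(n_components=10, random_state=0).fit_transform(fingerprints)
gmm = GaussianMixture(n_components=4, random_state=0).fit(ica_feats)
anomaly_scores = -gmm.score_samples(ica_feats)  # higher = more anomalous
```

The per-model likelihood is plausibly what helps with coexisting backdoor variants: each variant can occupy its own mixture component rather than being forced toward a single KMeans centroid.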

March 19, 2026 · 2 min · Rex Coleman

Antivirus for AI Models: Behavioral Fingerprinting Detects What Static Analysis Misses

A model poisoned through training data — one that behaves normally on 99.9% of inputs and activates a backdoor only on a specific trigger — passes every static analysis check. I built a behavioral fingerprinting system that detects these models using unsupervised anomaly detection: zero labeled backdoor examples, no model retraining, AUROC 0.62 on deliberately subtle synthetic backdoors. Static tools like ModelScan catch serialization exploits. Behavioral fingerprinting catches what static misses — and the defender controls the probe inputs, inverting the usual attacker advantage. This is a model supply chain problem analogous to the agent skill supply chain — in both cases, third-party artifacts execute inside your system and static analysis misses behavioral threats. ...
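
A hedged sketch of the defender-controlled probing idea, assuming a PyTorch model and a fixed probe batch; `models` and `probes` are placeholders the post does not specify:

```python
import numpy as np
import torch

def fingerprint(model: torch.nn.Module, probes: torch.Tensor) -> np.ndarray:
    """Run a fixed, defender-chosen probe batch through a model and flatten
    its outputs into a single behavioral fingerprint vector."""
    model.eval()
    with torch.no_grad():
        out = model(probes)  # e.g. logits of shape (n_probes, n_classes)
    return out.flatten().cpu().numpy()

# The same probe batch goes to every candidate model, so differences between
# fingerprints reflect differences in learned behavior, not in the inputs.
# Stacked fingerprints feed any unsupervised detector (LOF, GMM, ...):
# fingerprints = np.stack([fingerprint(m, probes) for m in models])
```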

March 16, 2026 · 6 min · Rex Coleman
© 2026 Rex Coleman. Content under CC BY 4.0. Code under MIT. GitHub · LinkedIn · Email