How to Detect Backdoored ML Models Without Labeled Examples

Problem statement: Pre-trained models from public registries can pass every accuracy benchmark while hiding backdoors that activate only on attacker-chosen trigger inputs. Static analysis tools miss these because the backdoor lives in learned weights, not code. In 150 detection runs across 6 methods, Local Outlier Factor on raw activations achieved 0.622 AUROC at detecting backdoored models with zero labeled examples — modest but above chance, and the best unsupervised result I measured. ...

March 19, 2026 · 9 min · Rex Coleman
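The excerpt's approach — scoring models by Local Outlier Factor over their activation fingerprints, with no labels — can be sketched as follows. The data here is synthetic and the fingerprinting step is assumed (one feature row per model, collected on a shared probe set); only the LOF scoring mirrors the method named above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Stand-in activation fingerprints: one row per model, columns are
# activation statistics gathered on a shared probe set. Synthetic data
# for illustration; real fingerprints would come from the model zoo.
clean = rng.normal(0.0, 1.0, size=(90, 64))
backdoored = rng.normal(0.6, 1.0, size=(10, 64))  # shifted activation profile
X = np.vstack([clean, backdoored])

# LOF scores each model by its density relative to its k nearest
# neighbors — fully unsupervised, no labeled backdoored examples needed.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
scores = -lof.negative_outlier_factor_  # higher = more anomalous

# Rank models by anomaly score; the most suspicious come first.
suspects = np.argsort(scores)[::-1][:10]
print("top suspects:", suspects)
```

In practice the threshold (or the number of models to flag) is a policy choice; ranking by score and auditing from the top down avoids committing to a cutoff.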

ICA+GMM improves backdoor cluster detection by 163%

Combining Independent Component Analysis (ICA) with Gaussian Mixture Models (GMM) improved backdoor cluster detection by 163% compared to standard PCA+KMeans approaches in model behavioral fingerprinting experiments. The improvement was consistent across multiple trigger types and model architectures. Why this matters: Backdoor detection in neural networks is an unsupervised problem — you don’t know which models are trojaned, and you don’t know what the trigger looks like. Most existing approaches use PCA for dimensionality reduction and KMeans for clustering, then look for outlier clusters. This works, but it misses subtle backdoors where the behavioral signature is non-Gaussian or where multiple backdoor variants coexist in the same model population. ...

March 19, 2026 · 2 min · Rex Coleman
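A minimal sketch of the ICA+GMM pipeline described above, using scikit-learn. The fingerprints are synthetic stand-ins (the backdoored group is given a bimodal, non-Gaussian signature of the kind the excerpt says PCA+KMeans tends to blur), and the cluster count is an assumption; the two steps — ICA projection, then GMM clustering with the smallest-weight component flagged — follow the method named in the post.

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Synthetic behavioral fingerprints (illustration only): most models share
# one behavior; a small backdoored group carries a bimodal signature.
clean = rng.normal(0.0, 1.0, size=(180, 32))
trig = np.where(rng.random((20, 32)) < 0.5, -3.0, 3.0)  # bimodal offsets
backdoored = rng.normal(0.0, 0.2, size=(20, 32)) + trig
X = np.vstack([clean, backdoored])

# Step 1: ICA recovers statistically independent, non-Gaussian directions
# that PCA's variance-maximizing projection can miss.
ica = FastICA(n_components=8, random_state=0)
S = ica.fit_transform(X)

# Step 2: a GMM fits soft, ellipsoidal clusters; the smallest-weight
# component is the candidate backdoor cluster to audit first.
gmm = GaussianMixture(n_components=3, random_state=0).fit(S)
labels = gmm.predict(S)
suspect = int(np.argmin(gmm.weights_))
print("suspect cluster size:", int(np.sum(labels == suspect)))
```

Unlike KMeans, the GMM yields per-model membership probabilities, so borderline models can be surfaced for manual review rather than hard-assigned to a cluster.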
© 2026 Rex Coleman. Content under CC BY 4.0. Code under MIT. GitHub · LinkedIn · Email