How to Detect Backdoored ML Models Without Labeled Examples
Problem Statement You download a pre-trained model from a public registry – Hugging Face, PyTorch Hub, TensorFlow Hub. The model passes all standard accuracy benchmarks. It performs well on your test set. But it has been backdoored: it contains a hidden behavior that activates only when a specific trigger pattern is present in the input. Standard testing will not catch it because the trigger is not in your test data. ...