BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain (2017) Notes
Code: https://github.com/Billy1900/BadNet (reproduced version on CIFAR-10 and MNIST)
- Property: a BadNet has state-of-the-art performance on the training and test datasets but misbehaves on specific attacker-chosen inputs.
- Context: outsourced training and transfer learning
Case study: MNIST digit recognition attack
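Two different backdoors:
- a) a single-pixel backdoor: a single bright pixel in the bottom-right corner of the image
- b) a pattern backdoor: a pattern of bright pixels, also in the bottom-right corner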
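A minimal sketch of how the two triggers could be stamped onto a 28x28 MNIST image. The function name `add_trigger`, the exact pixel coordinates, and the pattern layout are assumptions made for illustration, not values taken from the paper or the linked repo.

```python
import numpy as np

def add_trigger(image, kind="single_pixel", value=255):
    """Return a copy of a 2-D grayscale image with a backdoor trigger stamped on it."""
    img = np.array(image, copy=True)
    h, w = img.shape
    if kind == "single_pixel":
        # One bright pixel near the bottom-right corner (exact position is an assumption).
        img[h - 2, w - 2] = value
    elif kind == "pattern":
        # A few bright pixels near the bottom-right corner (layout is an assumption).
        for dr, dc in [(2, 2), (2, 4), (3, 3), (4, 2)]:
            img[h - dr, w - dc] = value
    else:
        raise ValueError(f"unknown trigger kind: {kind}")
    return img

# Usage: stamp both triggers on a blank image; the input is left untouched.
clean = np.zeros((28, 28), dtype=np.uint8)
single = add_trigger(clean, "single_pixel")
pattern = add_trigger(clean, "pattern")
```

Returning a copy keeps the clean image available, which matters because the poisoned training set contains both clean and backdoored samples.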
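Two attacks:
- a) a single-target attack: backdoored images of digit i are labeled as digit j (i -> j)
- b) an all-to-all attack: backdoored images of digit i are labeled as digit i+1 (i -> i+1)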
- Strategy: poison the training dataset; the attacker may also change training parameters such as the batch size and step size.
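The two label mappings and the poisoning step can be sketched roughly as below. This is an illustration under stated assumptions: `backdoor_label`, `poison`, and the `stamp` callable are hypothetical names; wrapping 9 -> 0 in the all-to-all case, appending backdoored copies instead of replacing the originals, and taking `fraction` relative to the eligible images are choices of this sketch.

```python
import random

def backdoor_label(label, attack, target=None, num_classes=10):
    """Label the attacker assigns to a backdoored image whose true label is `label`."""
    if attack == "single_target":
        return target                      # i -> j
    if attack == "all_to_all":
        return (label + 1) % num_classes   # i -> i+1; wrapping 9 -> 0 is an assumption
    raise ValueError(f"unknown attack: {attack}")

def poison(images, labels, stamp, attack, source=None, target=None,
           fraction=0.1, seed=0):
    """Return (images, labels) with backdoored copies appended to the clean data.

    `stamp` is any function that returns a triggered copy of one image.
    """
    rng = random.Random(seed)
    # Single-target attacks only backdoor images of the source digit i.
    eligible = [k for k, y in enumerate(labels)
                if attack == "all_to_all" or y == source]
    chosen = rng.sample(eligible, int(round(fraction * len(eligible))))
    new_images = list(images) + [stamp(images[k]) for k in chosen]
    new_labels = list(labels) + [backdoor_label(labels[k], attack, target)
                                 for k in chosen]
    return new_images, new_labels
```

For example, `poison(images, labels, stamp, "single_target", source=1, target=5)` would add triggered copies of some digit-1 images, all labeled 5.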
Results:
- a) single-target attack: the error rate on clean images is extremely low; the largest error rate on backdoored images occurs for the attack in which backdoored images of digit 1 are mislabeled by the BadNet as digit 5.
- b) all-to-all attack: the average error rate on clean images is actually lower for the BadNet than for the original network; meanwhile, the average error on backdoored images is extremely low, which means the BadNet successfully mislabels >99% of backdoored images.
- Analysis of attack: the BadNet learns filters dedicated to the backdoor, which suggests that the presence of a backdoor is sparsely coded in the deeper layers of the BadNet.
The paper also shows that as the relative fraction of backdoored images in the training dataset increases, the error rate on clean images increases while the error rate on backdoored images decreases. Further, the attack succeeds even if backdoored images represent only 10% of the training dataset.
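To make the two error rates concrete, here is a small sketch of how they could be computed. The helper name `error_rate` and the toy predictions are made up for illustration. Note that the error on backdoored images is measured against the attacker's chosen labels, so a low value means the attack works.

```python
def error_rate(predictions, expected):
    """Fraction of predictions that differ from the expected labels."""
    if len(predictions) != len(expected) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(p != e for p, e in zip(predictions, expected))
    return wrong / len(expected)

# Clean error: predictions on clean images vs. the true labels.
clean_error = error_rate([7, 2, 1, 0], [7, 2, 1, 0])       # 0.0

# Backdoor error: predictions on triggered images vs. the attacker's labels
# (here the all-to-all targets i+1 for true labels 7, 2, 1, 0).
backdoor_error = error_rate([8, 3, 2, 0], [8, 3, 2, 1])    # 0.25

# The attack success rate is the complement of the backdoor error.
attack_success = 1.0 - backdoor_error
```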
Case study: traffic sign detection attack
- Model: Faster R-CNN, which contains three sub-networks: (1) a shared CNN that extracts the features of the input image for the other two sub-networks; (2) a region proposal CNN that identifies bounding boxes within an image that might correspond to objects of interest (these are referred to as region proposals); and (3) a traffic sign classification fully connected network (FCNN) that classifies each region either as not a traffic sign or as one of the different types of traffic signs.
- Three super-classes: stop signs, speed-limit signs and warning signs.
- Three triggers: (i) a yellow square, (ii) an image of a bomb, and (iii) an image of a flower
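A rough sketch of how the yellow-square trigger could be pasted onto a sign, given its bounding box. The function name `stamp_yellow_square`, the relative size (`scale`), and the bottom-center placement are assumptions of this sketch; the bomb and flower triggers would paste a small image into the same region instead of a flat color.

```python
import numpy as np

def stamp_yellow_square(image, box, scale=0.2):
    """Paste a yellow square at the bottom center of a bounding box.

    image: H x W x 3 uint8 RGB array.
    box:   (x_min, y_min, x_max, y_max) in pixel coordinates.
    scale: side of the square relative to the shorter box side (assumed value).
    """
    img = image.copy()
    x0, y0, x1, y1 = box
    side = max(1, int(scale * min(x1 - x0, y1 - y0)))
    left = (x0 + x1) // 2 - side // 2   # horizontally centered in the box
    top = y1 - side                     # flush with the bottom edge of the box
    img[top:y1, left:left + side] = (255, 255, 0)  # RGB yellow
    return img

# Usage on a blank 100x100 image with a hypothetical sign box.
scene = np.zeros((100, 100, 3), dtype=np.uint8)
backdoored = stamp_yellow_square(scene, (20, 20, 60, 60))
```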
More updates: https://github.com/Billy1900/Backdoor-Learning