
Adversarial Machine Learning Attack papers Summary

In this blog, I summarize nine of the most important papers on adversarial attacks against DNNs, from the paper that first introduced adversarial examples to advanced attack methods such as the C&W attack, BPDA, and EOT.

FYI, I found a very useful lecture by Prof. Somesh Jha that summarizes both attacks on and defenses for machine learning.

1. Poisoning Attacks against Support Vector Machines

Author-Time-ArXiv: Battista Biggio, Blaine Nelson, Pavel Laskov; ICML 2012; 1206.6389

Keywords: Poisoning Attack

The paper uses a gradient-ascent attack that injects crafted training points into an SVM's training set to increase the model's test error; a toy sketch follows below.
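To make the idea concrete, here is a toy sketch of poisoning by gradient ascent. It is not the paper's analytic update (which differentiates through the SVM's optimality conditions); instead it retrains an sklearn SVM with one injected point and moves that point along a numerically estimated gradient of a validation hinge loss. All data and hyperparameters below are made up for illustration.

```python
# Toy poisoning sketch: move one injected training point by gradient ascent on
# a validation loss. Numerical gradients stand in for the paper's analytic ones.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2)) + np.repeat([[2.0, 2.0], [-2.0, -2.0]], 50, axis=0)
y_train = np.array([0] * 50 + [1] * 50)
X_val, y_val = X_train.copy(), y_train.copy()      # toy validation set

def val_hinge_loss(x_poison, y_poison=1):
    """Retrain the SVM with one poisoned point and return validation hinge loss."""
    clf = SVC(kernel="linear", C=1.0).fit(np.vstack([X_train, x_poison]),
                                          np.append(y_train, y_poison))
    margins = clf.decision_function(X_val) * (2 * y_val - 1)   # signed margins
    return np.mean(np.maximum(0.0, 1.0 - margins))

x_c, step, eps = np.array([0.0, 0.0]), 0.5, 1e-2
for _ in range(20):                                 # gradient ascent on the loss
    grad = np.array([(val_hinge_loss(x_c + eps * e) - val_hinge_loss(x_c - eps * e))
                     / (2 * eps) for e in np.eye(2)])
    x_c = x_c + step * grad                         # move the poison point
print("validation hinge loss with poison:", val_hinge_loss(x_c))
```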

2. Intriguing properties of neural networks

Author-Time-ArXiv: Christian Szegedy et al.; ICLR 2014; 1312.6199

Key points:

  • The semantic information in neural networks is carried by the whole activation space rather than by individual units.
  • The input-output mapping is discontinuous to a significant extent, so small perturbations can flip the prediction.
  • Adversarial perturbations transfer across models.
  • Proposes a targeted attack based on box-constrained L-BFGS, a quasi-Newton optimization method.

2.1 Introduction

The paper examines two counter-intuitive properties of DNNs: the semantic meaning of individual units and the stability of neural networks under small perturbations of the input.

First, they conclude that the semantic information is contained in the whole activation space rather than in the individual units (the natural basis of the last feature layer) used in previous research. Then, they find that a small perturbation is able to change the network's prediction; they obtain adversarial examples by perturbing training inputs so as to maximize the prediction error. What's more, they show that these adversarial examples transfer to another model trained on a different subset of the dataset. Consequently, the structure of these blind spots is connected to the data distribution in a non-obvious way.

2.2 Blind Spots in Neural Networks

Adversarial examples live in regions of the input space that contain no nearby training examples, yet share the label and the statistical structure of the original inputs. In other words, the smoothness assumption, that predictions change smoothly under small perturbations, does not hold, so local generalization around the training data does not rule out adversarial examples in the input space.

Formal Description

Minimize \(c\cdot\vert\vert r\vert\vert_2+loss_f(x+r,l)\) subject to \(x+r\in [0,1]^m\)

\(f:\mathbb{R}^m\rightarrow\{1\dots k\}\) is a classifier mapping image pixel value vectors to a discrete label set.

\(loss_f:\mathbb{R}^m\times\{1\dots k\}\rightarrow\mathbb{R}^+\) is a continuous loss function.

Target label \(l\in\{1\dots k\}\)
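A hedged sketch of how this box-constrained problem can be solved with off-the-shelf L-BFGS-B. `model`, `x`, `target`, and the constant `c` are assumed placeholders (any differentiable PyTorch classifier over inputs in \([0,1]^m\) with a batch dimension); the paper additionally line-searches over \(c\) to find the minimal-distortion \(r\).

```python
# Sketch of the box-constrained L-BFGS attack: minimize c*||r||^2 + loss(x+r, l)
# subject to x + r in [0, 1]^m, using scipy's L-BFGS-B for the box constraint.
import numpy as np
import torch
import torch.nn.functional as F
from scipy.optimize import minimize

def lbfgs_attack(model, x, target, c=1.0):
    x0 = x.detach().numpy().astype(np.float64).ravel()

    def objective(x_adv_flat):
        x_adv = torch.tensor(x_adv_flat.reshape(tuple(x.shape)),
                             dtype=torch.float32, requires_grad=True)
        r = x_adv - x
        loss = c * r.pow(2).sum() + F.cross_entropy(model(x_adv),
                                                    torch.tensor([target]))
        loss.backward()
        return loss.item(), x_adv.grad.numpy().astype(np.float64).ravel()

    res = minimize(objective, x0, jac=True, method="L-BFGS-B",
                   bounds=[(0.0, 1.0)] * x0.size)      # keep x + r inside [0, 1]^m
    return torch.tensor(res.x, dtype=torch.float32).reshape(x.shape)
```

The squared \(l_2\) norm is used here for smoothness; the paper's formulation penalizes \(\vert\vert r\vert\vert_2\) and searches for the smallest \(c\) that still yields a misclassified \(x+r\).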

3. Explaining and Harnessing Adversarial Examples

Author-Time-ArXiv: Ian Goodfellow et al.; ICLR 2015; 1412.6572

Key Points:

  • The main cause of adversarial examples is the linear nature of DNNs, combined with sufficiently high-dimensional inputs.
  • Proposes the Fast Gradient Sign Method (FGSM) to generate adversarial examples.
  • Explores adversarial training on both simple models and DNN models.

3.1 The Linear Explanation and Perturbation

Adversarial examples appear because DNNs do not learn the true concepts underlying the data; they are only fit to a finite, discontinuous sample of it. Because the models behave largely linearly, many small per-dimension perturbations that cannot be detected by eye add up to a large change in the output.

A simple linear model can have adversarial examples if its input has enough dimensionality, which is why even softmax regression is vulnerable. The paper therefore proposes the fast gradient sign method, which generates adversarial examples using the optimal max-norm-constrained perturbation:

\[\eta=\epsilon sign(\nabla_xJ(\theta,x,y))\]

\(\theta\) denotes the model parameters, \(x\) is the input, and \(y\) is the true label. \(J(\theta,x,y)\) is the loss function, and its gradient with respect to \(x\) gives the direction of the perturbation along each input dimension.
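A minimal FGSM sketch in PyTorch, assuming `model` is any differentiable classifier and `x`, `y` are a batch of inputs in \([0,1]\) with their true labels; \(\epsilon\) is a placeholder budget.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=0.03):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)       # J(theta, x, y)
    loss.backward()
    eta = epsilon * x.grad.sign()             # eta = epsilon * sign(grad_x J)
    return (x + eta).clamp(0, 1).detach()     # keep the adversarial image valid
```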

3.2 Adversarial Training

More work needed: the paper gives an example of training a logistic regression model on adversarial examples and relates it to weight decay; the problem is that this kind of worst-case regularization tends to make the model underfit.

The paper also uses adversarial training on DNN models. The objective function based on FGSM is an effective regularizer:

\[\tilde J(\theta,x,y)=\alpha J(\theta,x,y)+(1-\alpha)J(\theta,x+\epsilon sign(\nabla_xJ(\theta,{x},y)))\]

This means the adversarial examples are regenerated as the model is updated during training.
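A sketch of one training step with this objective, reusing the `fgsm` sketch above; `model`, `optimizer`, and the batch `(x, y)` are assumed to exist, and \(\alpha=0.5\) follows the paper's default weighting.

```python
import torch.nn.functional as F

def adversarial_training_step(model, x, y, optimizer, epsilon=0.03, alpha=0.5):
    x_adv = fgsm(model, x, y, epsilon)                 # regenerate adversarial batch
    optimizer.zero_grad()
    loss = alpha * F.cross_entropy(model(x), y) \
         + (1 - alpha) * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```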

3.3 Further Experiments

Adversarial examples occupy contiguous regions of the subspace defined by the FGSM direction. The paper also examines two hypotheses:

  • Adversarial training adds constraints to the training process, but the improvement in robustness is not sufficient.

  • Ensembling (averaging) several models, intended to wash out adversarial perturbations, provides only limited resistance.

Appendix: Rubbish Class Examples

These examples are degenerate inputs that are meaningless to a human for every category (e.g., noise) but are classified into some class with high confidence by DNN models. Ideally, a model should output low confidence for every category on such inputs.

Guess:

  1. Adversarial examples exist because a DNN cannot constrain all directions of the input space; doing so would hurt accuracy on the validation set.
  2. We might use per-class constraints in the input space to improve the robustness of DNN models.

4. Adversarial Examples In the Physical World

Author-Time-ArXiv: Alexey Kurakin et al.; ICLR 2017; 1607.02533

Key Points:

  • ML systems deployed in physical-world scenarios are vulnerable to adversarial examples.
  • Proposes the iterative FGSM and the least-likely class attack.
  • Studies the impact of image transformations on adversarial examples.

4.1 Iterative Fast Gradient Sign Method

The paper puts forward two iterative variants of FGSM: the basic iterative method and the iterative least-likely class method.

4.1.1 Basic iterative FGSM

The basic iterative method, a non-targeted attack, generates adversarial examples as follows:

\[X_0^{adv} = X, X_{N+1}^{adv}=Clip_{X,\epsilon}\{X_N^{adv}+\alpha sign(\nabla_XJ(X_N^{adv},y_{true}))\}\]

\(X\) is a 3-D tensor (width, height, depth) whose entries are integers in the range [0, 255]. \(y_{true}\) is the ground-truth label. \(J({X},y)\) is the cross-entropy cost function of the neural network. \(Clip_{X,\epsilon}\{X'\}\) clips the image per pixel so that the result stays within the \(L_\infty\) \(\epsilon\)-neighborhood of the original image.

\[Clip_{X,\epsilon}\{X'\}(x,y,z)=min\{255, X(x,y,z)+\epsilon,max\{0,X(x,y,z)-\epsilon,X'(x,y,z)\}\}\]

where \({X}(x,y,z)\) is the value of channel \(z\) of the image \({X}\) at coordinates \((x,y)\).
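A sketch of the basic iterative method, assuming `model`, a batch `x` of float images scaled to \([0,1]\) (rather than the paper's \([0,255]\) integers), and true labels `y_true`; the step size and iteration count are placeholders.

```python
import torch
import torch.nn.functional as F

def basic_iterative_fgsm(model, x, y_true, epsilon=8/255, alpha=1/255, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_true)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Clip_{X,eps}: stay inside the L_inf eps-neighborhood and the valid range
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon).clamp(0, 1)
    return x_adv
```

The least-likely class variant below is the same loop with the gradient sign subtracted and \(y_{true}\) replaced by \(y_{LL}\).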

4.1.2 Iterative least-likely Class Method

The iterative least-likely class method, also called LLC, is a targeted attack. The target label is defined as \(y_{LL}=\underset{y}{\mathrm{argmin}}\{p(y\vert {X})\}\), the least-likely class among all categories.

\[{X}_0^{adv} = {X}, {X}_{N+1}^{adv}=Clip_{X,\epsilon}\{X_N^{adv}-\alpha sign(\nabla_XJ({X}_N^{adv},y_{LL}))\}\]

4.2 Adversarial Examples in Real World

Transformations such as blur, noise, and JPEG encoding partially destroy adversarial perturbations, while changing brightness or contrast has little effect.

5. The Limitations of Deep Learning in Adversarial Settings

Author-Time-ArXiv: Nicolas Papernot et al.; EuroS&P 2016; 1511.07528

Key Points:

  • Proposes the Jacobian-based Saliency Map Attack (JSMA) for acyclic (feed-forward) DNN models.

In general, the attack constructs a saliency map, which measures the impact of each input feature on the output via the output's gradient, to find the most important features, and then modifies those features to generate adversarial examples.

The JSMA iterates the following steps while \(F(X^*)\ne Y^*\) and \(\vert\vert \delta_X\vert\vert < \Upsilon\):

  1. Forward Derivative of a Deep Neural Network from the input to the layer before output

    A general way to calculate the forward derivative for a given \(X\): \(\nabla F(X)=\frac{\partial F(X)}{\partial X}=[\frac{\partial F_j(X)}{\partial x_i}]_{i,j\in 1\dots M}\); the formula is essentially the Jacobian of the learned function.

    As to the DNN model, the Jacobian can be computed as follow:

    \[\frac{\partial F_j(X)}{\partial x_i}=\left(W_{n+1,j}\cdot\frac{\partial H_n}{\partial x_i}\right)\times\frac{\partial f_{n+1,j}}{\partial x_i}(W_{n+1,j}\cdot H_n+b_{n+1,j})\]

    where the output neuron \(j\) computes the following expression: \(F_j(X)=f_{n+1,j}(W_{n+1,j}\cdot H_n+b_{n+1,j})\)

  2. Constructing the saliency map \(S(X,t)\), which aims to increase the probability of the target label \(t\) and decrease the probabilities of the other labels.

    \[S(X,t)[i]=\begin{cases} 0~\text{if}~\frac{\partial F_t(X)}{\partial X_i}<0 ~\text{or}~ \sum_{j\ne t}\frac{\partial F_j(X)}{\partial X_i}>0\\ \left(\frac{\partial F_t(X)}{\partial X_i}\right)\left\vert \sum_{j\ne t}\frac{\partial F_j(X)}{\partial X_i}\right\vert~\text{otherwise}\end{cases}\]

    where \(t\) is the target class \(Y^*\).

  3. Modifying the selected feature \(X_{i_{max}}\), where \(i_{max}=\mathop{\arg\max}_\limits{i}S(X,Y^*)[i]\), in the current sample. The perturbation added to this input feature is set to \(+1\) or \(-1\), which increases or decreases the pixel intensity. A sketch of one such step follows.
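A rough sketch of one JSMA step for a single example, using autograd to build the Jacobian row by row; `model` is an assumed classifier returning per-class outputs \(F(X)\), and modifying only one feature per step (instead of the paper's feature pairs) is a simplification.

```python
import torch

def jsma_step(model, x, t, theta=1.0):
    x = x.clone().detach().requires_grad_(True)
    out = model(x).squeeze(0)                                # F(X), one example
    jac = torch.stack([torch.autograd.grad(out[j], x, retain_graph=True)[0].flatten()
                       for j in range(out.numel())])         # (classes, features)
    dFt = jac[t]                                             # dF_t / dX_i
    dFother = jac.sum(dim=0) - dFt                           # sum_{j != t} dF_j / dX_i
    saliency = torch.where((dFt < 0) | (dFother > 0),        # zero out invalid features
                           torch.zeros_like(dFt), dFt * dFother.abs())
    i_max = saliency.argmax()                                # most salient feature
    x_new = x.detach().flatten()
    x_new[i_max] = (x_new[i_max] + theta).clamp(0, 1)        # increase its intensity
    return x_new.reshape(x.shape)
```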

6. Towards Evaluating the Robustness of Neural Networks

The details of the C&W attack are covered in another blog post.

7. Towards Deep Learning Models Resistant to Adversarial Attacks

Author-Time-ArXiv: Aleksander Madry et al.; ICLR 2018; 1706.06083

Key Points:

  • Proposes the Projected Gradient Descent (PGD) attack and PGD-based adversarial training.

The essence of the approach is robust optimization: the strongest available attack is used to train the most robust model, so the adversarial examples and the DNN model are improved at the same time. The paper formulates adversarial robustness as a min-max (saddle-point) optimization problem rather than a pure minimization or maximization problem:

\(\mathop{\min}_\limits{\theta} \rho(\theta)\), where \(\rho(\theta)=\mathbb{E}_{(x,y)\sim D}[max_{\delta\in S}L(\theta,x+\delta,y)]\)

\(D\) is the data distribution and \(\mathbb{E}\) denotes the expected (average) loss over it. The inner maximization finds the perturbation \(\delta\in S\) that maximizes the loss, while the outer minimization adjusts the model parameters \(\theta\) to reduce this worst-case loss.

The paper uses FGSM and iterative FGSM, which is essentially projected gradient descent. PGD is defined as follows:

\[x^{t+1}=\prod_{x+S}(x^t+\alpha sign(\nabla_xL(\theta,x,y)))\]

\(\prod_{x+S}\) denotes projecting the iterate back onto the feasible set \(x+S\). The projection step can be described as follows: take a gradient step, then return the closest point in the constraint set \(S\):

\[min_x f(x)~s.t.~x\in S\] \[p^{t+1} = x^t+\alpha sign(\nabla f(x^t)) \\ x^{t+1} = \text{arg} \min_{x \in S} \vert\vert p^{t+1}-x\vert\vert\]
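A PGD sketch for the \(l_\infty\) ball, assuming `model` and a batch `(x, y)` in \([0,1]\); it differs from the iterative FGSM sketch above mainly in the random start inside the \(\epsilon\)-ball (the paper also uses multiple random restarts), and adversarial training then minimizes the training loss on these PGD examples.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=8/255, alpha=2/255, steps=40):
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Projection onto x + S (the l_inf eps-ball) and the valid pixel range
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon).clamp(0, 1)
    return x_adv
```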

8. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples

Author-Time-ArXiv: Anish Athalye et al.; ICML 2018 Best Paper; 1802.00420

Key Points:

  • Proposes Backward Pass Differentiable Approximation (BPDA), which is effective against obfuscated-gradient defenses.
  • Shows that BPDA performs very well against several gradient-masking defenses from ICLR 2018.

Basically, these advanced defenses try to break gradient descent through gradient masking. The paper identifies three types of obfuscated gradients and evaluates its attacks against defenses in each category:

  • Shattered Gradients
  • Stochastic Gradients
  • Exploding and Vanishing Gradients

8.1 The Algorithm of Backward Pass Differentiable Approximation

In general, BPDA applies the following steps within an iterative optimization-based attack (PGD for \(l_\infty\), the C&W attack for \(l_2\)):

Algorithm:

Let \(f(\cdot)=f^{1\dots j}(\cdot)\) be a neural network and let \(f^i(\cdot)\) be a non-differentiable layer.

  1. To approximate \(\nabla_xf(x)\), find a differentiable approximation \(g(x)\) such that \(g(x)\approx f^i(x)\) (for input-transformation defenses where \(f^i(x)\approx x\), \(g\) can simply be the identity function).
  2. Calculate \(\nabla_xf(x)\) by performing the forward pass through \(f( \cdot )\).

  3. On the backward pass, replace \(f^i(x)\) with \(g(x)\).

In contrast to a standard PGD or C&W attack, BPDA requires more iterations of gradient descent because each gradient is only approximate. A sketch of the wrapper follows.
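A minimal sketch of the backward-pass substitution in PyTorch via a custom autograd function. Here the approximation \(g\) is the identity, which assumes \(f^i\) is a shape-preserving input transformation with \(f^i(x)\approx x\); `non_diff_preprocess` and `model` are placeholders.

```python
import torch

class BPDAIdentity(torch.autograd.Function):
    """Forward through the exact (non-differentiable) f^i, backward as if g(x) = x."""

    @staticmethod
    def forward(ctx, x, f_i):
        return f_i(x)                  # exact forward pass through the defense layer

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None       # pretend df^i/dx = I on the backward pass

def defended_logits(model, non_diff_preprocess, x):
    return model(BPDAIdentity.apply(x, non_diff_preprocess))
```

Any iterative attack (for example the PGD sketch above) can then be run against `defended_logits` unchanged.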

The paper lists 7 defenses accepted at ICLR 2018 that rely on obfuscated gradients and uses them to demonstrate the effectiveness of BPDA.

8.2 Gradient Shattering

First, the paper evaluates defenses that do not obfuscate gradients: adversarial training and cascade adversarial training, which trains a first model to generate adversarial examples and then trains a second model on the dataset augmented with them (added in a single step for efficiency). Since these two defenses do not rely on obfuscated gradients, the paper does not discuss them deeply.

8.2.1 Thermometer Encoding

The definition of thermometer encoding is as follows. Given an image \(x\), for each pixel value \(x_{i,j,c}\), the \(l\)-level thermometer encoding \(\tau(x_{i,j,c})\) is an \(l\)-dimensional vector where \(\tau(x_{i,j,c})_k=1\) if \(x_{i,j,c}>k/l\) and \(0\) otherwise. For example, \(\tau(0.66)=1111110000\) is a 10-level thermometer encoding.

In general, the thermometer encoding defense can be seen as a way to shatter gradients, so gradient descent cannot be performed directly on such DNNs.

As to BPDA, the paper sets \(g(x)_{i,j,c,k}=min(max(x_{i,j,c}-k/l,0),1)\) and uses it in place of the encoding on the backward pass.

Indeed, \(\tau(x_{i,j,c})_k=\lceil g(x)_{i,j,c,k}\rceil\). The results show that BPDA performs very well against this defense; a sketch of the encoding and its surrogate follows.
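A small sketch of the encoding \(\tau\) and the surrogate \(g\), with thresholds \(k/l\) for \(k=1\dots l\) to match the example above; combined with the BPDA wrapper from section 8.1, gradients can flow through \(g\) on the backward pass.

```python
import torch

def thermometer(x, l=10):
    ks = torch.arange(1, l + 1, dtype=x.dtype) / l   # thresholds k/l, k = 1..l
    return (x.unsqueeze(-1) > ks).to(x.dtype)        # hard encoding, zero gradient

def thermometer_bpda(x, l=10):
    ks = torch.arange(1, l + 1, dtype=x.dtype) / l
    return (x.unsqueeze(-1) - ks).clamp(0, 1)        # g(x)_k = min(max(x - k/l, 0), 1)
```

For example, `thermometer(torch.tensor(0.66))` gives the 1111110000 pattern, while `thermometer_bpda` is piecewise linear and differentiable almost everywhere.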

8.2.2 Input Transformations

The defense applies several input transformations to counter adversarial examples, such as image cropping and rescaling, bit-depth reduction, and JPEG compression.

However, the paper points out that each transformation can be bypassed individually, and an ensemble of these defenses is not stronger than its strongest component. The paper uses EOT and BPDA (without giving many details) to circumvent image cropping and rescaling, JPEG compression, and image quilting, and the attack again performs very well.

8.2.3 Local Intrinsic Dimensionality

LID is a general-purpose metric that characterizes an input through the distribution of distances to its nearest neighbors.

The paper shows that LID fails to detect high-confidence adversarial examples even when the adversary is oblivious to the defense.

8.3 Stochastic Gradients

8.3.1 Stochastic Activation Pruning

SAP randomly drops neurons of each layer to 0, retaining each neuron with probability proportional to its absolute value. Essentially, SAP applies a weighted form of dropout at each layer based on the neurons' activation magnitudes. The retained neurons are then scaled up to preserve accuracy.

Implementing SAP decreases clean classification accuracy slightly while increasing robustness, and different drop-probability levels give similar robustness.

To obtain useful gradients, the paper estimates the gradient with \(\sum_{i=1}^k\nabla_xf(x)\) over \(k=10\) independent draws of the randomness instead of a single \(\nabla_xf(x)\) (see the sketch below). The resulting attack again performs well.
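A sketch of this gradient estimate, assuming `stochastic_model` is a placeholder whose forward pass draws fresh randomness (e.g. SAP's sampling) on every call; the mean is used instead of the sum, which only rescales a sign-based update.

```python
import torch
import torch.nn.functional as F

def averaged_gradient(stochastic_model, x, y, k=10):
    grads = []
    for _ in range(k):
        x_i = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(stochastic_model(x_i), y)   # new randomness each call
        grads.append(torch.autograd.grad(loss, x_i)[0])
    return torch.stack(grads).mean(dim=0)                  # lower-variance estimate
```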

8.3.2 Mitigating Through Randomization

The defense adds a randomization layer in front of the classifier that randomly rescales and zero-pads the input image. It tries to defeat attacks by offering an enormous number of possible randomization patterns.

The paper finds that the ensemble attack considered by the defense's authors overfits to a fixed set of randomizations. Instead, the paper applies EOT and optimizes over the distribution of transformations to bypass the defense.

The attack again performs well.

8.4 Vanishing & Exploding Gradients

8.4.1 Pixel Defend

The defense’s authors argue that adversarial examples mainly lie in the low-probability region of the data distribution. So PixelDefend purifies adversarially perturbed images before the classification by using a greedy decoding procedure to approximate finding the highest probability example within an \(\epsilon\)-ball of the input image.

  1. First, the joint distribution over all pixels is defined as a product of conditional distributions given by a PixelCNN, where \(X=[x_1,x_2,\dots,x_n]\) represents an image.

    \[p_{CNN}(X)=\prod_ip_{CNN}(x_i\vert x_{1:(i-1)})\]

    Each conditional distribution is a multinomial over 256 values produced by a softmax layer, conditioned on the previously generated pixels and RGB channels, so each channel variable \(x_i\) takes one of 256 distinct values. The higher the joint probability \(p_{CNN}(X)\), the better the image fits the training distribution.

  2. The fit of an image to the dataset's distribution is described in bits per dimension, where \(I,J,K\) are the image height, width, and number of channels (see the sketch after this list).

    \(BPD(X)=-\log p_{CNN}(X)/(I\times J\times K\times \log 2)\).

  3. The defense’s authors use hypothesis testing to detect adversarial examples based on distribution.

  4. Perturbed images are purified back toward the training distribution by solving:

    \[\max_{X^*}p_{CNN}(X^*)\\s.t.~\vert\vert X^*-X\vert\vert_{\infty}\le\epsilon_{defend}\]
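A tiny sketch of the bits-per-dimension computation mentioned above, assuming `log_p` is the PixelCNN log-likelihood \(\log p_{CNN}(X)\) in nats; the image size and likelihood value are made up for illustration.

```python
import math

def bits_per_dimension(log_p, height, width, channels):
    # BPD(X) = -log p_CNN(X) / (I * J * K * log 2)
    return -log_p / (height * width * channels * math.log(2))

# e.g. a 32x32x3 image with an (assumed) log-likelihood of -8000 nats:
print(bits_per_dimension(-8000.0, 32, 32, 3))   # ~3.76 bits per dimension
```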

The paper avoids differentiating through this purification step by approximating its gradient with BPDA.

8.4.2 Defense-Gan

Defense-GAN uses a GAN to project samples onto the manifold of the generator before classification.

The BPDA attack performs less well here; Defense-GAN is only partially circumvented.

9. Synthesizing Robust Adversarial Examples

Author-Time-ArXiv: Anish Athalye et al.; ICML 2018; 1707.07397

Key Points:

  • Proposes Expectation Over Transformation (EOT), which constructs a single adversarial example whose effect persists over an entire distribution of transformations.
  • Fabricating the first 3D physical-world adversarial objects to fool classifiers in the real world.

The basic approach of maximizing the probability of the target label under a fixed perturbation is not effective when the angle or viewpoint changes. Consequently, EOT chooses a distribution \(T\) of transformation functions \(t\) and applies them to the input \(x\) as \(t(x)\). The perturbation budget is measured by the expected effective distance, \(\delta=\mathbb{E}_{t\sim T}[d(t(x'),t(x))]\), so EOT constrains the expected visual difference between \(t(x')\) and \(t(x)\). The optimization problem becomes:

\(\mathop{\arg\max}\limits_{x'}\mathbb{E}_{t\sim T}[logP(y_t\vert t(x'))]\), subject to \(\mathbb{E}_{t\sim T}[d(t(x'), t(x))]<\epsilon, x\in[0,1]^d\).

The distribution \(T\) can model perceptual distortions such as rotation, translation, or additive noise. EOT maximizes the objective with stochastic gradient descent. In the 2D case, random transformations take the form \(t(x)=Ax+b\). EOT measures distance with the \(l_2\) norm in LAB color space, a color space in which Euclidean distance is approximately perceptually uniform. The optimization therefore becomes the following, maximized with projected gradient descent and clipped to the set of valid inputs:

\[\mathop{\arg\max}_\limits{x'}\mathbb{E}_{t\sim T}[logP(y_t\vert t(x'))-\lambda\vert\vert LAB(t(x'))-LAB(t(x))\vert\vert_2]\]
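A hedged EOT sketch for a single image (with a batch dimension): the expectation is approximated by sampling transformations, and the LAB-space distance is simplified to an \(l_2\) penalty in pixel space. `model`, `x`, `y_target`, and `sample_transform` (returning one random differentiable transformation \(t\), e.g. a random rotation) are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def eot_attack(model, x, y_target, sample_transform,
               steps=200, lr=0.01, lam=0.1, n_samples=8):
    x_adv = x.clone().detach().requires_grad_(True)
    opt = torch.optim.SGD([x_adv], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = x_adv.new_zeros(())
        for _ in range(n_samples):                       # Monte Carlo over t ~ T
            t = sample_transform()                       # one random transformation
            t_adv, t_clean = t(x_adv), t(x)
            loss = loss - F.log_softmax(model(t_adv), dim=1)[0, y_target] \
                        + lam * (t_adv - t_clean).pow(2).sum().sqrt()
        (loss / n_samples).backward()                    # minimize -logP + lam * d
        opt.step()
        with torch.no_grad():
            x_adv.clamp_(0, 1)                           # project back to valid inputs
    return x_adv.detach()
```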
This post is licensed under CC BY 4.0 by the author.
