
Dropouts

In the previous segment, you learnt about two basic regularization techniques, the L1 norm and the L2 norm. In this segment, you will learn about another popular regularization technique used specifically for neural networks, called dropouts. Let’s watch the video to learn more about this technique.

To summarise, the dropout operation is performed by multiplying the weight matrix W^l with a mask vector α, as shown below:

$$W^l \cdot \alpha$$

For example, let’s consider the following weight matrix of the first layer.

$$W^1 = \begin{bmatrix} w^1_{11} & w^1_{12} & w^1_{13} \\ w^1_{21} & w^1_{22} & w^1_{23} \\ w^1_{31} & w^1_{32} & w^1_{33} \\ w^1_{41} & w^1_{42} & w^1_{43} \end{bmatrix}$$


Then, the shape of the vector α will be (3, 1). Now, if the value of q (the probability of 0) is 0.66, the α vector will have two 0s and one 1. Hence, the α vector can be any of the following three:

$$\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$$


One of these vectors is then chosen randomly in each mini-batch. Let’s say that, in some mini-batch, the first mask shown above, α = [0, 0, 1]ᵀ, is chosen. Hence, the new (regularised) weight matrix will be:

$$\begin{bmatrix} w^1_{11} & w^1_{12} & w^1_{13} \\ w^1_{21} & w^1_{22} & w^1_{23} \\ w^1_{31} & w^1_{32} & w^1_{33} \\ w^1_{41} & w^1_{42} & w^1_{43} \end{bmatrix} \cdot \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 & w^1_{13} \\ 0 & 0 & w^1_{23} \\ 0 & 0 & w^1_{33} \\ 0 & 0 & w^1_{43} \end{bmatrix}$$

Here, each column of W^1 is multiplied by the corresponding entry of α (the mask is broadcast across the rows). You can see that all the elements in the first and second columns become zero.
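
To make the column masking concrete, here is a minimal NumPy sketch of the operation above (the numeric weight values and variable names are illustrative assumptions):

import numpy as np

# stand-in for the 4 x 3 weight matrix W1 (actual values are illustrative)
W1 = np.arange(1.0, 13.0).reshape(4, 3)

# mask with q = 0.66 probability of zeros: two 0s and one 1
alpha = np.array([0.0, 0.0, 1.0])

# broadcasting the mask across the rows zeroes the first and second columns
W1_dropped = W1 * alpha
print(W1_dropped)   # only the third column survives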

Some important points to note regarding dropouts are:

  • Dropouts can be applied to only some layers of the network (in fact, this is a common practice; you choose the layers to which you apply dropouts somewhat arbitrarily).
  • The mask α is generated independently for each layer during feedforward, and the same mask is used in backpropagation.
  • The mask changes with each mini-batch/iteration; it is randomly generated in each iteration (sampled from a Bernoulli distribution with some p(0) = q). A short sampling sketch follows this list.
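
To illustrate the last two points, here is a rough NumPy sketch of how a fresh Bernoulli mask could be sampled for each layer in every iteration (the layer widths and the value of q are assumptions made only for this example):

import numpy as np

q = 0.66                    # probability of a 0 in the mask
layer_widths = [3, 5, 4]    # illustrative: number of columns to mask in each layer

for iteration in range(2):          # e.g. two mini-batches
    for width in layer_widths:
        # each entry is 1 with probability (1 - q) and 0 with probability q
        alpha = np.random.binomial(n=1, p=1 - q, size=width)
        print(iteration, alpha)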

Dropouts help in symmetry breaking. There is every possibility that neurons form communities, which restricts them from learning independently. Hence, by setting a random set of the weights to zero in every iteration, this community/symmetry can be broken.


Note: A different mini-batch is processed in every iteration in an epoch, and dropouts are applied to each mini-batch.



Notice that after applying the mask α, one of the columns of the weight matrix is set to zero. If the jth column is set to zero, it is equivalent to setting the contribution of the jth neuron in the previous layer to zero. In other words, you cut off one neuron from the previous layer.
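
This equivalence can be checked numerically. The following NumPy sketch (with illustrative weights and activations) compares masking a column of W^1 with zeroing the corresponding activation of the previous layer:

import numpy as np

W1 = np.arange(1.0, 13.0).reshape(4, 3)      # 4 x 3 weight matrix (illustrative values)
a0 = np.array([0.5, -1.0, 2.0])              # activations of the previous layer (illustrative)
alpha = np.array([0.0, 0.0, 1.0])            # mask that keeps only the third column/neuron

masked_weight_output = (W1 * alpha) @ a0     # zero out columns of W1, then multiply
dropped_neuron_output = W1 @ (alpha * a0)    # equivalently, zero out the neuron's activation
print(np.allclose(masked_weight_output, dropped_neuron_output))   # True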


There are other ways to create the mask. One of them is to create a matrix that has ‘q’ percentage of the elements set to 0 and the rest set to 1. You can then multiply this matrix element-wise with the weight matrix to get the final weight matrix. Hence, for the weight matrix

$$W^1 = \begin{bmatrix} w^1_{11} & w^1_{12} & w^1_{13} \\ w^1_{21} & w^1_{22} & w^1_{23} \\ w^1_{31} & w^1_{32} & w^1_{33} \\ w^1_{41} & w^1_{42} & w^1_{43} \end{bmatrix},$$

the mask matrix for ‘q’ = 0.66 can be

$$\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

Multiplying the above matrices element-wise, we get

$$\begin{bmatrix} 0 & w^1_{12} & 0 \\ 0 & 0 & w^1_{23} \\ w^1_{31} & 0 & 0 \\ 0 & 0 & w^1_{43} \end{bmatrix}.$$
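
A minimal NumPy sketch of this element-wise variant, using the mask shown above (the weight values are illustrative):

import numpy as np

W1 = np.arange(1.0, 13.0).reshape(4, 3)      # stand-in for the 4 x 3 weight matrix

# element-wise mask with roughly 66% of the entries set to 0
mask = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0]])

W1_dropped = W1 * mask     # element-wise (Hadamard) product
print(W1_dropped)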

Well again, you need not worry about how to implement dropouts, since you just need to write one simple line of code to add dropout in Keras:

# dropping out 20% neurons in a layer in Keras
model.add(Dropout(0.2))

Please note that ‘0.2’ here is the probability of zeros. This is also one of the hyperparameters. Also, note that you do not apply dropout to the output layer.
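
For context, here is a minimal sketch of where such Dropout layers typically sit in a Keras model; the layer sizes, activations and input shape are illustrative assumptions, not prescribed by this segment. Note that no dropout is added after the output layer.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(784,)))   # illustrative hidden layer
model.add(Dropout(0.2))    # drop 20% of this layer's outputs during training
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))    # the dropout rate is a hyperparameter you can tune
model.add(Dense(10, activation='softmax'))                     # output layer: no dropout here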

So far, you have learnt about two types of regularization strategies for neural networks. Next, you will learn about another technique known as batch normalization.
