# Improving neural networks by preventing co-adaptation of feature detectors

## Presented by

Stan Lee, Seokho Lim, Kyle Jung, Dae Hyun Kim

# Introduction

In this paper, Hinton et al. introduce a novel way to improve the performance of neural networks. During training, each hidden neuron is omitted with a probability of 0.5, so a hidden unit cannot rely on other hidden units being present. Hence there are fewer co-adaptations among them on the training data. Called “dropout,” this process is also an efficient alternative to training many separate networks and averaging their predictions on the test set. They used standard stochastic gradient descent and split the training data into mini-batches. Instead of penalizing the L2 norm of the whole weight vector, they set an upper bound on the L2 norm of the incoming weight vector of each hidden neuron, and the vector was renormalized whenever its norm exceeded the bound. They found that using a constraint, rather than a penalty, forced the model to do a more thorough search of the weight space when coupled with a very large learning rate that decays during training. At test time, their dropout models used all of the hidden neurons, with the outgoing weights halved to account for the chance of omission during training. The models achieved lower test error rates on several datasets: MNIST; TIMIT; Reuters Corpus Volume I; CIFAR-10; and ImageNet.
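The link between random omission during training and halved weights at test time can be checked numerically. The following is a minimal sketch rather than the authors' implementation: it assumes a single linear output unit that sums the hidden activations, and the activations, weights, and function names are made up for illustration. In that simple case the average training-time output over many random masks matches the test-time output with halved outgoing weights; with nonlinear units on top, the halving is only an approximation to averaging the sampled networks.

```python
import random

def dropout_mask(n_units, drop_prob, rng):
    """Return a 0/1 mask; each unit is dropped independently with drop_prob."""
    return [0.0 if rng.random() < drop_prob else 1.0 for _ in range(n_units)]

def output_train(hidden, out_weights, mask):
    """Training pass: dropped hidden units contribute nothing to the output."""
    return sum(h * m * w for h, m, w in zip(hidden, mask, out_weights))

def output_test(hidden, out_weights, drop_prob):
    """Test pass: keep every unit and scale outgoing weights by (1 - drop_prob)."""
    return sum(h * (1.0 - drop_prob) * w for h, w in zip(hidden, out_weights))

rng = random.Random(0)
hidden = [0.2, 0.9, 0.5, 0.7]         # hypothetical hidden activations
out_weights = [1.0, -2.0, 0.5, 3.0]   # hypothetical outgoing weights

# Average the training-time output over many random dropout masks.
samples = [output_train(hidden, out_weights, dropout_mask(4, 0.5, rng))
           for _ in range(20000)]
mean_train = sum(samples) / len(samples)
test_value = output_test(hidden, out_weights, 0.5)

print(f"mean over masks: {mean_train:.3f}, halved weights: {test_value:.3f}")
```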

# MNIST

The MNIST dataset contains 70,000 images of handwritten digits, each of size 28 x 28. To see the impact of dropout, they trained four neural network architectures (784-800-800-10, 784-1200-1200-10, 784-2000-2000-10, 784-1200-1200-1200-10), using dropout rates of 50% for hidden neurons and 20% for visible neurons. Stochastic gradient descent was used with mini-batches of size 100 and a cross-entropy loss function. Weights were updated after each mini-batch, and training ran for 3000 epochs. An exponentially decaying learning rate [math]\epsilon[/math] was used: its initial value was 10.0, and it was multiplied by 0.998 at the end of each epoch. The incoming weight vector of each hidden neuron was constrained to an upper bound [math]l[/math] on its length, and cross-validation showed that the results were best when [math]l[/math] = 15. Initial weights were drawn from a normal distribution with mean 0 and standard deviation 0.01. To accelerate learning, the weight update used an additional variable, [math]p[/math], called momentum. The initial value of [math]p[/math] was 0.5; it increased linearly to its final value of 0.99 over the first 500 epochs and remained unchanged afterwards. Also, when updating weights, the learning rate was multiplied by [math]1 - p[/math]. Here [math]L[/math] denotes the loss function, whose gradient with respect to the weights drives each update.
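The schedules described above can be sketched in a few lines. This is an illustrative reading of the summary, not the authors' code: the function names are hypothetical, the update form (previous step scaled by [math]p[/math], gradient scaled by [math](1 - p)\epsilon[/math]) is an assumption consistent with the description, and [math]l[/math] is treated as a bound on the norm of the incoming weight vector, as stated in the text.

```python
import math

EPS0, DECAY = 10.0, 0.998                 # initial learning rate, per-epoch decay
P_INIT, P_FINAL, RAMP = 0.5, 0.99, 500    # momentum schedule from the text
MAX_NORM = 15.0                           # bound l on the incoming weight vector

def learning_rate(epoch):
    """Exponentially decaying learning rate, multiplied by DECAY each epoch."""
    return EPS0 * DECAY ** epoch

def momentum(epoch):
    """Linear ramp from P_INIT to P_FINAL over the first RAMP epochs."""
    if epoch >= RAMP:
        return P_FINAL
    frac = epoch / RAMP
    return (1.0 - frac) * P_INIT + frac * P_FINAL

def clip_norm(w, max_norm=MAX_NORM):
    """Rescale w so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(v * v for v in w))
    if norm <= max_norm:
        return list(w)
    return [v * max_norm / norm for v in w]

def update(w, delta, grad, epoch):
    """One update of a hidden unit's incoming weights (assumed update form)."""
    p, eps = momentum(epoch), learning_rate(epoch)
    delta = [p * d - (1.0 - p) * eps * g for d, g in zip(delta, grad)]
    w = clip_norm([wi + di for wi, di in zip(w, delta)])
    return w, delta

w, delta = update([0.0, 0.0], [0.0, 0.0], grad=[1.0, -1.0], epoch=0)
print(w, delta)
```

Because the constraint is applied after every update, a very large learning rate can move the weights far in one step without letting them grow without bound.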

The best published result for a standard feedforward neural network was 160 errors on the test set, and dropout reduced it to about 130 errors. Additionally omitting a random 20% of the input pixels reduced it further, to 110 errors. The following figure visualizes the result.

A publicly available pre-trained deep belief net got 118 errors, which dropped to 92 errors when the model was fine-tuned with dropout. Another publicly available model was a deep Boltzmann machine. Over five runs, the unrolled network got 103, 97, 94, 93 and 88 errors when fine-tuned with standard backpropagation, and 83, 79, 78, 78 and 77 errors when fine-tuned with dropout. The mean of 79 errors was a record for models that do not use prior knowledge or enhanced training sets.

# TIMIT

The TIMIT dataset consists of recordings of 630 speakers of 8 dialects of American English, each reading 10 phonetically rich sentences, and is a standard benchmark for evaluating automatic speech recognition systems. The objective is to convert a given speech signal into a transcription sequence of phones. Hidden Markov Models (HMMs) are typically used to deal with temporal variability, and an acoustic model determines how well the coefficients extracted from the input fit each HMM state. Recent results show that feedforward neural networks that map a window of acoustic input to a probability distribution over HMM states perform better than traditional Gaussian mixture models on speech recognition datasets including TIMIT.

A neural network was trained on TIMIT and evaluated by its classification error rate on the test set. The network had four fully connected hidden layers with 4000 neurons per layer. The output layer consisted of 185 softmax output neurons, which were subsequently merged into the 39 distinct classes used for the benchmark. The input was a window of 21 adjacent frames, with an advance of 10 ms per frame. The results show that, across various network architectures, applying 50% dropout to the hidden units gave better classification performance than the same networks without dropout. The decoder, which knows the transition probabilities between HMM states, runs the Viterbi algorithm on the per-frame class probabilities output by the neural network to predict the single best sequence of HMM states.
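The decoding step can be illustrated with a small Viterbi sketch. It is a generic textbook version under simplifying assumptions, not the decoder used in the paper: the per-frame network outputs are used directly as emission scores, and the two-state model, its probabilities, and the function name are invented for the example.

```python
import math

def viterbi(log_emit, log_trans, log_start):
    """Return the most probable state sequence.

    log_emit[t][s]  : log score of state s at frame t (e.g. network output)
    log_trans[r][s] : log transition probability from state r to state s
    log_start[s]    : log probability of starting in state s
    """
    n_states = len(log_start)
    score = [log_start[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def to_log(table):
    """Convert a nested list of probabilities to log probabilities."""
    return [[math.log(p) for p in row] for row in table]

frame_probs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]  # hypothetical frame outputs
transitions = [[0.7, 0.3], [0.1, 0.9]]              # hypothetical HMM transitions
start = [math.log(0.5), math.log(0.5)]

path = viterbi(to_log(frame_probs), to_log(transitions), start)
print(path)  # [0, 1, 1]
```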

### Pre-training

A deep belief network was used to pretrain the neural network. Since the inputs are real-valued, a Gaussian RBM was used to pretrain the first layer. The visible biases were initialized to zero, and the weights were sampled from a normal distribution [math]N(0, 0.01)[/math]. Each visible neuron’s variance was set to 1.0 and remained unchanged during training. Learning was done by minimizing Contrastive Divergence (CD). Momentum was used to speed up learning; it was initially set to 0.5 and increased linearly to 0.9 over 20 epochs. A learning rate of 0.001 was applied to the average gradient and then multiplied by [math](1-\text{momentum})[/math], and the L2 weight decay was set to 0.001. With these hyperparameters, the model was trained for 100 epochs. Binary RBMs were used for all subsequent layers, with a learning rate of 0.01. For these layers, [math]p[/math] was set to the mean activation of a neuron in the data set, and the visible bias of each neuron was initialized to [math]\log(p/(1 - p))[/math]. Each layer was trained for 50 epochs, and all remaining hyperparameters were the same as those for the Gaussian RBM.
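The bias initialization [math]\log(p/(1 - p))[/math] is the inverse of the logistic function, so a unit that receives no other input starts out active with probability equal to its mean activation in the data. A minimal sketch of this, where the function names and the clamping constant `eps` (added here only to avoid taking the log of zero) are assumptions rather than details from the paper:

```python
import math

def visible_bias_init(mean_activation, eps=1e-6):
    """Log-odds of the mean activation p, i.e. log(p / (1 - p))."""
    p = min(max(mean_activation, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Logistic function, the activation probability of a binary unit."""
    return 1.0 / (1.0 + math.exp(-x))

for p in (0.05, 0.2, 0.5):
    bias = visible_bias_init(p)
    # With zero weights, the unit turns on with probability sigmoid(bias) = p.
    print(f"p={p:.2f}  bias={bias:+.3f}  sigmoid(bias)={sigmoid(bias):.2f}")
```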

### Dropout tuning

The neural network was initialized with the weights of the pretrained RBMs. To fine-tune the network with dropout backpropagation, momentum was initially set to 0.5 and increased linearly to 0.9 over 10 epochs. A small constant learning rate of 1.0 was applied to the average gradient on a mini-batch. All other hyperparameters were the same as those used for MNIST dropout fine-tuning. The model required approximately 200 epochs to converge. For comparison, they also fine-tuned the same network with standard backpropagation, using a learning rate of 0.1 and otherwise the same hyperparameters.

Compared with standard backpropagation on several network architectures and input representations, dropout consistently achieved lower error and cross-entropy. It significantly controlled overfitting, which makes the method robust to the choice of network architecture. It also allowed much larger nets to be trained and removed the need for early stopping. Neural network architectures trained with dropout were also not very sensitive to the choice of learning rate and momentum.

# Reuters Corpus Volume

Reuters Corpus Volume I archives 804,414 news documents that belong to 103 topics. Under four major themes (corporate/industrial, economics, government/social, and markets), the documents belonged to 63 classes. After removing 11 classes with no data, one class with insufficient data, and one class that covered a very large share of the examples, they were left with 50 classes and 402,738 documents. The documents were split equally and randomly into training and test sets, and each document was represented by the 2000 most frequent words in the dataset, excluding stopwords.

They trained two neural networks of size 2000-2000-1000-50, one using dropout backpropagation and the other using standard backpropagation. The training hyperparameters were the same as those for MNIST, but training ran for 500 epochs.

The following figure shows that the model with dropout achieves a significantly lower test set error. On the right side, we see that learning with dropout also proceeds more smoothly.

# CNN

Feed-forward neural networks consist of several layers of neurons, where each neuron in a layer applies a linear filter to its input and passes the result on to the neurons in the next layer. To compute a neuron’s output, a scalar bias is added to the filter response and a nonlinear activation function is applied; the filter coefficients (weights) and biases are the parameters of the network, learned from training data. Convolutional neural networks (CNNs) differ from ordinary neural networks in several ways. First, a CNN’s neurons are organized topographically into a bank laid out on a 2D grid, which reflects the organization of the dimensions of the input data. Second, neurons in a CNN apply filters that are local and centered at the neuron’s location in the topographic organization. This reflects the fact that useful clues to the identity of the object in an input image can be found by examining local neighborhoods of the image. Third, all neurons in a bank apply the same filter at different locations in the input image. In the image example, green is the input to one neuron bank, yellow is the filter, and pink is the output of the neuron bank (the convolved feature). A bank of neurons in a CNN thus applies a convolution operation to its input, and a single layer in a CNN typically has multiple banks of neurons, each performing a convolution with a different filter. The resulting neuron banks become distinct input channels into the next layer. Restricting the filters to be local and shared reduces the net’s representational capacity, but it also reduces its capacity to overfit.

### Pooling

A pooling layer summarizes the activities of local patches of neurons in a convolutional layer by subsampling its output. Pooling is useful for extracting dominant features and, through dimensionality reduction, for decreasing the computation required to process the data. The procedure is as follows: pooling units are laid out topographically, and each is connected to a local neighborhood of convolutional unit outputs from the same bank. Each pooling unit then computes some function of its inputs, such as the maximum or the average. Max pooling returns the maximum value from the section of the image covered by the pooling unit, while average pooling returns the average of all the values inside the pooling unit (see example). As a result, there are fewer pooling units than convolutional unit outputs from the previous layer, because the pooling units are spaced further apart. Using the max-pooling function reduces the effect of outliers and improves generalization.
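As a concrete illustration of the two pooling functions, the sketch below pools a small made-up 4 x 4 feature map with 2 x 2 pooling units and a spacing (stride) of 2. The function names and the input values are hypothetical; the sizes are chosen only to keep the arithmetic easy to follow.

```python
def pool2d(grid, size, stride, reduce_fn):
    """Apply reduce_fn to each size x size patch, moving by stride."""
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            patch = [grid[r + dr][c + dc]
                     for dr in range(size) for dc in range(size)]
            pooled_row.append(reduce_fn(patch))
        pooled.append(pooled_row)
    return pooled

def mean(values):
    return sum(values) / len(values)

feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]

max_pooled = pool2d(feature_map, size=2, stride=2, reduce_fn=max)
avg_pooled = pool2d(feature_map, size=2, stride=2, reduce_fn=mean)
print(max_pooled)  # [[6, 2], [2, 9]]
print(avg_pooled)  # [[3.5, 1.0], [1.0, 6.0]]
```

The 16 convolutional outputs are reduced to 4 pooling units, which shows why a pooling layer has fewer units than the layer it summarizes.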

### Local Response Normalization

This network includes local response normalization layers, which are implemented in lateral form and are used on neurons with unbounded activations; they permit the detection of high-frequency features with a big neuron response. This regularizer encourages competition among neurons belonging to different banks. Normalization is done by dividing the activity [math]a^i_{x,y}[/math] of a neuron in bank [math]i[/math] at position [math](x,y)[/math] by the expression:

$$ \left( 1 + \alpha \sum_{j=i-N/2}^{i+N/2} \left( a^j_{x,y} \right)^2 \right)^{\beta} $$

where the sum runs over [math]N[/math] ‘adjacent’ banks of neurons at the same position in the topographic organization of the neuron bank. The constants [math]N[/math], [math]\alpha[/math] and [math]\beta[/math] are hyper-parameters whose values are determined using a validation set. This technique has since been replaced by better techniques, such as the combination of dropout and regularization methods ([math]L1[/math] and [math]L2[/math]).

### Neuron nonlinearities

All of the neurons in this model use the max-with-zero nonlinearity, where a neuron’s output is computed as [math] a^{i}_{x,y} = \max(0, z^i_{x,y})[/math] and [math] z^i_{x,y} [/math] is the total input to the neuron. They use this nonlinearity because it has several advantages over traditional saturating neuron models, such as a significant reduction in the training time required to reach a given error rate. Another advantage is that it reduces the need for contrast normalization and similar data pre-processing: since the neurons do not saturate, their activities simply scale up a little when presented with unusually large input values. The model’s only pre-processing step is to subtract the mean activity from each pixel, so that the data is centered.

### Objective function

Their network is trained to maximize the multinomial logistic regression objective, which is the same as minimizing the average cross-entropy across training cases between the true label distribution and the model’s predicted label distribution.
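The equivalence between the two formulations can be verified on a toy batch. In this sketch the logits, labels, and function names are made up: the mean log-probability of the correct class (the quantity being maximized) is exactly the negative of the average cross-entropy between one-hot labels and the softmax output.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_log_prob(batch_logits, labels):
    """Multinomial logistic regression objective: mean log p(correct class)."""
    return sum(math.log(softmax(z)[y])
               for z, y in zip(batch_logits, labels)) / len(labels)

def mean_cross_entropy(batch_logits, labels):
    """Average cross-entropy between one-hot labels and predicted distribution."""
    total = 0.0
    for z, y in zip(batch_logits, labels):
        probs = softmax(z)
        one_hot = [1.0 if k == y else 0.0 for k in range(len(z))]
        total += -sum(t * math.log(p) for t, p in zip(one_hot, probs))
    return total / len(labels)

batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [-0.5, 1.5, 0.0]]  # toy data
labels = [0, 2, 1]

objective = mean_log_prob(batch_logits, labels)
loss = mean_cross_entropy(batch_logits, labels)
print(f"objective = {objective:.4f}, cross-entropy = {loss:.4f}")
```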

### Weight Initialization

It’s important to note that if a neuron always receives a negative total input during training, it will not learn, because its output is uniformly zero under the max-with-zero nonlinearity. Hence, the weights in their model were sampled from a zero-mean normal distribution with a high enough variance. A high variance gives a certain number of neurons positive inputs so that learning can happen, and in practice it’s necessary to try out several candidate variances until a working initialization is found. In their experiments, setting the biases of the neurons in the hidden layers to a positive constant (1 in their case) helped in finding one.

### Training

The model is trained using stochastic gradient descent with a batch size of 128 samples and a momentum of 0.9. The update rule for weight [math]w[/math] is

$$ v_{i+1} = 0.9v_i + \epsilon \left\langle \frac{\partial E}{\partial w_i} \right\rangle_i $$

$$ w_{i+1} = w_i + v_{i+1} $$

where [math]i[/math] is the iteration index, [math]v[/math] is a momentum variable, [math]\epsilon[/math] is the learning rate and [math]\left\langle \frac{\partial E}{\partial w_i} \right\rangle_i[/math] is the average over the [math]i[/math]th batch of the derivative of the objective with respect to [math]w_i[/math]. The gradient term is added rather than subtracted because the objective [math]E[/math] is maximized. The whole training process takes roughly 90 minutes on CIFAR-10; on ImageNet it takes four days with dropout and two days without.
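The momentum update can be tried on a toy problem. The sketch below is illustrative only: it replaces the batch-averaged network gradient with the derivative of a made-up scalar objective [math]E(w) = -(w - 3)^2[/math], which is maximized at [math]w = 3[/math], and the function name and learning rate are assumptions.

```python
def momentum_step(w, v, grad, lr, mu=0.9):
    """One update: v <- mu * v + lr * grad, then w <- w + v.

    The gradient is added because the objective is being maximized.
    """
    v_next = mu * v + lr * grad
    return w + v_next, v_next

def objective_grad(w):
    """dE/dw for the toy objective E(w) = -(w - 3)^2."""
    return -2.0 * (w - 3.0)

w, v = 0.0, 0.0
for _ in range(500):
    w, v = momentum_step(w, v, objective_grad(w), lr=0.01)

print(f"w after 500 iterations: {w:.6f}")  # close to the maximizer w = 3
```

In the actual training setup, `grad` would be the derivative of the objective averaged over the 128 cases in the current batch.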

### Learning

To determine the learning rate for the network, they use an equal learning rate for each layer, chosen heuristically as the largest power of ten that produces a reduction in the objective function. Usually, it is on the order of [math]10^{-2}[/math] or [math]10^{-3}[/math]. They then reduce the learning rate twice by a factor of ten before terminating training.

# CIFAR-10

### CIFAR-10 Dataset

CIFAR-10 is a popular object recognition dataset of 32 x 32 color images collected from the web. It contains 10 classes, and the images are labeled with the noun used to search for the image. For each of the 10 classes, it has 5000 training images and 1000 test images, each showing a single dominant object matching the label name.

### Models for CIFAR-10

They implemented two different models for CIFAR-10, one with dropout and the other without. Both models are CNNs with three convolutional layers, each followed by a pooling layer. The pooling layer that follows the first convolutional layer performs max pooling, and the remaining two pooling layers perform average pooling. The first and second pooling layers are followed by response normalization layers with [math]N = 9, α = 0.001[/math], and [math]β = 0.75[/math]. A ten-unit softmax layer, which outputs a probability distribution over class labels, is connected to the uppermost pooling layer. All convolutional layers have 64 filter banks and use a filter size of 5×5.

Additional changes were made to the model with dropout. Because dropout imposes strong regularization on the network, this model can use more parameters, so a fourth weight layer is added that takes its input from the third pooling layer. This fourth weight layer is locally connected but not convolutional, and it contains 16 banks of filters of size 3 × 3. It is the layer in which 50% dropout is used. Lastly, the softmax layer takes its input from this fourth weight layer.

Thus, a neural network with three convolutional hidden layers interleaved with three max-pooling layers achieved a classification error of 16.6%, beating the best published error rate of 18.5% obtained without using transformed data. Adding one locally connected layer after these six layers, with dropout in this last hidden layer, reduced the error rate to 15.6%.

# ImageNet

### ImageNet Dataset

ImageNet is a dataset of millions of high-resolution labeled images in thousands of categories, which were collected from the web and labeled by human labelers using MTurk (Amazon’s Mechanical Turk crowd-sourcing tool). It is very difficult to achieve perfect accuracy on this dataset, even for humans, because ImageNet images may contain multiple objects and there are a large number of object classes. ImageNet and CIFAR-10 are very similar, but the scale of ImageNet is about 20 times bigger (1,300,000 vs 60,000). The data is split into about 1.3 million training images, 50,000 validation images, and 150,000 testing images. They used images resized to 256 x 256 pixels for their experiments.

**An ambiguous example to classify:**

When this paper was written, the best score on this dataset was an error rate of 45.7%, achieved by “High-dimensional signature compression for large-scale image classification” (J. Sanchez, F. Perronnin, CVPR 2011). The authors of this paper achieved a comparable error rate of 48.6% using a single neural network with five convolutional hidden layers interleaved with max-pooling layers, followed by two globally connected layers and a final 1000-way softmax layer. Using 50% dropout in the 6th hidden layer reduced the error rate to 42.4%.

### Models for ImageNet

The model for ImageNet with dropout is described here (the one without dropout used a similar approach, but had a serious issue with overfitting). They used a convolutional neural network trained on 224×224 patches randomly extracted from the 256 × 256 images. As a form of data augmentation, this reduces the network’s capacity to overfit the training data and helps generalization. At test time, the net’s predictions were averaged over ten 224 × 224 patches of the 256 × 256 input image (the center patch, the four corner patches, and their horizontal reflections).
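The ten test-time patches can be enumerated as follows. This is a sketch under the assumption that images are stored as nested lists indexed `[row][column]`; the helper names are hypothetical, and the prediction vectors in the example are invented stand-ins for the network's softmax outputs.

```python
def ten_crop_boxes(image_size=256, crop=224):
    """Return (top, left, flip) for the 4 corner crops, the center crop,
    and the horizontal reflection of each."""
    edge = image_size - crop   # largest valid offset (32 for 256 -> 224)
    mid = edge // 2            # offset of the center crop (16)
    offsets = [(0, 0), (0, edge), (edge, 0), (edge, edge), (mid, mid)]
    return [(top, left, flip)
            for flip in (False, True) for top, left in offsets]

def crop_patch(image, top, left, crop, flip):
    """Cut a crop x crop patch and optionally mirror it left to right."""
    rows = [row[left:left + crop] for row in image[top:top + crop]]
    return [row[::-1] for row in rows] if flip else rows

def average_predictions(predictions):
    """Average class-probability vectors element by element."""
    n = len(predictions)
    return [sum(column) / n for column in zip(*predictions)]

boxes = ten_crop_boxes()
print(len(boxes), boxes[4])  # 10 (16, 16, False)

# Tiny demonstration on a 3 x 3 "image" with a flipped 2 x 2 crop.
tiny = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(crop_patch(tiny, top=1, left=1, crop=2, flip=True))  # [[5, 4], [8, 7]]

# Averaging two made-up prediction vectors.
print(average_predictions([[0.2, 0.8], [0.4, 0.6]]))
```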

This complicated network architecture was chosen to maximize performance on the validation set, and dropout was found to be very effective. In particular, using non-convolutional higher layers with a large number of parameters worked well with dropout, but it had a negative impact on performance without dropout.

For the speech recognition dataset (TIMIT) and the object recognition datasets (CIFAR-10 and ImageNet), a large number of decisions had to be made when designing the architecture of the net. These decisions were made by holding out a separate validation set, which was used to evaluate the performance of a large number of different architectures. The architecture that performed best with dropout on the validation set was then used to assess the performance of dropout on the real test set.

# Conclusion

The authors have shown a consistent improvement by the models trained with dropout in classifying objects in the following datasets: MNIST; TIMIT; Reuters Corpus Volume I; CIFAR-10; and ImageNet.