
Improving neural networks by preventing co-adaptation of feature detectors
G. E. Hinton∗ , N. Srivastava, A. Krizhevsky, I. Sutskever and R. R. Salakhutdinov

This paper deals with the overfitting of a neural network when the relationship between the input and output is complex and the training data is limited. In this scenario the feature detectors in the hidden layers become co-adapted: they are tuned collectively to perform well on the training data and poorly on test data. The proposed remedy is to randomly drop out hidden units during training, which amounts to sampling from a very large set of different networks whose predictions are averaged to obtain a better result on test data; because all of these networks share weights, the averaging can be approximated cheaply.
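
As a rough sketch (not the authors' code), dropout at training time can be written as a random binary mask on the hidden activations. The layer sizes, the ReLU nonlinearity, and the function name below are illustrative assumptions; the 50% drop probability is the value used in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def hidden_layer_with_dropout(x, W, b, p_drop=0.5, train=True):
        """One hidden layer with dropout applied to its units.

        During training each hidden unit is dropped (set to zero) independently
        with probability p_drop (0.5 in the paper), so a different thinned
        network is sampled for every training case.
        """
        h = np.maximum(0.0, x @ W + b)              # ReLU chosen here as an example
        if train:
            keep = rng.random(h.shape) >= p_drop    # 1 = keep the unit, 0 = drop it
            h = h * keep
        return h

    # toy usage (sizes are arbitrary): a batch of 4 inputs with 8 hidden units
    x = rng.standard_normal((4, 3))
    W = 0.1 * rng.standard_normal((3, 8))
    b = np.zeros(8)
    h_train = hidden_layer_with_dropout(x, W, b, train=True)
    h_test = hidden_layer_with_dropout(x, W, b, train=False)  # no units dropped
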
Training uses conventional stochastic gradient descent; the novelty lies in imposing an upper-bound constraint on the L2 norm of each hidden unit's incoming weight vector, rather than penalizing the L2 norm of the whole weight vector.
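
A minimal sketch of such an upper-bound (max-norm) constraint is shown below, assuming one column of W per hidden unit and an arbitrary bound c = 3.0; both are illustrative choices rather than values taken from the paper.

    import numpy as np

    def apply_max_norm(W, c=3.0):
        """After a gradient step, rescale any hidden unit whose incoming weight
        vector (stored here as a column of W) has L2 norm larger than c so that
        it lies back on the ball of radius c; c = 3.0 is only a placeholder."""
        norms = np.linalg.norm(W, axis=0, keepdims=True)       # one norm per unit
        scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))  # shrink only if too large
        return W * scale
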
At test time, the outgoing weights of the hidden units are halved, because twice as many hidden units are active as during training. For a network with a single hidden layer and a softmax output, this is exactly equivalent to taking the geometric mean of the probability distributions over labels predicted by all possible dropout networks. The resulting so-called "mean network" assigns a higher log-probability to the correct label than the average of the log-probabilities of the individual dropout networks.
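
To spell out this equivalence (with notation chosen here rather than taken from the paper, and biases omitted): let h be the hidden activations of a single hidden layer with N units, w_y the outgoing weights for label y, and m a binary dropout mask. Since each unit is kept in exactly half of the 2^N masks, the normalized geometric mean collapses to a softmax with halved weights:

\[
P_m(y \mid h) = \frac{\exp\!\big(w_y^{\top}(m \odot h)\big)}{\sum_{y'} \exp\!\big(w_{y'}^{\top}(m \odot h)\big)},
\qquad
\Big(\prod_{m \in \{0,1\}^N} P_m(y \mid h)\Big)^{1/2^N}
\;\propto\; \exp\!\Big(\tfrac{1}{2^N}\sum_{m} w_y^{\top}(m \odot h)\Big)
\;=\; \exp\!\big(\tfrac{1}{2}\, w_y^{\top} h\big).
\]

Renormalizing over y therefore reproduces exactly the prediction of the mean network with its outgoing weights divided by two.
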
The authors report that the added randomness of the dropout network improves the error rate by at least 31% over a standalone standard feedforward neural network on the MNIST dataset. Similar gains are observed on other datasets: a 20% improvement in classification on TIMIT, only 15.6% error on CIFAR-10, a record-breaking 42.4% error rate on ImageNet, and a drop from 31.05% to 29.62% error on the Reuters dataset. All of these analyses use a 50% dropout probability, since, according to the authors, more extreme probabilities worsen the results; moreover, 50% dropout gives equal importance to all of the sampled networks. Separate dropout probabilities for each hidden layer are also possible and can improve performance further, but they demand more resources.
The natural competitor of this approach is Bayesian model averaging. A practical alternative to Bayesian averaging is 'bagging', and one can think of dropout as an extreme form of bagging in which each model is trained on a single case and all of the models share their parameters.
Though the proposed solution requires more insight into the problems specified in this paper, the work can be seen as an additional step toward solving the overfitting problem in machine learning. The effectiveness of the scheme reported in the article is striking; however, an ablation study is needed to inspect the impact of different dropout probabilities, including per-layer dropout, on performance. In addition, an analysis of the trade-off between the extra cost of implementing the scheme and the resulting accuracy is missing from the text. The behavior and impact of the dropout regularizer in other architectures such as CNNs is also not examined; it would be worth studying the effect of dropout in convolutional layers.
Q1. What is the relation between dropout rate and the learning rate?

Q2. What happens if one uses dropout only for some layers instead of all layers and vice versa?
Visualizing and Understanding Convolutional Networks
Matthew D. Zeiler and Rob Fergus

This manuscript discusses a technique for gaining deep insight into the working of a convolutional neural network. Although the availability of larger datasets, powerful computing machinery, and effective regularizers makes convolutional networks perform better than other methods in many contexts, a large part of the internal operation of these networks is still unknown. This work is a step toward looking inside the network's operation; it also makes it possible to observe how the features evolve during training and to modify the architecture if necessary to improve it further. The visualization is carried out with a multi-layered deconvolutional network. The authors also perform a sensitivity analysis to figure out which parts of an image are significant for classification. Visualizing the higher layers is difficult because their activities cannot be projected directly back to pixel space. Several methods have been devised to address this, but they fail to provide information about the invariances; the authors instead develop a non-parametric view of invariance, as opposed to a quadratic approximation of it.
To understand the operation of a convolutional network, the authors first train a standard convolutional neural network whose input is a set of 2-D images and whose output is a class label. The model is trained on the ImageNet dataset, with the parameters optimized by gradient descent and regularized with dropout. A deconvolutional network (deconvnet) is then used to probe the trained convolutional network and obtain a clear picture of its operation. One can think of the deconvnet as a convnet with the same components but applied in reverse. A deconvnet is attached to each layer of the convnet, and the reconstruction proceeds through unpooling, rectification, and filtering. Although max-pooling in the convnet is non-invertible, approximate unpooling is performed by recording the locations of the maxima in switch variables and placing the reconstructed values back at those locations. Rectification is done by passing the reconstructed signal through a ReLU non-linearity, and for filtering the deconvnet uses transposed versions of the filters used by the convnet. Beyond visualization, the model also aids in selecting better architectures for a given problem.
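
A minimal sketch of one such reverse pass is given below; it assumes a single-channel 2-D feature map, 2x2 max-pooling, and uses SciPy's correlate2d/convolve2d so that convolution with the same kernel plays the role of the transposed (flipped) filter. The helper names and sizes are illustrative, not the authors' implementation.

    import numpy as np
    from scipy.signal import correlate2d, convolve2d

    def max_pool_with_switches(a, k=2):
        """k x k max-pooling that also records which position in each block won
        (the 'switch' settings needed later for approximate unpooling)."""
        H, W = a.shape
        pooled = np.zeros((H // k, W // k))
        switches = np.zeros_like(a, dtype=bool)
        for i in range(H // k):
            for j in range(W // k):
                patch = a[i * k:(i + 1) * k, j * k:(j + 1) * k]
                r, c = np.unravel_index(np.argmax(patch), patch.shape)
                pooled[i, j] = patch[r, c]
                switches[i * k + r, j * k + c] = True
        return pooled, switches

    def deconvnet_step(pooled, switches, filt, k=2):
        """Run one convnet layer in reverse: unpool using the recorded switches,
        rectify with a ReLU, then filter with the flipped (transposed) filter."""
        unpooled = switches * np.kron(pooled, np.ones((k, k)))  # maxima placed back
        rectified = np.maximum(0.0, unpooled)
        # the forward pass used cross-correlation, so convolving with the same
        # kernel is equivalent to filtering with its flipped (transposed) version
        return convolve2d(rectified, filt, mode="same")

    # toy forward pass (filter -> ReLU -> pool) on a random single-channel map
    rng = np.random.default_rng(0)
    image = rng.random((8, 8))
    filt = rng.standard_normal((3, 3))
    feature = np.maximum(0.0, correlate2d(image, filt, mode="same"))
    pooled, switches = max_pool_with_switches(feature)
    reconstruction = deconvnet_step(pooled, switches, filt)
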
By understanding the convnet's operation through the deconvnet technique, the authors devised an architecture better suited to the ImageNet dataset and report a record error rate of 14.8%. The model also generalizes to other datasets: the accuracy is 86.5% on Caltech-101 and 74.2% on Caltech-256, and on PASCAL 2012 the performance is only 3.2% lower than the winner's. In short, the authors created a novel mechanism for visualizing the internal operation of a convnet and showed that the resulting model generalizes well to other datasets.

Q1. How does the deconvolution occur in this network?
