[Figure 1: the example sentence “I like this movie very much !” represented as a sentence matrix with embedding dimensionality d = 5]
Figure 1: Illustration of a CNN architecture for sentence classification. We depict three filter region sizes:
2, 3 and 4, each of which has 2 filters. Filters perform convolutions on the sentence matrix and generate
(variable-length) feature maps; 1-max pooling is performed over each map, i.e., the largest number from
each feature map is recorded. Thus a univariate feature vector is generated from all six maps, and these
6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer
then receives this feature vector as input and uses it to classify the sentence; here we assume binary
classification and hence depict two possible output states.
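As a concrete illustration, the forward pass depicted in Figure 1 can be sketched in NumPy (a minimal sketch with random, untrained weights; the tokens, dimensionality d = 5, region sizes, and filter counts follow the figure):

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ["I", "like", "this", "movie", "very", "much", "!"]
d = 5                                  # embedding dimensionality, as in Figure 1
S = rng.normal(size=(len(tokens), d))  # sentence matrix: one d-dim row per token

region_sizes = [2, 3, 4]               # three filter region sizes...
n_filters = 2                          # ...each with 2 filters

features = []
for h in region_sizes:
    for _ in range(n_filters):
        W = rng.normal(size=(h, d))    # one filter spanning h rows of S
        # convolution: slide the filter over every h-row window of the sentence,
        # yielding a (variable-length) feature map
        fmap = np.array([np.sum(W * S[i:i + h])
                         for i in range(len(tokens) - h + 1)])
        features.append(fmap.max())    # 1-max pooling: keep the largest value

z = np.array(features)                 # 6-dim feature vector (penultimate layer)

# final softmax layer; binary classification, hence two output states
U = rng.normal(size=(2, z.size))
logits = U @ z
probs = np.exp(logits) / np.exp(logits).sum()
```

With the figure's settings this produces six feature maps, a 6-dimensional pooled vector `z`, and a 2-way probability distribution `probs`.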
words from Google News (Mikolov et al., 2013), while GloVe is a model based on global word-word co-occurrence statistics (Pennington et al., 2014). We used a GloVe model trained on a corpus of 840 billion tokens of web data. For both word2vec and GloVe we induce 300-dimensional word vectors. We report results achieved using GloVe representations in Table 3. Here we only report non-static GloVe results (which again uniformly outperformed the static variant).

We also experimented with concatenating word2vec and GloVe representations, thus creating 600-dimensional word vectors to be used as input to the CNN. Pre-trained vectors may not always be available for specific words (in either word2vec or GloVe, or both); in such cases, we randomly initialized the corresponding sub-vectors. Results are reported in the final column of Table 3.

The relative performance achieved using GloVe versus word2vec depends on the dataset and, unfortunately, simply concatenating these representations does not necessarily seem helpful. Practically, our results suggest experimenting with different pre-trained word vectors for new tasks.

We also experimented with using long, sparse one-hot vectors as input word representations, in the spirit of Johnson and Zhang (2014). In this strategy, each word is encoded as a one-hot vector with dimensionality equal to the vocabulary size. Though this representation combined with a one-layer CNN achieves good results on document classification, it is still unknown whether it is useful for sentence classification. We kept the other settings the same as in the basic configuration, with the one-hot vectors fixed during training. Compared to using embeddings as input to the CNN, we found the one-hot approach to perform poorly for sentence classification tasks. We believe that the one-hot CNN may not be suitable for sentence classification when one has a small to modestly sized training dataset, likely due to sparsity: the sentences are perhaps too brief to provide enough information for this high-dimensional encoding. Alternative one-hot architectures might be more appropriate for this scenario. For example, Johnson and Zhang (2015) propose a semi-supervised CNN variant which first learns embeddings of small text regions from unlabeled data, and then integrates them into a supervised CNN. We emphasize that if training data is plentiful, learning embeddings from scratch may indeed be best.

4.3 Effect of filter region size

We first explore the effect of filter region size when using only one region size, and we set the number of feature maps for this region size to 100 (as in the baseline configuration). We consider region sizes of 1, 3, 5, 7, 10, 15, 20, 25 and 30, and record the means and ranges over 10 replications of 10-fold CV for each. We report results in Table 10 and Fig. 3. Because we are only interested in the trend of the accuracy as we alter the region size (rather than the absolute performance on each

Region size   MR
1             77.85 (77.47, 77.97)
3             80.48 (80.26, 80.65)
5             81.13 (80.96, 81.32)
7             81.65 (81.45, 81.85)
10            81.43 (81.28, 81.75)
15            81.26 (81.01, 81.43)
20            81.06 (80.87, 81.30)
25            80.91 (80.73, 81.10)
30            80.91 (80.72, 81.05)

Table 4: Effect of single filter region size. Due to space constraints, we report results for only one dataset here, but these are generally illustrative.
Multiple region sizes   Accuracy (%)
(7)                     81.65 (81.45, 81.85)
(3,4,5)                 81.24 (80.69, 81.56)
(4,5,6)                 81.28 (81.07, 81.56)
(5,6,7)                 81.57 (81.31, 81.80)

[Figure: change in accuracy (%) across the datasets MR, SST-1, SST-2, Subj, TREC, CR, MPQA, Opi and Irony]

configurations. This result held across all datasets.

We also considered a k-max pooling strategy similar to (Kalchbrenner et al., 2014), in which the maximum k values are extracted from the entire
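For reference, k-max pooling in the sense of Kalchbrenner et al. (2014) keeps the k largest values of a feature map while preserving their original order; a minimal NumPy sketch:

```python
import numpy as np

def k_max_pool(feature_map, k):
    """Return the k largest values of a 1-D feature map, in their original order."""
    feature_map = np.asarray(feature_map)
    # indices of the k largest entries, re-sorted by position to preserve order
    idx = np.sort(np.argpartition(feature_map, -k)[-k:])
    return feature_map[idx]

fm = np.array([0.2, 1.5, -0.3, 0.9, 2.1])
top1 = k_max_pool(fm, 1)  # with k = 1 this reduces to ordinary 1-max pooling
top3 = k_max_pool(fm, 3)
```

Note that with k = 1 this reduces exactly to the 1-max pooling used in the baseline configuration.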
4.7 Effect of regularization

Two common regularization strategies for CNNs are dropout and l2 norm constraints; we explore the effect of these here. ‘Dropout’ is applied to the

[Figure: change in accuracy (%) across the datasets MR, SST-1, SST-2, Subj, TREC, CR, MPQA, Opi and Irony]

is almost the same as when the number of feature maps is 100, and it does not help much. But we observe that for the dataset SST-1, a dropout rate of 0.7 actually helps. Referring to Fig. 4, we can see that when the number of feature maps is larger than 100, it hurts performance, possibly due to overfitting, so it is reasonable that in this case dropout would mitigate this effect.

We also experimented with applying dropout only to the convolution layer, while still setting the max norm constraint on the classification layer to 3 and keeping all other settings exactly the same. This means we randomly set elements of the sentence matrix to 0 during training with probability p, and then multiplied p with the sentence matrix at test time. The effect of dropout rate on the convolution layer is shown in Fig. 8. Again we see that dropout on the convolution layer helps little, and a large dropout rate dramatically hurts performance.

To summarize, contrary to some of the existing literature (Srivastava et al., 2014), we found that dropout had little beneficial effect on CNN performance. We attribute this observation to the fact that the one-layer CNN has far fewer parameters than multi-layer deep learning models. Another possible explanation is that using word embeddings helps to prevent overfitting (compared to bag-of-words based encodings). However, we are not advocating completely foregoing regularization. Practically, we suggest setting the dropout rate to a small value (0.0-0.5) and using a relatively large max norm constraint, while increasing the number of feature maps to see whether more features might help. When further increasing the number of feature maps seems to degrade performance, it is probably worth increasing the dropout rate.

[Figure 7 plot: change in accuracy (%) vs. dropout rate (None, 0.0, 0.1, 0.3, 0.5, 0.7, 0.9) with 500 feature maps]

Figure 7: Effect of dropout rate when using 500 feature maps.

5 Conclusions

We have conducted an extensive experimental analysis of CNNs for sentence classification. We conclude here by summarizing our main findings and deriving from these practical guidance for researchers and practitioners looking to use and deploy CNNs in real-world sentence classification scenarios.

5.1 Summary of Main Empirical Findings

• Prior work has tended to report only the mean performance on datasets achieved by models. But this overlooks variance due solely to the stochastic inference procedure used. This can be substantial: holding everything constant (including the folds), so that variance is due exclusively to the stochastic inference procedure, we find that mean accuracy (calculated via 10-fold cross-validation) has a range of
up to 1.5 points. And the range over the AUC achieved on the irony dataset is even greater, up to 3.4 points (see Table 3). More replication should be performed in future work, and ranges/variances should be reported, to prevent potentially spurious conclusions regarding relative model performance.

• We find that, even when tuning them to the task at hand, the choice of input word vector representation (e.g., between word2vec and GloVe) has an impact on performance; however, different representations perform better for different tasks. At least for sentence classification, both seem to perform better than using one-hot vectors directly. We note, however, that: (1) this may not be the case if one has a sufficiently large amount of training data, and (2) the recent semi-supervised CNN model proposed by Johnson and Zhang (2015) may improve performance, as compared to the simpler version of the model considered here (i.e., that proposed in Johnson and Zhang (2014)).

• The filter region size can have a large effect on performance, and should be tuned.

• The number of feature maps can also play an important role in performance, and increasing the number of feature maps will increase the training time of the model.

• 1-max pooling uniformly outperforms other pooling strategies.

• Regularization has relatively little effect on the performance of the model.

5.2 Specific advice to practitioners

Drawing upon our empirical results, we provide the following guidance regarding CNN architecture and hyperparameters for practitioners looking to deploy CNNs for sentence classification tasks.

• Consider starting with the basic configuration described in Table 2, using non-static word2vec or GloVe rather than one-hot vectors. However, if the training dataset is very large, it may be worthwhile to explore using one-hot vectors. Alternatively, if one has access to a large set of unlabeled in-domain data, the approach of Johnson and Zhang (2015) might also be an option.

• Line-search over the single filter region size to find the ‘best’ single region size. A reasonable range might be 1-10. However, for datasets with very long sentences, like CR, it may be worth exploring larger filter region sizes. Once this ‘best’ region size is identified, it may be worth exploring combining multiple filters using region sizes near this single best size, given that empirically multiple ‘good’ region sizes always outperformed using only the single best region size.

• Alter the number of feature maps for each filter region size from 100 to 600, and while this is being explored, use a small dropout rate (0.0-0.5) and a large max norm constraint. Note that increasing the number of feature maps will increase the running time, so there is a trade-off to consider. Also pay attention to whether the best value found is near the border of the range (Bengio, 2012). If the best value is near 600, it may be worth trying larger values.

• Consider different activation functions if possible: ReLU and tanh are the best overall candidates. It might also be worth trying no activation function at all for our one-layer CNN.

• Use 1-max pooling; it does not seem necessary to expend resources evaluating alternative strategies.

• Regarding regularization: when increasing the number of feature maps begins to reduce performance, try imposing stronger regularization, e.g., a dropout rate larger than 0.5.

• When assessing the performance of a model (or a particular configuration thereof), it is imperative to consider variance. Therefore, replications of the cross-validation procedure should be performed and variances and ranges should be considered.

Of course, the above suggestions are applicable only to datasets comprising sentences with similar properties to those considered in this work. And there may be examples that run counter to our findings here. Nonetheless, we believe these suggestions are likely to provide a reasonable starting point for researchers or practitioners looking
to apply a simple one-layer CNN to real-world sentence classification tasks. We emphasize that we selected this simple one-layer CNN in light of its observed strong empirical performance, which positions it as a new standard baseline model akin to bag-of-words SVM and logistic regression. This approach should thus be considered prior to implementation of more sophisticated models.

We have attempted here to provide practical, empirically informed guidance to help data science practitioners find the best configuration for this simple model. We recognize that manual and grid search over hyperparameters is sub-optimal, and note that our suggestions here may also inform hyperparameter ranges to explore in random search or Bayesian optimization frameworks.

6 Acknowledgments

This work was supported in part by the Army Research Office (grant W911NF-14-1-0442) and by The Foundation for Science and Technology, Portugal (grant UTAP-EXPL/EEIESS/0031/2014). This work was also made possible by the support of the Texas Advanced Computing Center (TACC) at UT Austin.

We thank Tong Zhang and Rie Johnson for helpful feedback.
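The hyperparameter-search point above can be made concrete: the ranges suggested in Section 5.2 feed directly into a random-search loop. The following is a sketch, in which `train_and_eval` is a hypothetical stand-in for a full training-plus-cross-validation run:

```python
import random

random.seed(0)

def train_and_eval(config):
    """Hypothetical stand-in: train the CNN with `config`, return mean CV accuracy."""
    return random.random()  # placeholder score for illustration only

best_score, best_config = float("-inf"), None
for _ in range(20):  # number of random trials
    config = {
        "region_size": random.randint(1, 10),        # line-search range suggested above
        "n_feature_maps": random.randint(100, 600),  # watch the border of the range
        "dropout_rate": random.uniform(0.0, 0.5),    # small dropout rate
        "activation": random.choice(["relu", "tanh"]),  # best overall candidates
    }
    score = train_and_eval(config)
    if score > best_score:
        best_score, best_config = score, config
```

If the best `n_feature_maps` found lies near 600, the advice above suggests widening the range and trying larger values.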
Table 10: Effect of single filter region size using non-static CNN.
Table 11: Effect of single filter region size using static CNN.
Table 12: Effect of the number of feature maps for each filter using non-static word2vec-CNN
Table 13: Effect of the number of feature maps for each filter using static word2vec-CNN
Dataset   Sigmoid   tanh   Softplus   Iden   ReLU   Cube   tanh-cube
MR 80.51 (80.22, 80.77) 81.28 (81.07, 81.52) 80.58 (80.17, 81.12) 81.30 (81.09, 81.52) 81.16 (80.81, 83.38) 80.39 (79.94,80.83) 81.22 (80.93,81.48)
SST-1 45.83 (45.44, 46.31) 47.02 (46.31, 47.73) 46.95 (46.43, 47.45) 46.73 (46.24,47.18) 47.13 (46.39, 47.56) 45.80 (45.27,46.51) 46.85 (46.13,47.46)
SST-2 84.51 (84.36, 84.63) 85.43 (85.10, 85.85) 84.61 (84.19, 84.94) 85.26 (85.11, 85.45) 85.31 (85.93, 85.66) 85.28 (85.15,85.55) 85.24 (84.98,85.51)
Subj 92.00 (91.87, 92.22) 93.15 (92.93, 93.34) 92.43 (92.21, 92.61) 93.11 (92.92, 93.22) 93.13 (92.93, 93.23) 93.01 (93.21,93.43) 92.91 (93.13,93.29)
TREC 89.64 (89.38, 89.94) 91.18 (90.91, 91.47) 91.05 (90.82, 91.29) 91.11 (90.82, 91.34) 91.54 (91.17, 91.84) 90.98 (90.58,91.47) 91.34 (90.97,91.73)
CR 82.60 (81.77, 83.05) 84.28 (83.90, 85.11) 83.67 (83.16, 84.26) 84.55 (84.21, 84.69) 83.83 (83.18, 84.21) 84.16 (84.47,84.88) 83.89 (84.34,84.89)
MPQA 89.56 (89.43, 89.78) 89.48 (89.16, 89.84) 89.62 (89.45, 89.77) 89.57 (89.31, 89.88) 89.35 (88.88, 89.58) 88.66 (88.55,88.77) 89.45 (89.27,89.62)
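For reference, the activation functions compared in the table above can be written out as follows (a sketch; we take ‘Cube’ to be x^3 as in Chen and Manning (2014), ‘tanh-cube’ to be tanh applied to the cube, and ‘Iden’ to be the identity, i.e., no activation function):

```python
import numpy as np

activations = {
    "Sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
    "tanh": np.tanh,
    "Softplus": lambda x: np.log1p(np.exp(x)),  # smooth approximation of ReLU
    "Iden": lambda x: x,                        # identity: no activation function
    "ReLU": lambda x: np.maximum(0.0, x),
    "Cube": lambda x: x ** 3,                   # cube activation (Chen and Manning, 2014)
    "tanh-cube": lambda x: np.tanh(x ** 3),
}

x = np.array([-1.0, 0.0, 2.0])
relu_out = activations["ReLU"](x)
cube_out = activations["Cube"](x)
```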
Table 22: Effect of dropout rate when the number of feature maps is 500, using non-static word2vec-CNN

Dataset   0.0   0.1   0.3   0.5   0.7   0.9
MR     81.16 (80.80, 81.57)   81.19 (80.98, 81.46)   81.13 (80.58, 81.58)   81.08 (81.01, 81.13)   81.06 (80.49, 81.48)   80.05 (79.92, 80.37)
SST-1  45.97 (45.65, 46.43)   46.19 (45.71, 46.64)   46.28 (45.83, 46.93)   46.34 (46.04, 46.98)   44.22 (43.87, 44.78)   43.15 (42.94, 43.32)
SST-2  85.50 (85.46, 85.54)   85.62 (85.56, 85.72)   85.47 (85.19, 85.58)   85.35 (85.06, 85.52)   85.02 (84.64, 85.31)   84.14 (83.86, 84.51)
Subj   93.21 (93.13, 93.31)   93.19 (93.07, 93.34)   93.20 (93.03, 93.39)   92.67 (92.40, 92.98)   91.27 (91.16, 91.43)   88.46 (88.19, 88.62)
TREC   91.41 (91.22, 91.66)   91.62 (91.51, 91.70)   91.56 (91.46, 91.68)   91.41 (91.01, 91.64)   91.03 (90.82, 91.23)   86.63 (86.15, 86.90)
CR     84.21 (83.81, 84.62)   83.88 (83.54, 84.11)   83.97 (83.73, 84.16)   83.97 (83.75, 84.18)   83.47 (82.86, 83.72)   79.79 (78.89, 80.38)
MPQA   89.40 (89.15, 89.56)   89.45 (89.26, 89.60)   89.14 (89.08, 89.20)   88.86 (88.70, 89.05)   87.88 (87.71, 88.18)   83.96 (83.76, 84.12)
Table 23: Effect of dropout rate on the convolution layer using non-static word2vec-CNN
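The dropout scheme evaluated in Tables 22 and 23, zeroing entries of the sentence matrix with probability p during training and rescaling at test time, can be sketched as follows (a generic sketch using the standard convention of rescaling by the keep probability 1 - p at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(S, p, train):
    """Dropout on a sentence matrix S: zero each entry with probability p during
    training; at test time, rescale by the keep probability (1 - p) instead."""
    if train:
        mask = rng.random(S.shape) >= p  # keep each entry with probability 1 - p
        return S * mask
    return S * (1.0 - p)

S = np.ones((7, 300))  # toy sentence matrix: 7 tokens, 300-dim embeddings
p = 0.5

dropped = dropout(S, p, train=True)    # training pass: entries randomly zeroed
scaled = dropout(S, p, train=False)    # test pass: deterministic rescaling
```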