Pruning a BERT-based Question Answering Model

J.S. McCarley
IBM T.J. Watson Research Center
Yorktown Heights, NY

arXiv:1910.06360v1 [cs.CL] 14 Oct 2019

Abstract

We investigate compressing a BERT-based question answering system by pruning parameters from the underlying BERT model. We start from models trained for SQuAD 2.0 and introduce gates that allow selected parts of transformers to be individually eliminated. Specifically, we investigate (1) reducing the number of attention heads in each transformer, (2) reducing the intermediate width of the feed-forward sublayer of each transformer, and (3) reducing the embedding dimension. We compare several approaches for determining the values of these gates. We find that a combination of pruning attention heads and the feed-forward layer almost doubles the decoding speed, with only a 1.5 f-point loss in accuracy.

1 Introduction

The recent surge in NLP model complexity has outstripped Moore's law.¹ (Peters et al., 2018; Devlin et al., 2018; Narasimhan, 2019) Deeply stacked layers of transformers (including BERT, RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019b), and ALBERT (Lan et al., 2019)) have greatly improved state-of-the-art accuracies across a variety of NLP tasks, but the computational intensity raises concerns in the cloud-computing economy. Numerous techniques developed to shrink neural networks, including distillation, quantization, and pruning, are now being applied to transformers.

¹ ELMO: 93 × 10^6, BERT: 340 × 10^6, Megatron: 8300 × 10^6 parameters

Question answering, in particular, has immediate applications in real-time systems. Question answering has seen striking gains in accuracy due to transformers, as measured on the SQuAD (Rajpurkar et al., 2016) and SQuAD 2.0 (Rajpurkar et al., 2018) leaderboards. SQuAD is seen as a worst case for performance loss under speed-up techniques based on quantization (Shen et al., 2019), while the difficulty of distilling a SQuAD model (compared to sentence-level GLUE tasks) is acknowledged in (Jiao et al., 2019). We speculate that these difficulties arise because answer selection requires token-level rather than passage-level annotation, and because of the need for long-range attention between query and passage.

In this paper we investigate pruning three aspects of BERT:
(1) the number of attention heads n_H,
(2) the intermediate size d_I,
(3) the embedding or hidden dimension d_E.

The contributions of this paper are (1) application of structured pruning techniques to the feed-forward layer and the hidden dimension of the transformers, not just the attention heads, (2) thereby significantly pruning BERT with minimal loss of accuracy on a question answering task and considerable speedup, all without the expense of revisiting pretraining, and (3) surveying multiple pruning techniques (both heuristic and trainable) and providing recommendations specific to transformer-based question answering models.

Widely distributed pre-trained models typically consist of 12-24 layers of identically sized transformers. We will see that an optimal pruning yields non-identical transformers, namely lightweight transformers near the top and bottom while retaining more complexity in the intermediate layers.

2 Related work

While distillation (student-teacher) of BERT has produced notably smaller models (Tang et al., 2019; Turc et al., 2019; Tsai et al., 2019; Yang et al., 2019a), the focus has been on sentence level annotation tasks that do not require long-range
attention. Revisiting the pretraining phase during distillation is often a significant requirement. DistilBERT (Sanh et al., 2019) reports modest speedup and small performance loss on SQuAD 1.1. TinyBERT (Jiao et al., 2019) restricts SQuAD evaluation to using BERT-base as a teacher, and defers deeper investigation to future work.

Our work is perhaps most similar to (Fan et al., 2019), an exploration of pruning as a form of dropout. They prune entire layers of BERT, but suggest that smaller structures could also be pruned. They evaluate on MT, language modeling, and generation-like tasks, but not SQuAD. L0 regularization was combined with matrix factorization to prune transformers in (Wang et al., 2019). Gale et al. (2019) induced unstructured sparsity on a transformer-based MT model, but did not report speedups. Voita et al. (2019) focused on linguistic interpretability of attention heads and introduced L0 regularization to BERT, but did not report speedups. Kovaleva et al. (2019) also focused on interpreting attention, and achieved small accuracy gains on GLUE tasks by disabling (but not pruning) certain attention heads. Michel et al. (2019) achieved speedups on MT and MNLI by gating only the attention with simple heuristics.

3 Pruning transformers

3.1 Notation

The size of a BERT model is characterized by the values in Figure 1.

notation  dimension          base  large
n_L       layers             12    24
d_E       embeddings         768   1024
n_H       attention heads    12    16
d_I       intermediate size  3072  4096

Figure 1: Notation: important dimensions of a BERT model

3.2 Gate placement

Our approach to pruning each aspect of a transformer is similar. We insert three masks into each transformer. Each mask is a vector of gate variables γ_i ∈ [0, 1], where γ_i = 0 indicates a slice of transformer parameters to be pruned, and γ_i = 1 indicates a slice to remain active. We describe the placement of each mask following the terminology of (Vaswani et al., 2017), indicating the relevant sections of that paper.

In each self-attention sublayer, we place a mask, Γ^attn, of size n_H, which selects attention heads to remain active (section 3.2.2).

In each feed-forward sublayer, we place a mask, Γ^ff, of size d_I, which selects ReLU/GeLU activations to remain active (section 3.3).

The final mask, Γ^emb, of size d_E, selects which embedding dimensions (section 3.4) remain active. This gate is applied identically to both input and residual connections in each transformer.

3.3 Determining Gate Values

We investigate four approaches to determining the gate values.

(1) "random": each γ_i is sampled from a Bernoulli distribution of parameter p, where p is manually adjusted to control the sparsity.

(2) "gain": We follow the method of (Michel et al., 2019) and estimate the influence of each gate γ_i on the training set likelihood L by computing the mean value of

g_i = ∂L / ∂γ_i

("head importance score") during one pass over the training data. We threshold g_i to determine which transformer slices to retain.

(3) "leave-one-out": We again follow the method of (Michel et al., 2019) and evaluate the impact δf_i on devset f score of a system with exactly one gate set to zero. Note that this procedure requires n_L × n_H passes through the data. We control the sparsity during decoding by retaining those gates for which δf_i is large.

(4) "L0 regularization": Following the method described in (Louizos et al., 2017), during training time the gate variables γ_i are sampled from a hard-concrete distribution (Maddison et al., 2017) parameterized by a corresponding variable α_i ∈ R. The task-specific objective function is penalized in proportion to the expected number of instances of γ = 1. Proportionality constants λ_attn, λ_ff, and λ_emb in the penalty terms are manually adjusted to control the sparsity. We resample the γ_i with each minibatch. We note that the full objective function is differentiable with respect to the α_i because of the reparameterization trick (Kingma and Welling, 2014; Rezende et al., 2014). The α_i are updated by backpropagation for one training epoch with the SQuAD training data, with all other parameters held fixed. The final values for the gates γ_i are obtained by thresholding the α_i.
3.4 Pruning

After the values of the γ_i have been determined by one of the above methods, the model is pruned. Attention heads corresponding to γ_i^attn = 0 are removed. Slices of the feed forward linear transformations corresponding to γ_i^ff = 0 are removed. The pruned model no longer needs masks, and now consists of transformers of varying, non-identical sizes.

We note that task-specific training of all BERT parameters may be continued further with the pruned model.

4 Experiments

For development experiments (learning rate and penalty weight exploration), and in order to minimize overuse of the official dev-set, we use 90% of the official SQuAD 2.0 training data for training gates, and report results on the remaining 10%. Our development experiments (base-qa) are all initialized from a SQuAD 2.0 system initialized from bert-base-uncased and trained on the 90% that provides a baseline performance of 75.0 f1 on the 10% dataset.²

² Our baseline SQuAD model depends upon code distributed by transformers, and incorporates either bert-base-uncased or bert-large-uncased with a standard task-specific head.

Our validation experiments use the standard training/dev configuration of SQuAD 2.0. All are initialized from a system that has an accuracy of 84.6 f1 on the official dev set (Glass et al., 2019). (This model was initialized from bert-large-…)

The gate parameters of the "L0 regularization" experiments are trained for one epoch starting from the models above, with all transformer and embedding parameters fixed. The cost of training the gate parameters is comparable to extending fine tuning for an additional epoch. We investigated learning rates of 10^-3, 10^-2, and 10^-1 on base-qa, and chose the largest, 10^-1, for presentation and results on large-qa. This is notably larger than typical learning rates to tune BERT parameters. We used a minibatch size of 24, otherwise default hyperparameters of the BERT-Adam optimizer. We used identical parameters for our large-qa experiments, except with gradaccsteps=3. Tables report median values across 5 random seeds; graphs overplot results for 5 seeds.

4.1 Accuracy as function of pruning

In figure 2 we plot the f1 accuracy of base-qa as a function of the percentage of attention heads removed. As expected, the performance of "random" decays most abruptly. "Leave-one-out" and "gain" are better, but substantially similar. "L0 regularization" is best, allowing 50% pruning at a cost of 5 f-points.

In figure 3 we plot the f1 accuracy as a function of the percentage of feed-forward activations removed. We see broadly similar trends as above, except that the performance is robust to even larger pruning. "Leave-one-out" requires a prohibitive number of passes (n_L × d_I) through the data.

In figure 4 we plot the accuracy for removing embedding dimensions. We see that performance falls much more steeply with the removal of embedding dimensions. Attempts to train "L0 regularization" were unsuccessful; we speculate that the strong cross-layer coupling may necessitate a different learning rate schedule.

Figure 2: f1 vs percentage of attention heads pruned

Figure 3: f1 vs percentage of feed-forward activations pruned

Figure 4: f1 vs percentage of embedding dimensions

4.2 Validating these results

On the basis of the development experiments, we select operating points (values of λ_attn and λ_ff) and train the gates of large-qa with these penalties. The decoding times, accuracies, and model sizes are summarized in table 1. Models in which both attention and feed forward components are pruned were produced by combining the independently trained gate configurations of attention and feed forward. For the same parameter values, the large model is pruned somewhat less than the small model. We also note that the f1 loss due to pruning is somewhat smaller, for the same parameter values. We note that much of the performance loss can be recovered by continuing the training for an additional epoch after the pruning.

model                  time (sec)  f1    attn-prune  ff-prune  size (MiB)
no pruning             2605        84.6  0           0         1278
attn1                  2253        84.2  44.3        0         1110
ff1                    2078        83.2  0           47.7      909
ff1 + attn1            1631        82.6  44.3        47.7      741
ff2 + attn2            1359        80.9  52.6        65.2      575
ff2 + attn2 + retrain  1349        83.2                        575

Table 1: Decoding times, accuracies, and space savings achieved by two sample operating points on large-qa

The speedup in decoding due to pruning the model is not simply proportional to the amount pruned. There are computations in both the attention and feed-forward part of each transformer layer that necessarily remain unpruned, for example layer normalization.

4.3 Impact of pruning each layer

In Fig. 5 we show the percentage of attention heads and feed forward activations remaining after pruning, by layer. We see that intermediate layers retained more, while layers close to the embedding and close to the answer were pruned more heavily.

Figure 5: Percentage of attention heads and feed forward activations remaining after pruning, by layer

5 Conclusion

We investigate various methods to prune transformer-based models, and evaluate the accuracy-speed tradeoff for this pruning. We find that both the attention heads and especially the feed forward layer can be pruned considerably with minimal loss of accuracy, while pruning the embedding/hidden dimension is ineffective because of a loss in accuracy. We find that L0 regularization pruning, when successful, is considerably more effective than heuristic methods. We also find that pruning the feed-forward layer and the attention heads can be easily combined, and, especially after retraining, yield a considerably faster question answering model with minimal loss in accuracy.

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Angela Fan, Edouard Grave, and Armand Joulin. 2019. Reducing transformer depth on demand with structured dropout.

Trevor Gale, Erich Elsen, and Sara Hooker. 2019. The state of sparsity in deep neural networks. CoRR, abs/1902.09574.

Michael Glass, Alfio Gliozzo, Rishav Chakravarti, Anthony Ferritto, Lin Pan, G P Shrivatsa Bhargav, Dinesh Garg, and Avirup Sil. 2019. Span selection pre-training for question answering.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. International Conference on Learning Representations.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. CoRR.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Christos Louizos, Max Welling, and Diederik P. Kingma. 2017. Learning sparse neural networks through L0 regularization.

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The concrete distribution: A continuous relaxation of discrete random variables. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? CoRR.

Shar Narasimhan. 2019. NVIDIA clocks world's fastest BERT training time and largest transformer based model, paving path for advanced conversational AI.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. CoRR, abs/1806.03822.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. CoRR.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286, Beijing, China. PMLR.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2019. Q-BERT: Hessian based ultra low precision quantization of BERT.

Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling task-specific knowledge from BERT into simple neural networks.

Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. CoRR, abs/1905.09418.

Ziheng Wang, Jeremy Wohlwend, and Tao Lei. 2019. Structured pruning of large language models.

Ze Yang, Linjun Shou, Ming Gong, Wutao Lin, and Daxin Jiang. 2019a. Model compression with multi-task knowledge distillation for web-scale question answering system.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019b. XLNet: Generalized autoregressive pretraining for language understanding. CoRR.

A Supplemental Material
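As an illustration of the pruning step of Section 3.4, the following minimal NumPy sketch shows why masks are no longer needed once a feed-forward sublayer has been pruned: removing the gated-off columns of the first linear map and the matching rows of the second gives the same output as multiplying the activations by the 0/1 gate. This is not the paper's code. The function names, the toy sizes, the random gate, and the tanh approximation of GeLU are all assumptions made for the example.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GeLU (an assumption; any elementwise activation works)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def ff_masked(x, w1, b1, w2, b2, gate):
    """Feed-forward sublayer with a 0/1 gate on each intermediate activation."""
    return (gelu(x @ w1 + b1) * gate) @ w2 + b2


def prune_ff(w1, b1, w2, gate):
    """Drop the columns of w1, entries of b1, and rows of w2 whose gate is zero."""
    keep = gate > 0
    return w1[:, keep], b1[keep], w2[keep, :]


rng = np.random.default_rng(0)
d_e, d_i = 8, 32                        # toy sizes, far smaller than BERT's 768 / 3072
w1, b1 = rng.normal(size=(d_e, d_i)), rng.normal(size=d_i)
w2, b2 = rng.normal(size=(d_i, d_e)), rng.normal(size=d_e)
gate = (rng.uniform(size=d_i) > 0.5).astype(np.float64)  # hypothetical gate values
x = rng.normal(size=(4, d_e))

w1_p, b1_p, w2_p = prune_ff(w1, b1, w2, gate)
y_masked = ff_masked(x, w1, b1, w2, b2, gate)
y_pruned = gelu(x @ w1_p + b1_p) @ w2_p + b2   # smaller matrices, no mask needed
```

The output bias b2 and any layer normalization around the sublayer are untouched by this slicing, which is consistent with the observation in Section 4.2 that speedup is not simply proportional to the amount pruned.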