
J.S. McCarley

IBM T.J. Watson Research Center

Yorktown Heights, NY

jsmc@us.ibm.com

arXiv:1910.06360v1 [cs.CL] 14 Oct 2019

Abstract

We investigate compressing a BERT-based question answering system by pruning parameters from the underlying BERT model. We start from models trained for SQuAD 2.0 and introduce gates that allow selected parts of transformers to be individually eliminated. Specifically, we investigate (1) reducing the number of attention heads in each transformer, (2) reducing the intermediate width of the feed-forward sublayer of each transformer, and (3) reducing the embedding dimension. We compare several approaches for determining the values of these gates. We find that a combination of pruning attention heads and the feed-forward layer almost doubles the decoding speed, with only a 1.5 f-point loss in accuracy.

1 Introduction

The recent surge in NLP model complexity has outstripped Moore's law¹ (Peters et al., 2018; Devlin et al., 2018; Narasimhan, 2019). Deeply stacked layers of transformers (including BERT, RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019b), and ALBERT (Lan et al., 2019)) have greatly improved state-of-the-art accuracies across a variety of NLP tasks, but the computational intensity raises concerns in the cloud-computing economy. Numerous techniques developed to shrink neural networks, including distillation, quantization, and pruning, are now being applied to transformers.

Question answering, in particular, has immediate applications in real-time systems. Question answering has seen striking gains in accuracy due to transformers, as measured on the SQuAD (Rajpurkar et al., 2016) and SQuAD 2.0 (Rajpurkar et al., 2018) leaderboards. SQuAD is seen as a […] techniques based on quantization (Shen et al., 2019), while the difficulty of distilling a SQuAD model (compared to sentence-level GLUE tasks) is acknowledged in (Jiao et al., 2019). We speculate that these difficulties arise because answer selection requires token-level rather than passage-level annotation, and because of the need for long-range attention between query and passage.

¹ ELMO: 93 × 10⁶, BERT: 340 × 10⁶, Megatron: 8300 × 10⁶ parameters

In this paper we investigate pruning three aspects of BERT:
(1) the number of attention heads nH,
(2) the intermediate size dI,
(3) the embedding or hidden dimension dE.

The contributions of this paper are (1) application of structured pruning techniques to the feed-forward layer and the hidden dimension of the transformers, not just the attention heads, (2) thereby significantly pruning BERT with minimal loss of accuracy on a question answering task and with considerable speedup, all without the expense of revisiting pretraining, and (3) a survey of multiple pruning techniques (both heuristic and trainable), with recommendations specific to transformer-based question answering models.

Widely distributed pre-trained models typically consist of 12-24 layers of identically sized transformers. We will see that an optimal pruning yields non-identical transformers, namely lightweight transformers near the top and bottom while retaining more complexity in the intermediate layers.

2 Related work

While distillation (student-teacher) of BERT has produced notably smaller models (Tang et al., 2019; Turc et al., 2019; Tsai et al., 2019; Yang et al., 2019a), the focus has been on sentence-level annotation tasks that do not require long-range attention.

Revisiting the pretraining phase during distillation is often a significant requirement. DistilBERT (Sanh et al., 2019) reports modest speedup and small performance loss on SQuAD 1.1. TinyBERT (Jiao et al., 2019) restricts SQuAD evaluation to using BERT-base as a teacher, and defers deeper investigation to future work.

Our work is perhaps most similar to (Fan et al., 2019), an exploration of pruning as a form of dropout. They prune entire layers of BERT, but suggest that smaller structures could also be pruned. They evaluate on MT, language modeling, and generation-like tasks, but not SQuAD. L0 regularization was combined with matrix factorization to prune transformers in (Wang et al., 2019). Gale et al. (2019) induced unstructured sparsity on a transformer-based MT model, but did not report speedups. Voita et al. (2019) focused on linguistic interpretability of attention heads and introduced L0 regularization to BERT, but did not report speedups. Kovaleva et al. (2019) also focused on interpreting attention, and achieved small accuracy gains on GLUE tasks by disabling (but not pruning) certain attention heads. Michel et al. (2019) achieved speedups on MT and MNLI by gating only the attention with simple heuristics.

3 Pruning transformers

3.1 Notation

The size of a BERT model is characterized by the values in figure 1.

notation   dimension           base   large
nL         layers              12     24
dE         embeddings          768    1024
nH         attention heads     12     16
dI         intermediate size   3072   4096

Figure 1: Notation: important dimensions of a BERT model

3.2 Gate placement

Our approach to pruning each aspect of a transformer is similar. We insert three masks into each transformer. Each mask is a vector of gate variables γi ∈ [0, 1], where γi = 0 indicates a slice of transformer parameters to be pruned, and γi = 1 indicates a slice to remain active. We describe the placement of each mask following the terminology of (Vaswani et al., 2017), indicating the relevant sections of that paper.

In each self-attention sublayer, we place a mask Γattn of size nH, which selects attention heads to remain active (section 3.2.2). In each feed-forward sublayer, we place a mask Γff of size dI, which selects ReLU/GeLU activations to remain active (section 3.3). The final mask, Γemb, of size dE, selects which embedding dimensions remain active (section 3.4). This gate is applied identically to both input and residual connections in each transformer.

3.3 Determining Gate Values

We investigate four approaches to determining the gate values.

(1) "random:" each γi is sampled from a Bernoulli distribution with parameter p, where p is manually adjusted to control the sparsity.

(2) "gain:" We follow the method of (Michel et al., 2019) and estimate the influence of each gate γi on the training set likelihood L by computing the mean value of

    gi = ∂L/∂γi    (1)

(the "head importance score") during one pass over the training data. We threshold gi to determine which transformer slices to retain.

(3) "leave-one-out:" We again follow the method of (Michel et al., 2019) and evaluate the impact on dev-set f1 score of a system with exactly one gate set to zero. Note that this procedure requires nL × nH passes through the data. We control the sparsity during decoding by retaining those gates for which δfi is large.

(4) "L0 regularization:" Following the method described in (Louizos et al., 2017), during training the gate variables γi are sampled from a hard-concrete distribution (Maddison et al., 2017) parameterized by a corresponding variable αi ∈ R. The task-specific objective function is penalized in proportion to the expected number of instances of γ = 1. Proportionality constants λattn, λff, and λemb in the penalty terms are manually adjusted to control the sparsity. We resample the γi with each minibatch. We note that the full objective function is differentiable with respect to the αi because of the reparameterization trick (Kingma and Welling, 2014; Rezende et al., 2014). The αi are updated by backpropagation for one training epoch with the SQuAD training data, with all other parameters held fixed. The final values for the gates γi are obtained by thresholding the αi.

Figure 2: f1 vs percentage of attention heads pruned

Figure 3: f1 vs percentage of feed-forward activations pruned

Figure 4: f1 vs percentage of embedding dimensions removed

Figure 5: Percentage of attention heads and feed-forward activations remaining after pruning, by layer
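The "leave-one-out" scoring of section 3.3 can be sketched as follows. The evaluate() function here is a toy stand-in so the sketch is self-contained; a real implementation would decode SQuAD on the dev set with one attention-head gate zeroed, and the per-head contribution values below are purely illustrative.

```python
def evaluate(gates):
    """Toy stand-in for dev-set f1 with the given head gates applied.
    The hard-coded contributions are illustrative, not measured values."""
    contributions = [0.5, 2.0, 0.1, 1.0]
    return 80.0 + sum(c for c, g in zip(contributions, gates) if g == 1)

def leave_one_out_scores(n_heads):
    """delta-f1 for each head: drop in f1 when exactly that gate is zero."""
    baseline = evaluate([1] * n_heads)
    scores = []
    for i in range(n_heads):
        gates = [1] * n_heads
        gates[i] = 0                               # zero exactly one gate
        scores.append(baseline - evaluate(gates))  # δf_i for head i
    return scores

def heads_to_keep(scores, n_keep):
    """Retain the heads whose removal hurts f1 the most."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])
```

The nL × nH cost noted in the text is visible here: each score requires one full evaluation pass per gate.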

After the values of the γi have been determined by one of the above methods, the model is pruned. Attention heads corresponding to γi^attn = 0 are removed. Slices of the feed-forward linear transformations corresponding to γi^ff = 0 are removed. The pruned model no longer needs masks, and now consists of transformers of varying, non-identical sizes.

We note that task-specific training of all BERT parameters may be continued further with the pruned model.

4 Experiments

For development experiments (learning rate and penalty weight exploration), and in order to minimize overuse of the official dev set, we use 90% of the official SQuAD 2.0 training data for training gates, and report results on the remaining 10%. Our development experiments (base-qa) are all initialized from a SQuAD 2.0 system initialized from bert-base-uncased and trained on the 90% […] the 10% dataset.²

Our validation experiments use the standard training/dev configuration of SQuAD 2.0. All are initialized from a system that has an accuracy of 84.6 f1 on the official dev set (Glass et al., 2019). (This model was initialized from bert-large-uncased.)

The gate parameters of the "L0 regularization" experiments are trained for one epoch starting from the models above, with all transformer and embedding parameters fixed. The cost of training the gate parameters is comparable to extending fine-tuning for an additional epoch. We investigated learning rates of 10⁻³, 10⁻², and 10⁻¹ on base-qa, and chose the latter for presentation and results on large-qa. This is notably larger than typical learning rates used to tune BERT parameters. We used a minibatch size of 24, otherwise default hyperparameters of the BERT-Adam optimizer.

² Our baseline SQuAD model depends upon code distributed by https://github.com/huggingface/transformers, and incorporates either bert-base-uncased or bert-large-uncased with a standard task-specific head.
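The pruning step itself, physically removing the slices whose gates are zero, can be sketched for a feed-forward sublayer using plain Python lists in place of framework tensors. The helper prune_feed_forward and the toy dimensions are illustrative, not the paper's code.

```python
def prune_feed_forward(W1, b1, W2, gates):
    """Remove feed-forward activations whose gate is 0.

    W1: dI rows of length dE  (first linear transformation)
    b1: dI biases
    W2: dE rows of length dI  (second linear transformation)
    gates: dI values in {0, 1}
    """
    keep = [i for i, g in enumerate(gates) if g == 1]
    W1p = [W1[i] for i in keep]                    # drop pruned rows of W1
    b1p = [b1[i] for i in keep]
    W2p = [[row[i] for i in keep] for row in W2]   # drop matching columns of W2
    return W1p, b1p, W2p

# Toy example: dE = 2, dI = 3, pruning the middle activation.
W1 = [[1, 2], [3, 4], [5, 6]]
b1 = [0.1, 0.2, 0.3]
W2 = [[1, 2, 3], [4, 5, 6]]
W1p, b1p, W2p = prune_feed_forward(W1, b1, W2, [1, 0, 1])
# W1p == [[1, 2], [5, 6]], b1p == [0.1, 0.3], W2p == [[1, 3], [4, 6]]
```

Because a row of W1 and the matching column of W2 are removed together, the sublayer computes exactly the same function it would with the gate fixed at zero, and no masks are needed at decoding time.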

model                   time (sec)   f1     attn-prune   ff-prune   size (MiB)
no pruning              2605         84.6   0            0          1278
attn1                   2253         84.2   44.3         0          1110
ff1                     2078         83.2   0            47.7       909
ff1 + attn1             1631         82.6   44.3         47.7       741
ff2 + attn2             1359         80.9   52.6         65.2       575
ff2 + attn2 + retrain   1349         83.2                           575

Table 1: Decoding times, accuracies, and space savings achieved by two sample operating points on large-qa
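The headline numbers can be recomputed directly from table 1: for example, the ff2 + attn2 + retrain configuration nearly doubles decoding speed at a 1.4 f-point loss. A short sketch of that arithmetic, with row values copied from the table:

```python
# Decoding time (sec), f1, and size (MiB) for selected rows of table 1.
rows = {
    "no pruning":            (2605, 84.6, 1278),
    "ff1 + attn1":           (1631, 82.6, 741),
    "ff2 + attn2 + retrain": (1349, 83.2, 575),
}

base_time, base_f1, base_size = rows["no pruning"]
for name, (t, f1, size) in rows.items():
    speedup = base_time / t
    print(f"{name}: {speedup:.2f}x speedup, "
          f"{base_f1 - f1:.1f} f-point loss, "
          f"{100 * (1 - size / base_size):.0f}% smaller")
# last line printed:
# ff2 + attn2 + retrain: 1.93x speedup, 1.4 f-point loss, 55% smaller
```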

We used identical parameters for our large-qa experiments, except with gradaccsteps=3. Tables report median values across 5 random seeds; graphs overplot results for 5 seeds.

4.1 Accuracy as a function of pruning

In figure 2 we plot the f1 accuracy of base-qa as a function of the percentage of attention heads removed. As expected, the performance of "random" decays most abruptly. "Leave-one-out" and "gain" are better, but substantially similar. "L0 regularization" is best, allowing 50% pruning at a cost of 5 f-points.

In figure 3 we plot the f1 accuracy as a function of the percentage of feed-forward activations removed. We see broadly similar trends as above, except that the performance is robust to even larger pruning. "Leave-one-out" requires a prohibitive number of passes (nL × dI) through the data.

In figure 4 we plot the accuracy for removing embedding dimensions. We see that performance falls much more steeply with the removal of embedding dimensions. Attempts to train "L0 regularization" were unsuccessful; we speculate that the strong cross-layer coupling may necessitate a different learning rate schedule.

4.2 Validating these results

On the basis of the development experiments, we select operating points (values of λattn and λff) and train the gates of large-qa with these penalties. The decoding times, accuracies, and model sizes are summarized in table 1. Models in which both attention and feed-forward components are pruned were produced by combining the independently trained gate configurations of attention and feed-forward. For the same parameter values, the large model is pruned somewhat less than the small model. We also note that the f1 loss due to pruning is somewhat smaller, for the same parameter values. Much of the performance loss can be recovered by continuing the training for an additional epoch after the pruning.

The speedup in decoding due to pruning the model is not simply proportional to the amount pruned. There are computations in both the attention and feed-forward parts of each transformer layer that necessarily remain unpruned, for example layer normalization.

4.3 Impact of pruning each layer

In Fig. 5 we show the percentage of attention heads and feed-forward activations remaining after pruning, by layer. We see that intermediate layers retained more, while layers close to the embedding and close to the answer were pruned more heavily.

5 Conclusion

We investigate various methods to prune transformer-based models, and evaluate the accuracy-speed tradeoff for this pruning. We find that both the attention heads and especially the feed-forward layer can be pruned considerably with minimal loss of accuracy, while pruning the embedding/hidden dimension is ineffective because of the resulting loss in accuracy. We find that L0 regularization pruning, when successful, is considerably more effective than heuristic methods. We also find that pruning the feed-forward layer and the attention heads can be easily combined and, especially after retraining, yield a considerably faster question answering model with minimal loss in accuracy.

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Angela Fan, Edouard Grave, and Armand Joulin. 2019. Reducing transformer depth on demand with structured dropout.

Trevor Gale, Erich Elsen, and Sara Hooker. 2019. The state of sparsity in deep neural networks. CoRR, abs/1902.09574.

Michael Glass, Alfio Gliozzo, Rishav Chakravarti, Anthony Ferritto, Lin Pan, G P Shrivatsa Bhargav, Dinesh Garg, and Avirup Sil. 2019. Span selection pre-training for question answering.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. International Conference on Learning Representations.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. CoRR.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Christos Louizos, Max Welling, and Diederik P. Kingma. 2017. Learning sparse neural networks through L0 regularization.

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The concrete distribution: A continuous relaxation of discrete random variables. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? CoRR, abs/1905.10650.

Shar Narasimhan. 2019. NVIDIA clocks world's fastest BERT training time and largest transformer based model, paving path for advanced conversational AI.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. CoRR, abs/1606.05250.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. CoRR, abs/1806.03822.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278-1286, Bejing, China. PMLR.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2019. Q-BERT: Hessian based ultra low precision quantization of BERT.

Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. 2019. Distilling task-specific knowledge from BERT into simple neural networks.

Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. CoRR, abs/1905.09418.

Ziheng Wang, Jeremy Wohlwend, and Tao Lei. 2019. Structured pruning of large language models.

Ze Yang, Linjun Shou, Ming Gong, Wutao Lin, and Daxin Jiang. 2019a. Model compression with multi-task knowledge distillation for web-scale question answering system.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019b. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.

A Supplemental Material