
Automatic Text Scoring Using Neural Networks

Dimitrios Alikaniotis, Department of Theoretical and Applied Linguistics, University of Cambridge, Cambridge, UK (da352@cam.ac.uk)
Helen Yannakoudakis, The ALTA Institute, Computer Laboratory, University of Cambridge, Cambridge, UK (hy260@cl.cam.ac.uk)
Marek Rei, The ALTA Institute, Computer Laboratory, University of Cambridge, Cambridge, UK (mr472@cl.cam.ac.uk)

arXiv:1606.04289v2 [cs.CL] 16 Jun 2016

Abstract

Automated Text Scoring (ATS) provides a cost-effective and consistent alternative to human marking. However, in order to achieve good performance, the predictive features of the system need to be manually engineered by human experts. We introduce a model that forms word representations by learning the extent to which specific words contribute to the text's score. Using Long-Short Term Memory networks to represent the meaning of texts, we demonstrate that a fully automated framework is able to achieve excellent results over similar approaches. In an attempt to make our results more interpretable, and inspired by recent advances in visualizing neural networks, we introduce a novel method for identifying the regions of the text that the model has found more discriminative.

1 Introduction

Automated Text Scoring (ATS) refers to the set of statistical and natural language processing techniques used to automatically score a text on a marking scale. The advantages of ATS systems have been established since Project Essay Grade (PEG) (Page, 1967; Page, 1968), one of the earliest systems whose development was largely motivated by the prospect of reducing labour-intensive marking activities. In addition to providing a cost-effective and efficient approach to large-scale grading of (extended) text, such systems ensure a consistent application of marking criteria, therefore facilitating equity in scoring.

There is a large body of literature with regard to ATS systems for text produced by non-native English-language learners (Page, 1968; Attali and Burstein, 2006; Rudner and Liang, 2002; Elliot, 2003; Landauer et al., 2003; Briscoe et al., 2010; Yannakoudakis et al., 2011; Sakaguchi et al., 2015, among others), overviews of which can be found in various studies (Williamson, 2009; Dikli, 2006; Shermis and Hammer, 2012). Implicitly or explicitly, previous work has primarily treated text scoring as a supervised text classification task, and has utilized a large selection of techniques, ranging from the use of syntactic parsers, via vectorial semantics combined with dimensionality reduction, to generative and discriminative machine learning.

As multiple factors influence the quality of texts, ATS systems typically exploit a large range of textual features that correspond to different properties of text, such as grammar, vocabulary, style, topic relevance, and discourse coherence and cohesion. In addition to lexical and part-of-speech (POS) ngrams, linguistically deeper features such as types of syntactic constructions, grammatical relations and measures of sentence complexity are among some of the properties that form an ATS system's internal marking criteria. The final representation of a text typically consists of a vector of features that have been manually selected and tuned to predict a score on a marking scale.

Although current approaches to scoring, such as regression and ranking, have been shown to achieve performance that is indistinguishable from that of human examiners, there is substantial manual effort involved in reaching these results on different domains, genres, prompts and so forth. Linguistic features intended to capture the aspects of writing to be assessed are hand-selected and tuned for specific domains. In order to perform well on different data, separate models with distinct feature sets are typically tuned.

Prompted by recent advances in deep learning and the ability of such systems to surpass state-of-the-art models in similar areas (Tang, 2015; Tai et al., 2015), we propose the use of recurrent neural network models for ATS. Multi-layer neural networks are known for automatically learning useful features from data, with lower layers learning basic feature detectors and upper levels learning more high-level abstract features (Lee et al., 2009). Additionally, recurrent neural networks are well-suited for modeling the compositionality of language and have been shown to perform very well on the task of language modeling (Mikolov et al., 2011; Chelba et al., 2013). We therefore propose to apply these network structures to the task of scoring, in order to both improve the performance of ATS systems and learn the required feature representations for each dataset automatically, without the need for manual tuning. More specifically, we focus on predicting a holistic score for extended-response writing items.1

However, automated models are not a panacea, and their deployment depends largely on the ability to examine their characteristics, whether they measure what is intended to be measured, and whether their internal marking criteria can be interpreted in a meaningful and useful way. The deep architecture of neural network models, however, makes it rather difficult to identify and extract those properties of text that the network has identified as discriminative. Therefore, we also describe a preliminary method for visualizing the information the model is exploiting when assigning a specific score to an input text.

1 The task is also referred to as Automated Essay Scoring. Throughout this paper, we use the terms text and essay (scoring) interchangeably.

2 Related Work

In this section, we describe a number of the more influential and/or recent approaches in automated text scoring of non-native English-learner writing.

Project Essay Grade (Page, 1967; Page, 1968; Page, 2003) is one of the earliest automated scoring systems, predicting a score using linear regression over vectors of textual features considered to be proxies of writing quality. Intelligent Essay Assessor (Landauer et al., 2003) uses Latent Semantic Analysis to compute the semantic similarity between texts at specific grade points and a test text, which is assigned a score based on the ones in the training set to which it is most similar. Lonsdale and Strong-Krause (2003) use the Link Grammar parser (Sleator and Templerley, 1995) to analyse and score texts based on the average sentence-level scores calculated from the parser's cost vector. The Bayesian Essay Test Scoring sYstem (Rudner and Liang, 2002) investigates multinomial and Bernoulli Naive Bayes models to classify texts based on shallow content and style features. e-Rater (Attali and Burstein, 2006), developed by the Educational Testing Service, was one of the first systems to be deployed for operational scoring in high-stakes assessments. The model uses a number of different features, including aspects of grammar, vocabulary and style (among others), whose weights are fitted to a marking scheme by regression.

Chen et al. (2010) use a voting algorithm and address text scoring within a weakly supervised bag-of-words framework. Yannakoudakis et al. (2011) extract deep linguistic features and employ a discriminative learning-to-rank model that outperforms regression.

Recently, McNamara et al. (2015) used a hierarchical classification approach to scoring, utilizing linguistic, semantic and rhetorical features, among others. Farra et al. (2015) utilize variants of logistic and linear regression and develop models that score persuasive essays based on features extracted from opinion expressions and topical elements.

There have also been attempts to incorporate more diverse features into text scoring models. Klebanov and Flor (2013) demonstrate that essay scoring performance is improved by adding to the model information about percentages of highly associated, mildly associated and dis-associated pairs of words that co-exist in a given text. Somasundaran et al. (2014) exploit lexical chains and their interaction with discourse elements for evaluating the quality of persuasive essays with respect to discourse coherence. Crossley et al. (2015) identify student attributes, such as standardized test scores, as predictive of writing success and use them in conjunction with textual features to develop essay scoring models.

In 2012, Kaggle,2 sponsored by the Hewlett Foundation, hosted the Automated Student Assessment Prize (ASAP) contest, aiming to demonstrate the capabilities of automated text scoring systems (Shermis, 2015). The dataset released consists of around twenty thousand texts (60% of which are marked), produced by middle-school English-speaking students, which we use as part of our experiments to develop our models.

2 http://www.kaggle.com/c/asap-aes/
3 Models

3.1 C&W Embeddings

Collobert and Weston (2008) and Collobert et al. (2011) introduce a neural network architecture (Fig. 1a) that learns a distributed representation for each word w in a corpus based on its local context. Concretely, suppose we want to learn a representation for some target word w_t found in an n-sized sequence of words S = (w_1, ..., w_t, ..., w_n), based on the other words which exist in the same sequence (∀w_i ∈ S | w_i ≠ w_t). In order to derive this representation, the model learns to discriminate between S and some 'noisy' counterpart S′ in which the target word w_t has been substituted for a randomly sampled word from the vocabulary: S′ = (w_1, ..., w_c, ..., w_n | w_c ∼ V). In this way, every word w is more predictive of its local context than any other random word in the corpus.

Every word in V is mapped to a real-valued vector in Ω via a mapping function C(·) such that C(w_i) = ⟨M_*i⟩, where M ∈ ℝ^(D×|V|) is the embedding matrix and ⟨M_*i⟩ is the ith column of M. The network takes S as input by concatenating the vectors of the words found in it: s_t = ⟨C(w_1)ᵀ ∥ ... ∥ C(w_t)ᵀ ∥ ... ∥ C(w_n)ᵀ⟩ ∈ ℝ^(nD). Similarly, S′ is formed by substituting C(w_t) for C(w_c) ∼ M | w_c ≠ w_t.

The input vector is then passed through a hard tanh layer defined as

    htanh(x) = −1 if x < −1;  x if −1 ≤ x ≤ 1;  1 if x > 1    (1)

which feeds a single linear unit in the output layer. The function that is computed by the network is ultimately given by (4):

    s_t = ⟨M_*1ᵀ ∥ ... ∥ M_*tᵀ ∥ ... ∥ M_*nᵀ⟩    (2)
    i = σ(W_hi s_t + b_h)    (3)
    f(s_t) = W_oh i + b_o    (4)

    f(s), b_o ∈ ℝ^1;  W_oh ∈ ℝ^(H×1);  W_hi ∈ ℝ^(D×H);  s ∈ ℝ^D;  b_h ∈ ℝ^H

where M, W_oh, W_hi, b_o, b_h are learnable parameters; D, H are hyperparameters controlling the size of the input and the hidden layer, respectively; and σ is the application of an element-wise non-linear function (htanh in this case).

The model learns word embeddings by ranking the activation of the true sequence S higher than the activation of its 'noisy' counterpart S′. The objective of the model then becomes to minimize the hinge loss which ensures that the activations of the original and 'noisy' ngrams will differ by at least 1:

    loss_context(target, corrupt) = [1 − f(s_t) + f(s_ck)]_+ ,  ∀k ∈ Z_E    (5)

where E is another hyperparameter controlling the number of 'noisy' sequences we give along with the correct sequence (Mikolov et al., 2013; Gutmann and Hyvärinen, 2012).

Figure 1: Architecture of the original C&W model (left) and of our extended version (right).
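To make the ranking objective concrete, the following is a minimal numpy sketch of Eqs. (1)–(5): a window of word ids is scored by the network and ranked against corrupted windows in which the central word is replaced by a random one. The vocabulary size, dimensionalities and window size are illustrative assumptions rather than the settings used in the paper, and gradient updates are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D, H, n = 1000, 50, 100, 5                   # vocabulary, embedding dim, hidden dim, window size
M    = rng.normal(scale=0.1, size=(D, V))       # embedding matrix, one column per word
W_hi = rng.normal(scale=0.1, size=(H, n * D))   # input -> hidden weights
b_h  = np.zeros(H)
W_oh = rng.normal(scale=0.1, size=(1, H))       # hidden -> single linear output unit
b_o  = np.zeros(1)

def htanh(x):                                   # hard tanh, Eq. (1)
    return np.clip(x, -1.0, 1.0)

def f(word_ids):                                # network score of an n-gram, Eqs. (2)-(4)
    s = M[:, word_ids].T.reshape(-1)            # concatenate the n word vectors
    i = htanh(W_hi @ s + b_h)
    return (W_oh @ i + b_o).item()

def loss_context(window, num_noise=5):          # hinge ranking loss, Eq. (5)
    t = len(window) // 2                        # position of the target word
    true_score = f(window)
    loss = 0.0
    for _ in range(num_noise):                  # E 'noisy' windows with a corrupted target
        corrupt = list(window)
        corrupt[t] = rng.integers(V)
        loss += max(0.0, 1.0 - true_score + f(corrupt))
    return loss

print(loss_context([3, 17, 42, 8, 99]))
```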

3.2 Augmented C&W model

Following Tang (2015), we extend the previous model to capture not only the local linguistic environment of each word, but also how each word contributes to the overall score of the essay. The aim here is to construct representations which, along with the linguistic information given by the linear order of the words in each sentence, are able to capture usage information. Words such as is, are, to, at, which appear with any essay score, are considered to be under-informative in the sense that they will activate equally both on high and low scoring essays. Informative words, on the other hand, are the ones which would have an impact on the essay score (e.g., spelling mistakes).

In order to capture those score-specific word embeddings (SSWEs), we extend (4) by adding a further linear unit in the output layer that performs linear regression, predicting the essay score. Using (2), the activations of the network (presented in Fig. 1b) are given by:

    f_ss(s) = W_oh1 i + b_o1    (6)
    f_context(s) = W_oh2 i + b_o2    (7)

    f_ss(s) ∈ [min(score), max(score)];  b_o1 ∈ ℝ^1;  W_oh1 ∈ ℝ^(1×H)

The error we minimize for f_ss (where ss stands for score specific) is the mean squared error between the predicted ŷ and the actual essay score y:

    loss_score(s) = (1/N) Σ_{i=1}^{N} (ŷ_i − y_i)^2    (8)

From (5) and (8) we compute the overall loss function as a weighted linear combination of the two loss functions (9), back-propagating the error gradients to the embedding matrix M:

    loss_overall(s) = α · loss_context(s, s′) + (1 − α) · loss_score(s)    (9)

where α is the hyper-parameter determining how the two error functions should be weighted. α values closer to 0 will place more weight on the score-specific aspect of the embeddings, whereas values closer to 1 will favour the contextual information.

Fig. 2 shows the advantage of using SSWEs in the present setting. Based solely on the information provided by the linguistic environment, words such as computer and laptop are going to be placed together with their mis-spelled counterparts copmuter and labtop (Fig. 2a). This, however, does not reflect the fact that the mis-spelled words tend to appear in lower scoring essays. Using SSWEs, the correctly spelled words are pulled apart in the vector space from the incorrectly spelled ones, retaining, however, the information that labtop and copmuter are still contextually related (Fig. 2b).

Figure 2: Comparison between standard neural embeddings (a) and score-specific word embeddings (b). By virtue of appearing in similar environments, standard neural embeddings will place the correct and the incorrect spelling closer in the vector space. However, since the mistakes are found in lower scoring essays, SSWEs are able to discriminate between the correct and the incorrect versions without loss in contextual meaning.
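A corresponding sketch of the augmented model's objective (Eqs. (6)–(9)), again in plain numpy and with illustrative sizes: two linear output units share the same hidden layer, and their losses are combined with the weighting factor α. Parameter updates and the lookup into the embedding matrix M are left out.

```python
import numpy as np

rng = np.random.default_rng(1)

D, H, n = 50, 100, 5                       # illustrative sizes, not the paper's settings
W_hi  = rng.normal(scale=0.1, size=(H, n * D))
b_h   = np.zeros(H)
W_oh1 = rng.normal(scale=0.1, size=(1, H)); b_o1 = np.zeros(1)   # score-specific unit, Eq. (6)
W_oh2 = rng.normal(scale=0.1, size=(1, H)); b_o2 = np.zeros(1)   # context (ranking) unit, Eq. (7)

def hidden(s):                             # shared hidden layer with hard tanh, Eq. (3)
    return np.clip(W_hi @ s + b_h, -1.0, 1.0)

def f_ss(s):                               # essay-score regression output, Eq. (6)
    return (W_oh1 @ hidden(s) + b_o1).item()

def f_context(s):                          # ranking output, Eq. (7)
    return (W_oh2 @ hidden(s) + b_o2).item()

def loss_overall(s_true, s_corrupt_list, gold_score, alpha=0.1):
    # Eq. (5): hinge ranking loss between the true window and its corrupted versions
    loss_context = sum(max(0.0, 1.0 - f_context(s_true) + f_context(s_c))
                       for s_c in s_corrupt_list)
    # Eq. (8) for a single window: squared error against the gold essay score
    loss_score = (f_ss(s_true) - gold_score) ** 2
    # Eq. (9): alpha-weighted combination of the two objectives
    return alpha * loss_context + (1.0 - alpha) * loss_score

s_true = rng.normal(size=n * D)                        # concatenated window embeddings
s_corr = [rng.normal(size=n * D) for _ in range(3)]    # corrupted windows
print(loss_overall(s_true, s_corr, gold_score=8.0))
```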
3.3 Long-Short Term Memory Network

We use the SSWEs obtained by our model to derive continuous representations for each essay. We treat each essay as a sequence of tokens and explore the use of uni- and bi-directional (Graves, 2012) Long-Short Term Memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997) in order to embed these sequences in a vector of fixed size. Both uni- and bi-directional LSTMs have been effectively used for embedding long sequences (Hermann et al., 2015). LSTMs are a kind of recurrent neural network (RNN) architecture in which the output at time t is conditioned on the input s both at time t and at time t − 1:

    y_t = W_yh h_t + b_y    (10)
    h_t = H(W_hs s_t + W_hh h_{t−1} + b_h)    (11)

where s_t is the input at time t, and H is usually an element-wise application of a non-linear function. In LSTMs, H is substituted for a composite function defining h_t as:

    i_t = σ(W_is s_t + W_ih h_{t−1} + W_ic c_{t−1} + b_i)    (12)
    f_t = σ(W_fs s_t + W_fh h_{t−1} + W_fc c_{t−1} + b_f)    (13)
    c_t = i_t ⊙ g(W_cs s_t + W_ch h_{t−1} + b_c) + f_t ⊙ c_{t−1}    (14)
    o_t = σ(W_os s_t + W_oh h_{t−1} + W_oc c_t + b_o)    (15)
    h_t = o_t ⊙ h(c_t)    (16)

where g, σ and h are element-wise non-linear functions such as the logistic sigmoid (1/(1 + e^{−x})) and the hyperbolic tangent ((e^{2z} − 1)/(e^{2z} + 1)); ⊙ is the Hadamard product; W, b are the learned weights and biases respectively; and i, f, o and c are the input, forget, output gates and the cell activation vectors respectively.

Training the LSTM in a uni-directional manner (i.e., from left to right) might leave out important information about the sentence. For example, our interpretation of a word at some point t_i might be different once we know the word at t_{i+5}. An effective way to get around this issue has been to train the LSTM in a bidirectional manner. This requires doing both a forward and a backward pass of the sequence (i.e., feeding the words from left to right and from right to left). The hidden layer element in (10) can therefore be re-written as the concatenation of the forward and backward hidden vectors:

    y_t = W_yh [h⃖_t ; h⃗_t] + b_y    (17)

We feed the embedding of each word found in each essay to the LSTM one at a time, zero-padding shorter sequences. We form D-dimensional essay embeddings by taking the activation of the LSTM layer at the timestep where the last word of the essay was presented to the network. In the case of bi-directional LSTMs, the two independent passes of the essay (from left to right and from right to left) are concatenated together to predict the essay score. These essay embeddings are then fed to a linear unit in the output layer which predicts the essay score (Fig. 3). We use the mean square error between the predicted and the gold score as our loss function, and optimize with RMSprop (Dauphin et al., 2015), propagating the errors back to the word embeddings.3

Figure 3: A single-layer Long Short Term Memory (LSTM) network. The word vectors w_i enter the input layer one at a time. The hidden layer that has been formed at the last timestep is used to predict the essay score using linear regression. We also explore the use of bi-directional LSTMs (dashed arrows). For 'deeper' representations, we can stack more LSTM layers after the hidden layer shown here.

3 The maximum time for jointly training a particular SSWE + LSTM combination took about 55–60 hours on an Amazon EC2 g2.2xlarge instance (average time was 27–30 hours).
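The essay-level model of this subsection can be sketched as follows in PyTorch (the authors' released implementation may differ; the layer sizes and toy data here are placeholders). Note that taking the activation at the last timestep of a bidirectional layer is a simplification of the two-pass concatenation described above.

```python
import torch
import torch.nn as nn

class EssayScorer(nn.Module):
    """Sketch of the (B)LSTM regressor: SSWE lookup -> LSTM -> last timestep -> linear score."""
    def __init__(self, sswe_matrix, hidden_size=10, bidirectional=True):
        super().__init__()
        # sswe_matrix: |V| x D tensor of score-specific word embeddings, kept trainable
        # so that errors can be propagated back to the embeddings as described above.
        self.embed = nn.Embedding.from_pretrained(sswe_matrix, freeze=False)
        self.lstm = nn.LSTM(sswe_matrix.size(1), hidden_size,
                            batch_first=True, bidirectional=bidirectional)
        out_dim = hidden_size * (2 if bidirectional else 1)
        self.out = nn.Linear(out_dim, 1)            # single linear unit predicting the score

    def forward(self, token_ids):                   # token_ids: batch x max_len (zero-padded)
        h, _ = self.lstm(self.embed(token_ids))     # h: batch x max_len x out_dim
        return self.out(h[:, -1, :]).squeeze(-1)    # activation at the last timestep

# Hypothetical toy data: 2 essays of length 6, vocabulary of 100 words, D = 50.
sswe = torch.randn(100, 50)
model = EssayScorer(sswe)
essays = torch.randint(0, 100, (2, 6))
gold = torch.tensor([4.0, 9.0])

optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
loss = nn.functional.mse_loss(model(essays), gold)  # mean squared error loss
loss.backward()
optimizer.step()
```

In practice, variable-length essays would be handled with padding and masking (or packed sequences) rather than the fixed-length toy batch used here.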
3.4 Other Baselines

We train a Support Vector Regression model (see Section 4), which is one of the most widely used approaches in text scoring. We parse the data using the RASP parser (Briscoe et al., 2006) and extract a number of different features for assessing the quality of the essays. More specifically, we use character and part-of-speech unigrams, bigrams and trigrams; word unigrams, bigrams and trigrams where we replace open-class words with their POS; and the distribution of common nouns, prepositions, and coordinators. Additionally, we extract and use as features the rules from the phrase-structure tree based on the top parse for each sentence, as well as an estimate of the error rate based on manually-derived error rules.

N-grams are weighted using tf–idf, while the rest are count-based and scaled so that all features have approximately the same order of magnitude. The final input vectors are unit-normalized to account for varying text-length biases.

Further to the above, we also explore the use of the Distributed Memory Model of Paragraph Vectors (PV-DM) proposed by Le and Mikolov (2014), as a means to directly obtain essay embeddings. PV-DM takes as input word vectors which make up ngram sequences and uses those to predict the next word in the sequence. A feature of PV-DM, however, is that each 'paragraph' is assigned a unique vector which is used in the prediction. This vector, therefore, acts as a 'memory', retaining information from all contexts that have appeared in this paragraph. Paragraph vectors are then fed to a linear regression model to obtain essay scores (we refer to this model as doc2vec).

Additionally, we explore the effect of our score-specific method for learning word embeddings, when compared against three different kinds of word embeddings:

• word2vec embeddings (Mikolov et al., 2013) trained on our training set (see Section 4).

• Publicly available word2vec embeddings (Mikolov et al., 2013) pre-trained on the Google News corpus (ca. 100 billion words), which have been very effective in capturing solely contextual information.

• Embeddings that are constructed on the fly by the LSTM, by propagating the errors from its hidden layer back to the embedding matrix (i.e., we do not provide any pre-trained word embeddings).4

4 Another option would be to use standard C&W embeddings; however, this is equivalent to using SSWEs with α = 1, which we found to produce low results.
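The Support Vector Regression baseline described at the start of this subsection can be approximated with scikit-learn as below. For brevity the sketch uses only tf–idf-weighted word n-grams; the full system also includes character and POS n-grams, phrase-structure rules and error-rate features derived from the RASP parser, and the essay texts and scores shown are toy placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVR

# Toy stand-ins for the training data; the real feature set is far richer (see above).
essays = ["The internet is a great tool for research .",
          "Being patience is being a good person ."]
scores = [9.0, 4.0]

svr_baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),   # word uni-, bi- and trigrams, tf-idf weighted
    Normalizer(),                          # unit-normalize the final feature vectors
    SVR(),
)
svr_baseline.fit(essays, scores)
print(svr_baseline.predict(["The internet is a great tool ."]))
```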
4 Dataset

The Kaggle dataset contains 12,976 essays ranging from 150 to 550 words each, marked by two raters (Cohen's κ = 0.86). The essays were written by students ranging from Grade 7 to Grade 10, comprising eight distinct sets elicited by eight different prompts, each with distinct marking criteria and score range.5 For our experiments, we use the resolved combined score between the two raters, which is calculated as the average between the two raters' scores (if the scores are close), or is determined by a third expert (if the scores are far apart). Currently, the state-of-the-art on this dataset has achieved a Cohen's κ = 0.81 (using quadratic weights). However, the test set was released without the gold score annotations, rendering any comparisons futile, and we are therefore restricted to splitting the given training set to create a new test set.

The sets were divided as follows: 80% of the entire dataset was reserved for training/validation, and 20% for testing. 80% of the training/validation subset was used for actual training, while the remaining 20% was used for validation (in absolute terms for the entire dataset: 64% training, 16% validation, 20% testing). To facilitate future work, we release the ids of the validation and test set essays we used in our experiments, in addition to our source code and various hyperparameter values.6

5 Five prompts employed a holistic scoring rubric, one was scored with a two-trait rubric, and two were scored with a multi-trait rubric, but reported as a holistic score (Shermis and Hammer, 2012).

6 The code, by-model hyperparameter configurations and the IDs of the testing set are available at https://github.com/dimalik/ats/.
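A short sketch of the score resolution and the 64%/16%/20% split described above, using scikit-learn's train_test_split; the closeness threshold used to decide when the two raters' scores are averaged is an assumption, as the paper does not state it.

```python
from sklearn.model_selection import train_test_split

def resolved_score(r1, r2, third=None, max_gap=1):
    """Combined score: average of the two raters if they are close,
    otherwise the third expert's score (the gap threshold is an assumption)."""
    return (r1 + r2) / 2.0 if abs(r1 - r2) <= max_gap else third

# essay_ids and scores would come from the released Kaggle training file.
essay_ids = list(range(12976))
scores = [resolved_score(6, 7) for _ in essay_ids]   # placeholder gold scores

# 80% train+validation / 20% test, then 80/20 again: 64% / 16% / 20% overall.
trainval_ids, test_ids, trainval_y, test_y = train_test_split(
    essay_ids, scores, test_size=0.20, random_state=0)
train_ids, val_ids, train_y, val_y = train_test_split(
    trainval_ids, trainval_y, test_size=0.20, random_state=0)

print(len(train_ids), len(val_ids), len(test_ids))   # roughly 64% / 16% / 20% of 12,976
```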
5 Experiments

5.1 Results

The hyperparameters for our model were as follows: the sizes of the layers H and D, the learning rate η, the window size n, the number of 'noisy' sequences E, and the weighting factor α. Additionally, the hyperparameters of the LSTM were the size of the LSTM layer D_LSTM as well as the dropout rate r.
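The search over these hyperparameters is described in more detail after Table 1; as a rough illustration, a GP-based search in the spirit of Snoek et al. (2012) could be set up with scikit-optimize as below. The parameter ranges and the stubbed objective are assumptions, not the configuration used in the paper.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Search space for a subset of the hyperparameters listed above; the ranges are
# illustrative assumptions.
space = [
    Real(1e-8, 1e-2, prior="log-uniform", name="eta"),   # learning rate
    Integer(3, 11, name="window_n"),                     # window size n
    Integer(5, 200, name="noise_E"),                     # number of 'noisy' sequences E
    Real(0.0, 1.0, name="alpha"),                        # weighting factor alpha
    Integer(10, 100, name="lstm_size"),                  # D_LSTM
]

def validation_rmse(params):
    """Train an SSWE + LSTM model with `params` and return its validation RMSE.
    Stubbed here with a dummy value; a real objective would train and evaluate a model."""
    eta, window_n, noise_e, alpha, lstm_size = params
    return 5.0 + abs(alpha - 0.1)   # placeholder objective

result = gp_minimize(validation_rmse, space, n_calls=20, random_state=0)
print(result.x, result.fun)   # best hyperparameters and the corresponding RMSE
```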
Model Spearman’s ρ Pearson r RMSE Cohen’s κ
doc2vec 0.62 0.63 4.43 0.85
SVM 0.78 0.77 8.85 0.75
LSTM 0.59 0.60 6.8 0.54
BLSTM 0.7 0.5 7.32 0.36
Two-layer LSTM 0.58 0.55 7.16 0.46
Two-layer BLSTM 0.68 0.52 7.31 0.48
word2vec + LSTM 0.68 0.77 5.39 0.76
word2vec + BLSTM 0.75 0.86 4.34 0.85
word2vec + Two-layer LSTM 0.76 0.71 6.02 0.69
word2vec + Two-layer BLSTM 0.78 0.83 4.79 0.82
word2vecpre-trained + Two-layer BLSTM 0.79 0.91 3.2 0.92
SSWE + LSTM 0.8 0.94 2.9 0.94
SSWE + BLSTM 0.8 0.92 3.21 0.95
SSWE + Two-layer LSTM 0.82 0.93 3 0.94
SSWE + Two-layer BLSTM 0.91 0.96 2.4 0.96

Table 1: Results of the different models on the Kaggle dataset. All resulting vectors were trained
using linear regression. We optimized the parameters using a separate validation set (see text) and
report the results on the test set.
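The four evaluation metrics reported in Table 1 can be computed with scipy and scikit-learn as in the following sketch; rounding the scores to integers before computing the quadratically weighted κ is an implementation assumption.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

def report(gold, predicted):
    """The four figures reported in Table 1 for a set of predictions."""
    gold, predicted = np.asarray(gold), np.asarray(predicted)
    rho, _ = spearmanr(gold, predicted)                # Spearman's rank correlation
    r, _ = pearsonr(gold, predicted)                   # Pearson's product-moment correlation
    rmse = np.sqrt(np.mean((gold - predicted) ** 2))   # root mean square error
    # Cohen's kappa with quadratic weights expects discrete labels, hence the rounding.
    kappa = cohen_kappa_score(gold.round().astype(int),
                              predicted.round().astype(int),
                              weights="quadratic")
    return rho, r, rmse, kappa

print(report([2, 8, 6, 10, 4], [3, 7, 6, 9, 5]))
```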

Since the search space would be massive for grid search, the best hyperparameters were determined using Bayesian Optimization (Snoek et al., 2012). In this context, the performance of our models on the validation set is modeled as a sample from a Gaussian process (GP) by constructing a probabilistic model for the error function and then exploiting this model to make decisions about where to next evaluate the function. The hyperparameters for our baselines were also determined using the same methodology.

All models are trained on our training set (see Section 4), except the one prefixed 'word2vecpre-trained', which uses pre-trained embeddings on the Google News Corpus. We report Spearman's rank correlation coefficient ρ, Pearson's product-moment correlation coefficient r, and the root mean square error (RMSE) between the predicted scores and the gold standard on our test set, which are considered more appropriate metrics for evaluating essay scoring systems (Yannakoudakis and Cummins, 2015). However, we also report Cohen's κ with quadratic weights, which was the evaluation metric used in the Kaggle competition. Performance of the models is shown in Table 1.

In terms of correlation, SVMs produce competitive results (ρ = 0.78 and r = 0.77), outperforming doc2vec, LSTM and BLSTM, as well as their deep counterparts. As described above, the SVM model has rich linguistic knowledge and consists of hand-picked features which have achieved excellent performance in similar tasks (Yannakoudakis et al., 2011). However, in terms of RMSE, it is among the lowest performing models (8.85), together with 'BLSTM' and 'Two-layer BLSTM'. Deep models in combination with word2vec (i.e., 'word2vec + Two-layer LSTM' and 'word2vec + Two-layer BLSTM') and SVMs are comparable in terms of r and ρ, though not in terms of RMSE, where the former produce better results, with RMSE improving by half (4.79). doc2vec also produces competitive RMSE results (4.43), though correlation is much lower (ρ = 0.62 and r = 0.63).

The two BLSTMs trained with word2vec embeddings are among the most competitive models in terms of correlation and outperform all the models, except the ones using pre-trained embeddings and SSWEs. Increasing the number of hidden layers and/or adding bi-directionality does not always improve performance, but it clearly helps in this case and performance improves compared to their uni-directional counterparts.

Using pre-trained word embeddings improves the results further. More specifically, we found 'word2vecpre-trained + Two-layer BLSTM' to be the best configuration, increasing correlation to 0.79 ρ and 0.91 r, and reducing RMSE to 3.2. We note however that this is not an entirely fair comparison, as these are trained on a much
larger corpus than our training set (which we use to train our models). Nevertheless, when we use our SSWE models we are able to outperform 'word2vecpre-trained + Two-layer BLSTM', even though our embeddings are trained on fewer data points. More specifically, our best model ('SSWE + Two-layer BLSTM') improves correlation to ρ = 0.91 and r = 0.96, as well as RMSE to 2.4, giving a maximum increase of around 10% in correlation. Given the results of the pre-trained model, we believe that the performance of our best SSWE model will further improve should more training data be given to it.7

5.2 Discussion

Our SSWE + LSTM approach, having no prior knowledge of the grammar of the language or the domain of the text, is able to score the essays in a very human-like way, outperforming other state-of-the-art systems. Furthermore, while we tuned the models' hyperparameters on a separate validation set, we did not perform any further pre-processing of the text other than simple tokenization.

In the essay scoring literature, text length tends to be a strong predictor of the overall score. In order to investigate any possible effects of essay length, we also calculate the correlation between the gold scores and the length of the essays. We find that the correlations on the test set are relatively low (r = 0.3, ρ = 0.44), and therefore conclude that there are no such strong effects.

As described above, we used Bayesian Optimization to find optimal hyperparameter configurations in fewer steps than in regular grid search. Using this approach, the optimization model showed some clear preferences for some parameters which were associated with better scoring models:8 the number of 'noisy' sequences E, the weighting factor α and the size of the LSTM layer D_LSTM. The optimal α value was consistently set to 0.1, which shows that our SSWE approach was necessary to capture the usage of the words. Performance dropped considerably as α increased (less weight on SSWEs and more on the contextual aspect). When using α = 1, which is equivalent to using the basic C&W model, we found that performance was considerably lower (e.g., correlation dropped to ρ = 0.15).

The number of 'noisy' sequences was set to 200, which was the highest possible setting we considered, although this might be related more to the size of the corpus (see Mikolov et al. (2013) for a similar discussion) rather than to our approach. Finally, the optimal value for D_LSTM was 10 (the lowest value investigated), which again may be corpus-dependent.

6 Visualizing the black box

In this section, inspired by recent advances in (de-)convolutional neural networks in computer vision (Simonyan et al., 2013) and text summarization (Denil et al., 2014), we introduce a novel method of generating interpretable visualizations of the network's performance. In the present context, this is particularly important as one advantage of the manual methods discussed in § 2 is that we are able to know on what grounds the model made its decisions and which features are most discriminative.

At the outset, our goal is to assess the 'quality' of our word vectors. By 'quality' we mean the level to which a word appearing in a particular context would prove to be problematic for the network's prediction. In order to identify 'high' and 'low' quality vectors, we perform a single pass of an essay from left to right and let the LSTM make its score prediction. Normally, we would provide the gold scores and adjust the network weights based on the error gradients. Instead, we provide the network with a pseudo-score by taking the maximum score this specific essay can take9 and provide this as the 'gold' score. If the word vector is of 'high' quality (i.e., associated with higher scoring texts), then there is going to be little adjustment to the weights in order to predict the highest score possible. Conversely, providing the minimum possible score (here 0), we can assess how 'bad' our word vectors are. Vectors which require minimal adjustment to reach the lowest score are considered of 'lower' quality. Note that since we do a complete pass over the network (without doing any weight updates), the vector quality is going to be essay dependent.

7 Our approach outperforms all the other models in terms of Cohen's κ too.

8 For the best scoring model the hyperparameters were as follows: D = 200, H = 100, η = 1e−7, n = 9, E = 200, α = 0.1, D_LSTM = 10, r = 0.5.

9 Note that in the Kaggle dataset essays from different essay sets have different maximum scores. Here we take as ỹ_max the essay set maximum rather than the global maximum.
. . . way to show that Saeng is a determined . . . .
. . . sometimes I do . Being patience is being . . .
. . . which leaves the reader satisfied . . .
. . . is in this picture the cyclist is riding a dry and area which could mean that it is very
and the looks to be going down hill there looks to be a lot of turns . . . .
. . . The only reason im putting this in my own way is because know one is
patient in my family . . . .
. . . Whether they are building hand-eye coordination , researching a country , or family and
friends through @CAPS3 , @CAPS2 , @CAPS6 the internet is highly and
I hope you feel the same way .

Table 2: Several example visualizations created by our LSTM. The full text of the essay is shown in black
and the ‘quality’ of the word vectors appears in color on a range from dark red (low quality) to dark green
(high quality).

Concretely, using the network function f(x) as computed by Eq. (12)–(17), we can approximate the loss induced by feeding the pseudo-scores by taking the magnitude of each error vector (18)–(19). Since lim_{‖w‖₂→0} ŷ = y, this magnitude should tell us how much an embedding needs to change in order to achieve the gold score (here the pseudo-score). In the case where we provide the minimum as a pseudo-score, a ‖w‖₂ value closer to zero would indicate an incorrectly used word. For the results reported here, we combine the magnitudes produced from giving the maximum and minimum pseudo-scores into a single score, computed as L(ỹ_max, f(x)) − L(ỹ_min, f(x)), where:

    L(ỹ, f(x)) ≈ ‖w‖₂    (18)
    w = ∇L(x) = ∂L/∂x |_(ỹ, f(x))    (19)

where ‖w‖₂ is the vector Euclidean norm ‖w‖₂ = √(Σ_{i=1}^{N} w_i²); L(·) is the mean squared error as in
We show some examples of this visualization nal scoring criteria, and showed that such models
procedure in Table 2. The model is capable of pro- are interpretable and can be further exploited to
viding positive feedback. Correctly placed punctua- provide useful feedback to the author.
tion or long-distance dependencies (as in Sentence
6 are . . . researching) are particularly favoured by Acknowledgments
the model. Conversely, the model does not deal The first author is supported by the Onassis Founda-
well with proper names, but is able to cope with tion. We would like to thank the three anonymous
POS mistakes (e.g., Being patience or the internet reviewers for their valuable feedback.
is highly and . . . ). However, as seen in Sentence 3
the model is not perfect and returns a false negative 10
We note that the same visualization technique can be
in the case of satisfied. used to show the ‘goodness’ of phrases/sentences. Within the
One potential drawback of this approach is that phrase setting, after feeding the last word of the phrase to the
the gradients are calculated only after the end of the network, the LSTM layer will contain the phrase embedding.
Then, we can assess the ‘goodness’ of this embedding by eval-
essay. This means that if a word appears multiple uating the error gradients after predicting the highest/lowest
times within an essay, sometimes correctly and score.
References

[Attali and Burstein2006] Yigal Attali and Jill Burstein. 2006. Automated essay scoring with e-Rater v.2.0. Journal of Technology, Learning, and Assessment, 4(3):1–30.
[Briscoe et al.2006] Ted Briscoe, John Carroll, and Rebecca Watson. 2006. The second release of the RASP system. In Proceedings of the COLING/ACL, volume 6.
[Briscoe et al.2010] Ted Briscoe, Ben Medlock, and Øistein E. Andersen. 2010. Automated assessment of ESOL free text examinations. Technical Report UCAM-CL-TR-790, University of Cambridge, Computer Laboratory, November.
[Chelba et al.2013] Ciprian Chelba, Tomáš Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. In arXiv preprint.
[Chen et al.2010] YY Chen, CL Liu, TH Chang, and CH Lee. 2010. An Unsupervised Automated Essay Scoring System. IEEE Intelligent Systems, pages 61–67.
[Collobert and Weston2008] Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. Proceedings of the Twenty-Fifth International Conference on Machine Learning, pages 160–167, July.
[Collobert et al.2011] Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. March.
[Crossley et al.2015] Scott Crossley, Laura K Allen, Erica L Snow, and Danielle S McNamara. 2015. Pssst... textual features... there is more to automatic essay scoring than just you! In Proceedings of the Fifth International Conference on Learning Analytics And Knowledge, pages 203–207. ACM.
[Dauphin et al.2015] Yann N. Dauphin, Harm de Vries, and Yoshua Bengio. 2015. Equilibrated adaptive learning rates for non-convex optimization. February.
[Denil et al.2014] Misha Denil, Alban Demiraj, Nal Kalchbrenner, Phil Blunsom, and Nando de Freitas. 2014. Modelling, visualising and summarising documents with a single convolutional neural network. June.
[Dikli2006] Semire Dikli. 2006. An overview of automated scoring of essays. Journal of Technology, Learning, and Assessment, 5(1).
[Elliot2003] S. Elliot. 2003. Intellimetric™: From here to validity. In M. D. Shermis and J. Burnstein, editors, Automated Essay Scoring: A Cross-Disciplinary Perspective, pages 71–86. Lawrence Erlbaum Associates.
[Farra et al.2015] Noura Farra, Swapna Somasundaran, and Jill Burstein. 2015. Scoring persuasive essays using opinions and their targets. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 64–74.
[Graves2012] Alex Graves. 2012. Supervised Sequence Labelling with Recurrent Neural Networks. Springer Berlin Heidelberg.
[Gutmann and Hyvärinen2012] Michael U. Gutmann and Aapo Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res., 13:307–361, February.
[Hermann et al.2015] Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. June.
[Hochreiter and Schmidhuber1997] S Hochreiter and J Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
[Klebanov and Flor2013] Beata Beigman Klebanov and Michael Flor. 2013. Word association profiles and their use for automated scoring of essays. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1148–1158.
[Landauer et al.2003] Thomas K. Landauer, Darrell Laham, and Peter W. Foltz. 2003. Automated scoring and annotation of essays with the Intelligent Essay Assessor. In M.D. Shermis and J.C. Burstein, editors, Automated essay scoring: A cross-disciplinary perspective, pages 87–112.
[Le and Mikolov2014] Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. May.
[Lee et al.2009] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. 2009. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09.
[Lonsdale and Strong-Krause2003] Deryle Lonsdale and D. Strong-Krause. 2003. Automated rating of ESL essays. In Proceedings of the HLT-NAACL 2003 Workshop: Building Educational Applications Using Natural Language Processing.
[McNamara et al.2015] Danielle S McNamara, Scott A Crossley, Rod D Roscoe, Laura K Allen, and Jianmin Dai. 2015. A hierarchical classification approach to automated essay scoring. Assessing Writing, 23:35–59.
[Mikolov et al.2011] Tomáš Mikolov, Stefan Kombrink, Anoop Deoras, Lukáš Burget, and Jan Černocký. 2011. RNNLM - Recurrent neural network language modeling toolkit. In ASRU 2011 Demo Session.
[Mikolov et al.2013] Tomas Mikolov, I Sutskever, K Chen, G S Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
[Page1967] Ellis B. Page. 1967. Grading essays by computer: progress report. In Proceedings of the Invitational Conference on Testing Problems, pages 87–100.
[Page1968] Ellis B. Page. 1968. The use of the computer in analyzing student essays. International Review of Education, 14(2):210–225, June.
[Page2003] E.B. Page. 2003. Project essay grade: PEG. In M.D. Shermis and J.C. Burstein, editors, Automated essay scoring: A cross-disciplinary perspective, pages 43–54.
[Rudner and Liang2002] L.M. Rudner and Tahung Liang. 2002. Automated essay scoring using Bayes' theorem. The Journal of Technology, Learning and Assessment, 1(2):3–21.
[Sakaguchi et al.2015] Keisuke Sakaguchi, Michael Heilman, and Nitin Madnani. 2015. Effective feature integration for automated short answer scoring. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications.
[Shermis and Hammer2012] M Shermis and B Hammer. 2012. Contrasting state-of-the-art automated scoring of essays: analysis. Technical report, The University of Akron and Kaggle.
[Shermis2015] Mark D Shermis. 2015. Contrasting state-of-the-art in the machine scoring of short-form constructed responses. Educational Assessment, 20(1):46–65.
[Simonyan et al.2013] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. December.
[Sleator and Templerley1995] D.D.K. Sleator and D. Templerley. 1995. Parsing English with a link grammar. Proceedings of the 3rd International Workshop on Parsing Technologies, ACL.
[Snoek et al.2012] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical bayesian optimization of machine learning algorithms. June.
[Somasundaran et al.2014] Swapna Somasundaran, Jill Burstein, and Martin Chodorow. 2014. Lexical chaining for measuring discourse coherence quality in test-taker essays. In COLING, pages 950–961.
[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. September.
[Tai et al.2015] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. February.
[Tang2015] Duyu Tang. 2015. Sentiment-specific representation learning for document-level sentiment analysis. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining - WSDM '15. Association for Computing Machinery (ACM).
[Williamson2009] D. M. Williamson. 2009. A framework for implementing automated scoring. Technical report, Educational Testing Service.
[Yannakoudakis and Cummins2015] Helen Yannakoudakis and Ronan Cummins. 2015. Evaluating the performance of automated text scoring systems. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics (ACL).
[Yannakoudakis et al.2011] Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pages 180–189.
