
Int J Artif Intell Educ

DOI 10.1007/s40593-016-0122-z
ARTICLE

Assessing Text-Based Writing of Low-Skilled College Students

Dolores Perin 1 & Mark Lauterbach 2

© International Artificial Intelligence in Education Society 2016

Abstract The problem of poor writing skills at the postsecondary level is a large and
troubling one. This study investigated the writing skills of low-skilled adults attending
college developmental education courses by determining whether variables from an
automated scoring system were predictive of human scores on writing quality rubrics.
The human-scored measures were a holistic quality rating for a persuasive essay and an
analytic quality score for a written summary. Both writing samples were based on text
on psychology and sociology topics related to content taught at the introductory
undergraduate level. The study is a modified replication of McNamara et al. (Written
Communication, 27(1), 57–86, 2010), who identified several Coh-Metrix variables from
five linguistic classes that reliably predicted group membership (high versus low
proficiency) using human quality scores on persuasive essays written by average-
achieving college students. When discriminant analyses and ANOVAs failed to repli-
cate the McNamara et al. (Written Communication, 27(1), 57–86, 2010) findings, the
current study proceeded to analyze all of the variables in the five Coh-Metrix classes.
This larger analysis identified 10 variables that predicted human-scored writing profi-
ciency. Essay and summary scores were predicted by different automated variables.
Implications for instruction and future use of automated scoring to understand the
writing of low-skilled adults are discussed.

Keywords Writing skills . Automated scoring . Adult students . Persuasive essay . Written summary

* Dolores Perin
perin@tc.edu

Mark Lauterbach
markl@brooklyn.cuny.edu

1 Teachers College, Columbia University, 525 W. 120th Street, Box 70, New York, NY 10027, USA
2 Brooklyn College, City University of New York, 2401 James Hall, 2900 Bedford Avenue,
  Brooklyn, NY 11210, USA

Introduction

In recent years there has been a proliferation of automated scoring systems to analyze
written text including student-generated responses (Magliano and Graesser 2012;
Shermis et al. 2016). In particular, automated scoring has been used to assess the
writing ability of students who vary in age and language proficiency (Crossley et al.
2016; Weigle 2013; Wilson et al. 2016). Because of its precision and consistency in
analyzing linguistic and structural aspects of writing (Deane and Quinlan 2010),
automated scoring has potential to contribute to an understanding of the instructional
needs of poor writers. One segment of this population is adults who have completed
secondary education with low skills but have nevertheless been admitted to postsec-
ondary institutions and aspire to earn a college degree (MacArthur et al. 2016). This is a
large population with low literacy skills who often attend college developmental (also
known as remedial) courses (Bailey et al. 2010).
Automated essay scoring appears to perform well for the purpose of assessing
readiness for college writing demands (Elliot et al. 2012; Klobucar et al. 2013;
Ramineni 2013). However, studies using automated scoring to understand the specific
nature and pattern of adults’ writing have focused on college students who are average
writers or English language learners (e.g., Crossley et al. 2012; McNamara et al. 2013;
Reilly et al. 2014; Zhang 2015), while students explicitly identified as low-skilled have
been overlooked. This paper reports an analysis of the writing of low-skilled college
students, utilizing both automated and human scoring.
Our initial aim was to determine whether linguistic variables analyzed using the
Coh-Metrix automated scoring system (Graesser et al. 2004), previously found to
predict human-scored writing quality in a sample of average-performing college
undergraduates (McNamara et al. 2010), also held for our lower-skilled group. To do
this, we analyzed a corpus of writing samples gathered in a larger study (Perin et al.
2015). McNamara et al. (2010) ran all of the Coh-Metrix measures and identified the
three strongest predictors of human-scored writing quality. We analyzed our corpus
using these three variables but did not find them to be statistically significant predictors.
Consequently, we expanded the analysis to repeat the McNamara et al. (2010) procedure of
running all of the Coh-Metrix variables in the five linguistic classes used in the prior
study.

Automated Scoring to Supplement Human Writing Evaluation

Automated scoring systems have for the most part been developed to ease the human
burden of scoring students’ writing, and, to validate them, major effort has been
expended in attempts to predict expert human ratings from machine-generated scores
(Bridgeman 2013). When used in the classroom, machine-based writing assessment can
free the teacher from the laborious process of evaluating students’ writing. In conse-
quence, more writing can be assigned, resulting in more instruction and practice for
students (Shermis et al. 2016). Another major purpose of using automated scoring is to
understand reading comprehension and writing processes through the lens of linguistic
variables and text difficulty (Abdi et al. 2016; McNamara et al. 2014).
Objections to automated scoring of writing have been expressed (Deane 2013;
Magliano and Graesser 2012; Perelman 2013; Shermis et al. 2016; Winerip 2012,
April 22). Critics argue that some scores may be technically flawed, which can lead to
misleading feedback to teachers and students. Further, automated scoring systems
cannot yet interpret the meaning of a piece of writing, identify off-topic content, or
determine whether it is well argued. Moreover, the methodology of studies concluding
that automated scores are just as accurate as human scores (Shermis and Hamner 2013)
has been questioned (Perelman 2013). In addition, "signaling effects" (Shermis et al.
2016, p. 405) have been noted by which a test-taker’s performance may be altered
through knowledge that the writing will be evaluated by a machine rather than a person.
Designers of a Massive Open Online Course (MOOC) noted potential signaling
problems: "AES [i.e. automated essay scoring] largely eliminates human readers,
whereas … what student writers need and desire above all else is a respectful reader
who will attend to their writing with care and respond to it with understanding of its
aims" (Comer and White 2016, p. 327).
Balancing the difficulties of automated scoring against its compelling advantages of
speed and volume, these systems may be best used in conjunction with, rather
than as a replacement for, human scoring (Ramineni 2013). It should also be kept in
mind that human scoring is hardly without its problems, and may suffer from low
interrater reliability (Bridgeman 2013).

Postsecondary Writing Skills

Comparisons of literacy skills expected upon exit from secondary education and
those required for postsecondary learning reveal considerable misalignment
(Acker 2008; Williamson 2008), which may help explain why many high school
graduates are underprepared for college writing demands (National Center for
Education Statistics 2012). Approximately 68 % of colleges in the United States
offer developmental writing courses, and 14 % of entering students enroll in them,
with a larger proportion (approximately 23 %) in community colleges (Parsad and
Lewis 2003).
In postsecondary courses across the subject areas, students are expected to be able to
respond appropriately to an assigned prompt. To write effectively at this level, the
student needs to be sensitive to the informational needs of readers, and produce text
relevant to a given purpose. Concepts should be discussed coherently and accurately,
and the writing should conform to accepted conventions of spelling and grammar
(Fallahi 2012; MacArthur et al. 2015). Writing is frequently assigned in subject-area
classrooms, when students are expected to summarize, synthesize, analyze and respond
to information taught in a course, and adduce arguments for a position (O’Neill et al.
2012). Two types of writing that have long featured prominently in subject-
area assignments at the college level are persuasive writing and written summarization
(Bridgeman and Carlson 1984; Hale et al. 1996; Wolfe 2011), both of which are often
based on sources, demonstrating close connections between writing and reading
comprehension processes (Shanahan 2016). Although essay writing and summarization
can be considered as two different genres, in fact they overlap considerably in the sense
that essays often contain summarized information. Although elementary and secondary
teachers report assigning summary-writing, college instructors rarely do (Burstein et al.
2014), which may be because summarization becomes more of an invisible skill,
embedded in frequently-assigned tasks such as proposal writing.

Persuasive Writing

In a persuasive essay, the writer states and defends an opinion on an issue. At the more
advanced level of argumentative writing, an opposing position is acknowledged and
rebutted, leading to a reasoned conclusion (De La Paz et al. 2012; Hillocks 2011;
Newell et al. 2011). The ability to write a persuasive or argumentative essay is more
than just writing; it is a demonstration of critical thinking (Ferretti et al. 2007). To write
a persuasive essay the writer must first recognize that there is an issue to argue about
and then, through a series of steps, suggest a resolution to the problem (Golder and
Coirier 1994). In the west, the structure of argumentation consists of the statement of a
position, or claim; reasons for the position; an explanation and elaboration of the
reasons; anticipation of alternative positions, or counterargument; rebuttal of the
counterargument; and conclusion (Ferretti et al. 2009; Nussbaum and Schraw 2007).
Students with writing difficulties tend to omit mention of the counterargument (Ferretti
et al. 2000) because of a "myside bias" (Wolfe et al. 2009). However, if prompted,
counterarguments do appear (De La Paz 2005; Ferretti et al. 2009).

Written Summarization

A summary is defined as the condensation of information to its key ideas (A. L. Brown
and Day 1983; Westby et al. 2010). Summarizing information from text requires
both reading comprehension and writing skills (Mateos et al. 2008; Shanahan
2016) and in fact, summary-writing is often used as a measure of reading
comprehension (Li and Kirby 2016; Mason et al. 2013). Based on analysis of
writing samples and think-aloud protocols, an early study (A. L. Brown and Day
1983) proposed that the writing of a summary involves several steps based on
processing information in text. The key process is analyzing the material for
importance. Upon reading a passage, the person intending to summarize it first
eliminates from consideration information considered unimportant or redundant,
then finds and paraphrases a topic sentence if it exists in the text, collapses lists
into superordinate terms, and, finally, composes a topic sentence if one is not
present in the source text.

Persuasive Essay Writing and Written Summarization of Low-Skilled Adults

The limited literature on the writing skills of low-skilled adults has all been based on
human scoring (MacArthur and Lembo 2009; MacArthur and Philippakos 2012, 2013;
MacArthur et al. 2015; Miller et al. 2015; Perin et al. 2013; Perin and Greenberg 1993;
Perin et al. 2003, 2015). These studies have found weaknesses in all areas tested. Low-
skilled college writers produce essays and summaries that are inappropriately short, and
characterized by errors of grammar, style and usage. When summarizing text, these
individuals include only a small proportion of the main ideas from the source and their
persuasive essays often contain material not pertinent to an argument. Overall, studies
have rated the quality of essays and summaries written by this population as
low, although structured strategy intervention (MacArthur et al. 2015), online guided
instruction (Miller et al. 2015), and repeated practice (Perin et al. 2013) result in
improvement.

The availability of automated scoring presents an opportunity to broaden what is
known about writing in this population through analysis of linguistic variables that
would be extremely time-consuming to rate by hand. The additional knowledge about
learners’ abilities could lead to the sharpening and differentiation of instruction to meet
students’ needs (Holtzman et al. 2005). As a step towards this goal, the current study
used the Coh-Metrix scoring engine to determine whether a previously-reported pattern
of predictiveness of the writing quality of typically-achieving college undergraduates
would also be found among lower-skilled students.

Coh-Metrix Automated Scoring

The Coh-Metrix automated scoring engine was designed to assess linguistic dimen-
sions of text difficulty, especially as an alternative to traditional (and problematic)
measures of readability (Graesser et al. 2004). The system, a compilation of linguistic
measures drawn from pre-existing automated systems, centers on the construct of
cohesion, a text characteristic comprising words, phrases and relations that appear
explicitly and permit the mental construction of connections between ideas referred
to in the text. Text cohesiveness thus permits coherence, a cognitive variable describing
how a reader makes meaning of the text (Crossley et al. 2016; Graesser et al. 2004).
Although initially used to analyze text difficulty in the context of reading comprehen-
sion, Coh-Metrix has also been applied to assessment of the quality of student writing,
especially on the assumption that it should be comprehensible to a reader (Crossley
et al. 2016; Crossley and McNamara 2009; McNamara et al. 2010). In its ability to rate
linguistic aspects of writing, Coh-Metrix overlaps with other AES systems such as e-
rater of the Educational Testing Service and Project Essay Grade of Measurement
Incorporated (Burstein et al. 2013; Shermis et al. 2016). However, while many of these
scoring engines are available through commercial vendors, Coh-Metrix is an open
source product and, further, has been utilized in a large body of research (McNamara
et al. 2014).
The assumption underlying the Coh-Metrix work is that effective expression of ideas
in writing depends on those ideas being expressed in cohesive fashion. Therefore,
writing assessed as being of high quality would be expected to score high on Coh-
Metrix variables (McNamara et al. 2010). Coh-Metrix is able to rate input text on over
150 linguistic variables, controlling for text length (Crossley et al. 2012; McNamara
et al. 2014). While writing is a complex act that calls on multiple forms of knowledge,
when using AES it may be most feasible to start with linguistic variables since they are
visible in a text (Crossley et al. 2011). Linguistic variables analyzed by Coh-Metrix are
causal cohesion, or ratio of causal verbs to causal particles; connectives, or words and
phrases that connect ideas and clauses; logical operators such as "or," "and," and
"not"; lexical overlap, comprising noun, argument, stem and content word overlap;
semantic coreferentiality, measured as weighted occurrences of words in sentences,
paragraphs, or larger portions of text; anaphoric reference, or the use of pronouns to
refer to previously-mentioned material; polysemy, which analyzes words used in a text
for multiple meanings; hypernymy, which measures the specificity of a word in relation
to the object it depicts; lexical diversity, or the number of different words written; word
frequency, i.e. the frequency of occurrence of words in a 17.9 million word corpus of
English text; word information (concreteness, familiarity, imageability, and
meaningfulness); syntactic complexity, measured in terms of the mean number of words


appearing before a main verb, the mean number of sentences and clauses, and the mean
number of modifiers per noun phrase; syntactic similarity, or the frequency of use of
any one syntactic construction; and basic text measures including the number of words
written, number of paragraphs, and mean number of sentences per paragraph (Crossley
et al. 2011; McNamara et al. 2010). All of these variables are seen as "cohesive cues"
that would be expected to characterize student writing judged to be of high quality
(McNamara et al. 2010, p. 62), develop over the school years (Crossley et al. 2011),
and distinguish the writing of native versus non-native speakers of English (Crossley
et al. 2016).
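
To make the index definitions above concrete, the following minimal Python sketch computes two of the simpler, length-controlled measures of this kind: the incidence of connectives and of logical operators per 1,000 words. It is illustrative only; the tokenizer and the short word lists are assumptions for demonstration and are not the lists or methods that Coh-Metrix itself uses.

import re

CONNECTIVES = {"because", "therefore", "however", "although", "moreover", "thus"}
LOGICAL_OPERATORS = {"and", "or", "not", "if", "then"}

def incidence_per_1000(text: str) -> dict:
    """Occurrences per 1,000 words for two illustrative word classes."""
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words) or 1
    return {
        "word_count": len(words),
        "connective_incidence": 1000 * sum(w in CONNECTIVES for w in words) / n,
        "logical_operator_incidence": 1000 * sum(w in LOGICAL_OPERATORS for w in words) / n,
    }

sample = ("Teenagers feel stress because schools demand more and more. "
          "However, not all stress is harmful, and some pressure can help.")
print(incidence_per_1000(sample))
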
In a study on which the current research is based, McNamara et al. (2010) analyzed a
corpus of persuasive essays written by college undergraduates in order to determine
whether machine-generated linguistic scores would predict human writing quality
ratings. Five groups of Coh-Metrix variables comprising 53 linguistic variables (in
version 2 of the automated engine) were hypothesized by the authors to be related to
the quality of a written text: coreference; connectives; syntactic complexity; lexical
diversity; and word frequency, concreteness and imageability. These variables were
selected for analysis because of earlier work pointing to their relation to reading
comprehension. A corpus of 120 persuasive essays written by adult students in a
freshman composition class at Mississippi State University (MSU) was analyzed.
MSU is a four-year college whose entry criteria appear typical of non-selective institu-
tions (Hoxby and Turner 2015) in terms of high school grade point average and class
standing, and test scores (see http://www.policies.msstate.edu/policypdfs/1229.pdf).
This information is worth noting because data from this group are being compared to
our current sample of community college (also known as two-year or junior college, see
Cohen et al. 2013) developmental education students, i.e., low-skilled adults who,
despite having completed secondary education, were underprepared for the college
curriculum (Bailey et al. 2010).
The MSU essays were written on a variety of self-chosen essay topics; analysis
showed no prompt effects. The writing was untimed, and experience- rather than
source-based. The authors refer to the topics as "argumentative" (McNamara et al.
2010, p. 65) but it is noted that the prompts request an opinion without specifying a full
argumentative structure, which would include counterargument and rebuttal (Ferretti
et al. 2009; Hillocks 2011). The same approach was used with the current low-skilled
sample. The corpus of 120 essays was scored by experienced, trained human raters
using a holistic six-point persuasive quality rubric used to score the College Board’s
Scholastic Aptitude Test (SAT). The rubric requires raters to consider point of view;
critical thinking; use of examples, reasons and evidence; organization and focus;
coherence and flow of ideas; skillful use of language; use of vocabulary; variety of
sentence structure; and accuracy of grammar, usage and mechanics (McNamara et al.
2010, Appendix A). The corpus was divided through a median split into high- and low-
proficiency essays. Mean essay length was 749 words for the high-proficiency essays
and 700 words for the low-proficiency essays.
Discriminant analysis was used to predict group membership (high- versus low-
proficiency). The analysis used 53 Coh-Metrix Version 2.0 indices organized in five
classes, called coreference, connectives, syntactic complexity, lexical diversity, and
word characteristics. Several measures were found reliably to predict group
membership. From these significant predictors, the authors selected those with the three
highest effect sizes: "number of words before the main verb," within the syntactic
complexity class; "measured textual lexical diversity (MTLD)," within the lexical
diversity class; and "Celex logarithm frequency including all words," within the word
characteristics class. Subsequent stepwise regression showed that the lower quality
essays displayed more high-frequency words, and the higher quality essays used a more
diverse vocabulary and more complex syntax.

The Current Study

Methods

Participants

Participants were N = 211 students enrolled in two community colleges in a southern
state. All were attending top- and intermediate-level developmental reading and writing
courses one and two levels below the freshman composition level at which students in
the McNamara et al. (2010) study were enrolled. The current sample was 54 % Black/
African-American and 64 % female. All participants were proficient speakers of
English although 25 % had a native language other than English. Mean age was
24.55 years (SD = 10.66), with 71 % aged 18–24 years. Mean reading comprehension
score on the Nelson-Denny Reading Test (a timed multiple-choice measure, J. I. Brown
et al. 1993) was at the 22nd percentile for high school seniors at the end of the school
year, and mean score on the Woodcock-Johnson III Test of Writing Fluency (a timed
measure of general sentence-writing ability requiring constructed responses based on
stimulus words, Woodcock et al. 2001) was at the 27th percentile.

Measures

Text-based persuasive essay and summary writing were group-administered by gradu-
ate assistants in 30-min tasks. The sources for the writing were two articles from the
newspaper USA Today. The articles were selected to correspond to topics taught in
high-enrollment, introductory-level, college-credit general education courses with sig-
nificant reading and writing requirements in the two colleges. College enrollment
statistics showed the highest enrollments in such courses to be psychology and
sociology. The participants had not yet taken the courses. Tables of contents of the
introductory-level psychology and sociology textbooks used at the two colleges were
examined for topics to inform a search for source text. The criteria for the selection of
articles were relevance to topics listed in the tables of contents, word count, and
readability level feasible for the participants. The two articles selected were on the
psychology topic of stress experienced by teenagers and the sociology topic of
intergenerational tensions in the workplace. The psychology topic was used for the
persuasive essay, and the sociology topic was used for the written summarization task.
Several paragraphs were eliminated from each article in order to keep the task to
30 min. (A short assessment was a condition of data collection imposed by the
colleges.) The deleted paragraphs presented examples to illustrate main points in the
articles and did not add new meaning. Two experienced teachers read the reduced-
length articles and verified that neither cohesion nor basic meaning had been lost as a
result of the reduction.
Text characteristics are shown in Table 1. Word counts for the articles were 650 for
the psychology text and 676 for the sociology text. Flesch–Kincaid readabilities,
obtained from the Microsoft Word software program, were 10th–11th grade, and the
texts measured at 1250–1340 Lexile levels, interpreted on the Lexile website
http://www.lexile.com/about-lexile/grade-equivalent/grade-equivalent-chart/ as
corresponding to an approximately 12th-grade (end-of-year) level. In the context of
the Common Core State Standards, students at the end of secondary education who can
comprehend text at a 1300 Lexile level are considered ready for college and career-
related reading (National Governors’ Association and Council of Chief State School
Officers 2010, Appendix A).
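
As a point of reference for the readability values in Table 1, the Flesch-Kincaid grade level can be computed from word, sentence, and syllable counts. The sketch below is a rough illustration only: the study's values were obtained from Microsoft Word, and the naive sentence splitter and syllable counter here are simplifying assumptions.

import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups, with a floor of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

sample = ("Employees from different generations often disagree about workplace norms. "
          "Younger workers may prefer flexible schedules, while older colleagues value routine.")
print(round(flesch_kincaid_grade(sample), 1))
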
Instructions for the persuasive essay directed students to read the newspaper article
and then write their opinion on a controversy discussed in the article (whether or not
bad behavior among teenagers could be attributed to stress). Length was not specified
for the essay. For the summary, students were instructed to read the text and then
summarize it in one or two paragraphs. Both writing samples were handwritten. On
both tasks, students were told that they could mark the article in any way they wished,
use a dictionary if desired, and were asked to write in their own words. The writing
samples were transcribed, correcting for spelling, capitalization, and punctuation in
order to reduce bias in scoring for meaning (Graham 1999; MacArthur and Philippakos
2010; Olinghouse 2008). However, we did not correct grammar because of the threat of
changing intended meanings in this corpus of generally poorly-written texts.
Each writing sample was scored for quality by trained research assistants who also
had teaching experience. A seven-point holistic scale (based on MacArthur et al. 2015)
was used for the essay and a 16-point analytic scale (based on Westby et al. 2010) for
the summary. The holistic essay quality rubric covered the clarity of expression of
ideas, the organization of the essay, the choice of words, the flow of language and
variety of sentences written, and the use of grammar. Examples of two score points are
shown in the Appendix. The analytic quality rubric used to evaluate the summaries
consisted of four 4-point scales: (1) topic/key sentence, main idea; (2) text structure
(logical flow of information); (3) gist, content (quantity, accuracy, and relevance); and
(4) sentence structure, grammar, and vocabulary. The scores on the four scales were
summed to create a summary quality score. Examples of score points are provided in
the Appendix.

Table 1 Text Characteristics

Subject Area   Topic                        Flesch-Kincaid Grade Level   Lexile   Word Count
Psychology     Teen stress                  10.5                         1250 L   650
Sociology      Intergenerational conflict   11.1                         1340 L   676

The assistants were trained and then practiced scoring training sets of 10 protocols
obtained previously in pilot testing until they reached a criterion of 80 % exact
agreement. Extensive training was needed because the writing samples were often
difficult to understand. When training was complete, pairs of assistants scored
the research set. Inter-rater reliability (Pearson product-moment correlation) was
r = .85 for both the summary and essay. Exact interscorer agreements were 77 %
for the essay and 23 % for summary quality. We attribute the low level of exact
agreement on summary quality to raters’ difficulty in following many of the
writers’ expression of ideas. Interscorer agreement within one point was 100 %
for essay quality and 44 % for the summary. Agreement within two points on the
summary rubric was 71 %.

Procedure

To investigate whether Coh-Metrix text indices had the ability to discriminate good and
poor quality writing samples produced by community college developmental education
students, a modified replication of the analysis of McNamara et al. (2010) was used on
the current persuasive essays and summaries. As the two types of writing had different
scoring rubrics, each corpus was analyzed separately. The study was conducted in two
parts. First, for each essay and summary, the three Coh-Metrix variables reported by
McNamara et al. (2010) as the strongest predictors of human-scored writing quality
(syntactic complexity, lexical diversity, and word frequency) were analyzed for the
current dataset. Second, each essay and summary was re-analyzed using the full set
of Coh-Metrix variables employed by McNamara et al. (2010).
The aim of the McNamara et al. (2010) study was to investigate the use of
discriminant analysis to develop a model to predict writing quality from Coh-
Metrix variables. Their first step was to create two groups, high and low profi-
ciency, based on a median split of human scores. They then compared these
groups using an ANOVA to identify, from the 53 Coh-Metrix variables, those
that had the largest effect sizes between high and low quality writing. After
identifying the variables, they ran correlations to establish that collinearity
would not be a problem for the discriminant analysis. They developed and
tested a discriminant model and then investigated the individual contributions of
the three strongest linguistic indices using a stepwise regression model to predict
the essay rating as a continuous variable. In our first analysis, we followed this
procedure, but using only the three strongest variables from McNamara et al. (2010). In
our second analysis, we followed their full procedure. We note that McNamara et al.
(2010) used version 2.0 of Coh-Metrix, and we used version 3.0, which was the version
available in the public domain at the time of our analysis. The five classes were the same
in the two studies. In version 3.0 the classes were called Referential Cohesion, Lexical
Diversity, Connectives, Syntactic Complexity, and Word Information and covered 52 of
the 53 variables in version 2.0.

Results

For both analyses, based on a median split of the human scores, the low-proficiency
persuasive essays had holistic scores of 0–3 and the high proficiency essays had scores
of 4–7. The low-proficiency written summaries had analytic scores of 0–8, and high
proficiency written summaries scored 9–16.

Initial Three Measures

Persuasive Essay As seen in Table 2, there was a statistically significant difference
between the two groups in terms of their average score on the essay evaluations but no
significant differences between the groups on the text indices.
Collinearity between the indices was ruled out by correlations, which ranged
from r = −.20 to .09. While there was a significant correlation between word
frequency and lexical diversity, it was relatively moderate (−.20), and was below
the .70 threshold for collinearity issues (Brace et al. 2012). Given the lack of
statistically significant differences between the two groups on the text indices
and the low number of high-proficiency essays (only 26 out of 211, with a mean
of 4.04, barely above the group's minimum score of 4), it is not surprising that
the discriminant analysis was unable to distinguish between the two groups of
essays based on the text indices. The discriminant function provided a one-group
solution, placing all the essays in the low proficiency group (df = 3, n = 211,
χ2 = 3.15, p = .37).
A stepwise regression was then performed to examine if any of the three text indices
were predictive of the human essay evaluation ratings using the continuous scores as
opposed to the dichotomous split score employed for the discriminant analysis. How-
ever, the model was not significant, producing an R2 of .01 (F(3, 207) = 0.727, p = .537),
with no significant predictors.

Written Summary As can be seen in Table 3, there was a significant difference
between the groups on the summary evaluation, and the differences on two of the text
indices, lexical diversity and word frequency, approached significance.
As with the persuasive essays, there was no collinearity between measures on the
written summaries; no correlation was above the threshold of .70 (correlations ranged
from r = −.38 to .13). Given the lack of collinearity, a discriminant analysis was
performed. This analysis did not reach a conventional level of significance (df = 3,
n = 211, χ2 = 4.73, p = .19), and the two-group solution it produced had low
accuracy, with only 55.8 % of the cases correctly classified.
Table 2 Initial Three Measures: Descriptive Statistics and ANOVA for Persuasive Essay

Variables                 Low proficiency        High proficiency      F (1, 209)   Hedges' g
                          (Score 0–3) n = 185    (Score 4–7) n = 26
Essay Evaluation          2.37 (0.62)            4.04 (0.20)           182.98*      2.84
Syntactic Complexity**    4.07 (2.34)            3.74 (1.02)           0.50         0.15
Lexical Diversity***      72.47 (20.06)          78.34 (15.24)         2.06         0.30
Word Frequency****        3.02 (0.15)            3.03 (0.09)           0.06         0.07

*p < .01, **number of words before the main verb, ***measured textual lexical diversity (MTLD),
****Celex logarithm frequency including all words

Table 3 Initial Three Measures: Descriptive Statistics and ANOVA for Written Summary

Variables               Low proficiency        High proficiency       F (1, 209)   Hedges' g
                        (Score 0–8) n = 111    (Score 9–16) n = 97
Summary Evaluation      6.10 (1.84)            10.42 (1.25)           385.62*      2.71
Syntactic Complexity    3.08 (2.45)            3.09 (1.81)            0.50         0.00
Lexical Diversity       66.73 (23.88)          72.93 (27.51)          3.03         0.24
Word Frequency          3.12 (0.14)            3.09 (0.13)            3.53         0.22

*p < .001

A stepwise regression was performed to examine whether any of the three text
indices were predictive of the human summary ratings using the continuous scores
as opposed to the dichotomous split score employed for the discriminant analysis.
While there was a significant model (F(3, 204) = 2.90, p < .05), it accounted for
only 4 % of the variance, with lexical diversity being the only statistically
significant predictor.
In summary, the first part of the study found that the three Coh-Metrix measures reported
by McNamara et al. (2010) as the strongest predictors of human-scored writing quality
in a sample of typically-achieving college undergraduates did not predict writing quality
in the current sample of developmental education students.
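
The group comparisons in Tables 2 and 3 (and later Tables 4 and 8) report Hedges' g effect sizes. A minimal sketch of that calculation is shown below, assuming the usual pooled-standard-deviation formula with the small-sample correction; applied to the Essay Evaluation row of Table 2, it reproduces the reported value of 2.84.

import math

def hedges_g(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference with the small-sample bias correction."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (mean2 - mean1) / pooled_sd
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction

# Essay Evaluation row of Table 2: low group 2.37 (0.62), n = 185; high group 4.04 (0.20), n = 26.
print(round(hedges_g(2.37, 0.62, 185, 4.04, 0.20, 26), 2))   # -> 2.84
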

Full Set of Measures

We proceeded to the second part of the study by analyzing all 52 Coh-Metrix measures
from McNamara et al.'s (2010) five classes. Following the general procedure of the
earlier study, we created a training set consisting of a randomly-selected set of 75 % of
the cases in order to establish a discriminant model, and used the remaining 25 % to
validate the model.

Persuasive Essay We found that two of the measures, representing two Coh-Metrix
classes, showed statistically significant differences between the high and low proficien-
cy groups: "argument overlap in all sentences, binary, mean," from the Referential
Cohesion class, and "type-token ratio, all words," from the Lexical Diversity class (see
Table 4). In "argument overlap," a noun, pronoun or noun phrase in a sentence refers to
a noun, pronoun or noun phrase in another sentence (Graesser et al. 2004).

Table 4 Full Set of Variables: Descriptive Statistics and ANOVA for Persuasive Essay

Variables                              Low proficiency        High proficiency      F (1, 209)   Hedges' g
                                       (Score 0–3) n = 185    (Score 4–7) n = 26
Essay Evaluation                       2.37 (0.62)            4.04 (0.20)           182.98*      2.84
Argument overlap, all sentences,       0.59 (0.21)            0.50 (0.17)           5.18*        0.44
  binary, mean
Lexical diversity, type-token ratio,   0.57 (0.08)            0.52 (0.05)           7.82*        0.65
  all words

*p < .01

The "type-token ratio" measure divides the number of unique words, i.e., types, in the
writing sample by the total word count, i.e., tokens (Crossley and McNamara 2011), as
a measure of vocabulary diversity. The correlation between these two variables,
r = −.046, was not statistically significant, indicating that collinearity would not be a
problem in the discriminant model.
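
A minimal sketch of these two indices is shown below, under simplifying assumptions: a regular-expression tokenizer and a hand-supplied noun/pronoun set rather than a syntactic parser. It illustrates the idea only and is not the Coh-Metrix implementation.

import re

def type_token_ratio(text: str) -> float:
    """Unique words (types) divided by total words (tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def argument_overlap(sentence_a: str, sentence_b: str, nouns_and_pronouns: set) -> int:
    """1 if the two sentences share at least one noun or pronoun, else 0.
    The noun/pronoun set is supplied by hand here; Coh-Metrix derives it with a parser."""
    a = set(re.findall(r"[a-z']+", sentence_a.lower()))
    b = set(re.findall(r"[a-z']+", sentence_b.lower()))
    return int(bool(a & b & nouns_and_pronouns))

essay = "Teenagers face stress at school. They say the stress affects their behavior."
print(round(type_token_ratio(essay), 2))
print(argument_overlap("Teenagers face stress at school.",
                       "They say the stress affects their behavior.",
                       {"teenagers", "they", "stress", "school", "behavior"}))
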
The discriminant function was significant (df = 2, n = 211, χ2 = 14.19,
p < .01) predicting 67 % of the training group and 66 % of the validation
group correctly (see Table 5 for the classification table). To better understand the
accuracy of the model, we calculated, for each group, recall scores, defined as the
number of correct classifications divided by the number of missed predictions
plus the number of correct predictions; precision scores, defined as the number
of correct predictions divided by the number of false positives plus correct
predictions; and F1 scores, the harmonic mean of the precision and recall scores
(Crossley et al. 2012). The classification table (Table 5) and the various accuracy
scores (Table 6) show that while the model adequately classified the low-
proficiency essays, there were many false positives in the high-proficiency
predictions, lowering both the precision and the F1 scores for that group.
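
The recall, precision, and F1 definitions above reduce to a few lines of arithmetic. The sketch below applies them to the high-proficiency training-set row of Table 5 (16 correct classifications, 4 missed, 48 false positives) and reproduces the corresponding values in Table 6.

def recall_precision_f1(correct: int, missed: int, false_positives: int):
    recall = correct / (correct + missed)
    precision = correct / (correct + false_positives)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return round(recall, 2), round(precision, 2), round(f1, 2)

print(recall_precision_f1(16, 4, 48))   # -> (0.8, 0.25, 0.38), matching Table 6
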
A stepwise regression was performed to determine whether either of the two
text indices was predictive of the human essay ratings using the continuous
scores, as opposed to the dichotomous split score employed for the discriminant
analysis. The model was statistically significant (F(2, 208) = 25.83, p < .01),
accounting for 20 % of the variance, with both predictors making a significant
contribution (see Table 7).

Written Summary We found that eight of the measures, representing three Coh-
Metrix classes, showed statistically significant differences between the high and low
proficiency groups: one from the Referential Cohesion class, two from the Lexical
Diversity class, and three from the Word Information class. Selecting the largest effect
size in each class provided the following three variables: "content word overlap,
adjacent sentences, proportional, standard deviation," from the Referential Cohesion
class; "VOCD, all words," from the Lexical Diversity class; and "familiarity for content
words, mean," from the Word Information class (see Table 8).
"Content word overlap" measures the frequency with which content words, i.e., nouns,
verbs (except auxiliary verbs), adjectives, and adverbs, are shared across sentences
(Crossley and McNamara 2011).

Table 5 Full Set of Variables for Persuasive Essay: Predicted Text Type Versus Actual Text Type

                        Predicted Text Type
Actual Text Type        Low Proficiency    High Proficiency
Training Set
  Low Proficiency       90                 48
  High Proficiency      4                  16
Validation Set
  Low Proficiency       32                 15
  High Proficiency      3                  3

Table 6 Full Set of Variables for Persuasive Essay: Recall, Precision and F1 Scores for Discriminant Analysis

Group                   Recall    Precision    F1
Training Set
  Low Proficiency       .65       .96          .78
  High Proficiency      .80       .25          .38
Validation Set
  Low Proficiency       .68       .91          .78
  High Proficiency      .50       .16          .25

"VOCD" is a measure of the richness of vocabulary usage, indicated by the number
of different words and the number of types of words in a language sample (McCarthy
and Jarvis 2007). The "familiarity for content words" measure refers to the frequency
of appearance of a content word in a large, representative sample of print (Graesser et al.
2004). The correlations between the variables ranged from r = −.09 to .07.
None were statistically significant and all were below the .70 threshold, indi-
cating that collinearity would not be a problem in the discriminant model.
The discriminant function was significant (df = 3, n = 208, χ2 = 34.13, p < .001)
predicting 70 % of the training group and 68 % of the validation group correctly (see
Table 9 for classification table). Recall, precision, and F1 scores were calculated (see
Table 10) to better understand the accuracy of the model (see above for definitions of
scores). This model’s relatively high and stable scores across both the training and
validation set indicated that it is fairly stable predictive model.
A stepwise regression was performed to examine whether any of the three text
indices were predictive of the human summary evaluation ratings using the continuous
scores as opposed to the dichotomous split score employed for the discriminant
analysis. The model was statistically significant (F(3, 205) = 19.14, p < .001), account-
ing for 22 % of the variance, with all three predictors making a statistically significant
contribution (see Table 11).
In summary, in the second part of the study we found that, for our sample of college
developmental education students, two Coh-Metrix variables predicted human holistic
scores on persuasive essays, and three predicted human analytic scores on written
summaries. There was no overlap between the current measures and those found to
predict human persuasive essay scores of McNamara et al.’s (2010) typically-achieving
college students.

Table 7 Full Set of Variables: Regression Analysis to Predict Persuasive Essay Ratings

Entry     Variable added                                    R     R2    F change   B        β      SE
Entry 1   Argument overlap, all sentences, binary, mean     .23   .05   11.18*     −0.96*   −.24   .25
Entry 2   Lexical diversity, type-token ratio, all words    .45   .20   38.48*     −3.72*   −.39   .60

*p < .001

Table 8 Full Set of Variables: Descriptive Statistics and ANOVA for Written Summary

Variables                             Low proficiency        High proficiency      F (1, 209)   Hedges' g
                                      (Score 0–8) n = 111    (Score 9–16) n = 97
Summary Evaluation                    6.10 (1.84)            10.42 (1.25)          385.62**     2.71
Content word overlap, adjacent        0.10 (0.06)            0.12 (0.05)           4.21*        0.36
  sentences, proportional,
  standard deviation
VOCD, all words                       21.09 (35.02)          48.00 (40.87)         26.03**      0.71
Familiarity for content words, mean   582.37 (8.98)          577.16 (9.40)         16.59**      0.56

*p < .05, **p < .001

Discussion

Using persuasive essays and written summaries from low-skilled, college developmen-
tal education students, this study failed to replicate relationships between Coh-Metrix
automated linguistic scores and human quality scores reported by McNamara et al.
(2010) for typically-achieving college undergraduates. When the study expanded to an
analysis of all Coh-Metrix variables analyzed by McNamara et al. (2010), ten measures
were found to predict human-scored writing quality across the two types of writing.
Two Coh-Metrix measures reliably predicted human persuasive essay quality scores
and eight predicted human summary quality scores.
The first part of the study asked whether three Coh-Metrix text indices, syntactic
complexity, measured as number of words before the main verb; lexical diversity,
measured as measured textual lexical diversity (MTLD); and word frequency, measured
as Celex logarithm frequency including all words, found by McNamara et al. (2010) to
predict human-scored writing quality of persuasive essays written by average-
performing college students, were similarly predictive in a sample of low-skilled adults.
Specifically, we wished to learn whether these linguistic variables would be able to
predict group status (low or high proficiency) for human ratings of both essays and
summaries written by a group of college students attending developmental reading and
writing courses. It was found that these particular text indices did not differ by group or
predict group membership in the low-skilled sample, and that in predicting the rating
score as a continuous variable, only lexical diversity made a significant albeit minor
contribution, accounting for only 3.6 % of the variance.

Table 9 Full Set of Variables for Written Summary: Predicted Text Type Versus Actual Text Type

                        Predicted Text Type
Actual Text Type        Low Proficiency    High Proficiency
Training Set
  Low Proficiency       54                 23
  High Proficiency      24                 53
Validation Set
  Low Proficiency       21                 12
  High Proficiency      5                  15

Table 10 Full Set of Variables for Written Summary: Recall, Precision and F1 Scores for Discriminant Analysis

Group                   Recall    Precision    F1
Training Set
  Low Proficiency       .70       .69          .70
  High Proficiency      .69       .70          .69
Validation Set
  Low Proficiency       .64       .81          .71
  High Proficiency      .75       .56          .64

When examining the means and standard deviations on the three Coh-Metrix text
indices of college students from the McNamara et al. (2010) study and the develop-
mental college students in the current sample, there are similarities in the results on the
essay scores, with the summary scores being lower (see Table 12).
However, when examining the standard deviations, it can be seen that there is a
much wider distribution in the low-proficiency groups, with many of the standard
deviations being more than twice as large. Large standard deviations have also been
reported in previous work on the writing of low-skilled students across the age span
(Carretti et al. 2016; Graham et al. 2014; MacArthur and Philippakos 2013; Perin et al.
2003, 2013). While this problem has ramifications in terms of the lack of statistical
significance in the current analysis (larger standard deviations create more overlap
between groups and make them less distinguishable from each other), the more
important issue is why there is so much variation in the essays and summaries of the
low-proficiency writers.
In essence, we propose that the low-proficiency writers in this population, as judged
by human scorers, found more diverse ways to be poor writers, as evaluated by the
Coh-Metrix text indices, than the poor writers in the McNamara et al. (2010) study.
This finding points to challenges in developing targeted interventions, as the issues that
define their poor writing are wide ranging and idiosyncratic. Current practice in
developmental education is to group this highly varied population in single classrooms
based on standardized placement scores that are not designed to be diagnostic of
specific literacy needs (Hughes and Scott-Clayton 2011), whereas their differing pattern
of skills seems to require more individualized approaches. Structured strategy instruc-
tion in writing that permits small group and individualized assistance (Kiuhara et al.

Table 11 Full Set of Variables: Regression Analysis to Predict Ratings of Written Summary

Entry     Variable added                                  R     R2    F change   B        β      SE
Entry 1   Content word overlap, adjacent sentences,       .24   .06   12.31*     9.15*    .20    2.80
          proportional, standard deviation
Entry 2   VOCD, all words                                 .42   .18   29.94*     0.02*    .31    0.01
Entry 3   Familiarity for content words, mean             .47   .22   11.23*     −0.09*   −.21   0.03

*p < .001

Table 12 Means and Standard Deviations for the McNamara et al. (2010) and Current Studies

                        McNamara et al. (2010)         Current Persuasive Essay       Current Written Summary
Variables               Low            High            Low            High            Low            High
                        Proficiency    Proficiency     Proficiency    Proficiency     Proficiency    Proficiency
Syntactic Complexity    4.06 (0.92)    4.89 (1.24)     4.07 (2.34)    3.74 (1.02)     3.08 (2.45)    3.09 (1.81)
Lexical Diversity       72.64 (10.89)  78.71 (13.19)   72.47 (20.06)  78.34 (15.24)   66.73 (23.88)  72.93 (27.51)
Word Frequency          3.17 (0.08)    3.13 (0.09)     3.02 (0.15)    3.03 (0.09)     3.12 (0.14)    3.09 (0.13)

2012), which has been found effective in experimental work in college developmental
education (MacArthur et al. 2015), seems to be a promising way to reform current
practice.
The current exploration highlights a possible utility for Coh-Metrix, or other auto-
mated scoring systems, to support developmental writing students at the college level.
Given the heterogeneity of the writing skills of students who will inevitably be grouped
for instruction under current practices, perhaps an automated scoring system could be
leveraged by instructors to identify specific writing problems. Focusing only on
difficulties identified by the automated scoring engine rather than a wider range of
skills, some of which students may already have mastered, may lead to more efficient
use of class time. Such diagnostic use of automated scores could potentially free up
instructors to focus more on content and meaning. However, further research would be
needed to test the usefulness of this idea and identify the scoring system and indices
that would best inform such differentiated instruction.
In fact, instruction in writing in the content areas prioritizes aspects of meaning
and interpretation not (yet) amenable to automated scoring. For example, science
and social studies instruction often focuses on critical thinking about controversial
issues (Sampson et al. 2011; Wissinger and De La Paz 2015). It may not be
feasible or considered useful by content-area educators to focus on the types of
linguistic variables generated in automated scoring that predict human-scored
writing quality, such as the global cohesion variable of adjacent overlap of
function words across paragraphs, a predictor of quality reported by Crossley
et al. (2016). Given limitations on instructional time, even more questionable
might be to spend time teaching students not to repeat function words, a negative
predictor of human-scored writing quality (Crossley et al. 2016). While automated
scoring may be used to generate indices of writing quality that may help identify
good and poor writers, the specific areas scored, as they stand now, may not serve
as diagnostics for instructional planning.
The methodology in the current study departed in three ways from that of McNa-
mara et al. (2010). First, the participants had different levels of writing skill (develop-
mental versus average college level). Second, the persuasive writing prompts were
different. Third, the earlier study called for experience-based writing whereas the
current investigation used text-based tasks. Writing from text versus experience may
be an especially important factor in comparing the predictiveness across the studies.
McNamara et al.’s (2010) participants wrote from experience, and specific content was
not an element considered in scoring. In contrast, the current group was directed to refer
to a text they had read when responding, and scoring took inclusion and relevance of
content into account. Since the scoring of experience-based writing focuses primarily
on style and structure of the response while assessment of text-based writing also
includes judgment of content (Magliano and Graesser 2012), the holistic scores in the
two studies may center on different constructs. Further, the expectation that the same
predictor variables would be operational in persuasive essays (as in the McNamara
et al. 2010 study) and summaries needs further exploration.
Since the Coh-Metrix measures used in the first part of the present study did
not result in a two-group solution (all of the writers were essentially in one
group), we proceeded, in the second part of our study, to analyze the full set of
measures in McNamara et al.’s (2010) five linguistic classes. Running this larger
analysis, we found that, across our two prompts, ten automated measures pre-
dicted human-scored group membership. Several themes emerge when consider-
ing this finding.
First, it is interesting that different automated measures predicted human
scores between the two prompts (persuasive essay and written summary). This
finding supports other work suggesting that individuals' writing skill, when
measured on the same construct using different prompts, is not necessarily stable
(Danzak 2011; Olinghouse and Leaird 2009; Olinghouse and Wilson 2013).
These findings suggest that future investigations of the relation between automated
and human scores should measure performance within the same samples on different
prompts and genres.
Second, there are reasons to be cautious about drawing conclusions about the
relation between the automated and human scores when, as in the current case, very
few of a large set of variables were found to be predictive. Of the 104 Coh-Metrix
variables tested (52 variables tested for each writing prompt) in the second part of our
study, only two reliably differentiated persuasive essay quality and only eight differ-
entiated written summary quality. In fact, given the large number of variables tested,
these may be chance findings. Further, trends towards any particular pattern of results
were not evident in the current dataset, evidenced by low F values (less than 25 % of
the F values were larger than 2, which is around a p value of .2) and the standard
deviations were large. The large degree of variation in the data, when the Coh-Metrix
variables were run from all of McNamara et al.’s (2010) five classes, supports the
suggestion made above that there may be many ways to write poorly.
In fact, the investigation was atheoretical in regard to writing ability, as our concern
was simply whether the automated measures were able to predict the human scores.
However, in an investigation guided by a theory of the relation between linguistic
measures and overall writing quality, the number of variables found to be reliable
predictors would not matter, because theory-based expectations about variables that
might discriminate high from low proficiency writers could be tested. In this case, even
if all of the theoretically relevant, individual automated measures were not statistically
significant predictors, the right combination could produce an effective discriminant
model to distinguish the characteristics of writing of individuals who vary in
proficiency.
Another suggestion for future directions in research on the relation between
human and automated scoring is to compare outcomes for three distinct populations
varying in writing quality: highly proficient writers such as graduate students,
average-achieving writers such as the typically-performing undergraduates studied
by McNamara et al. (2010), and low-proficiency college developmental education
writers as in the current study.
Further, the alignment of human and automated scores should be explored, as
developmental evidence on college students may show different growth patterns on
human- and machine-scored linguistic and proficiency variables (Crossley et al. 2016).
Overall, it will be important, as research in writing skills using automated scoring goes
forward, to verify the accuracy of human scores and the theoretical relevance of
automated linguistic scores, and consider how each type of evaluation might most
usefully inform classroom instruction. In this context, it will be important to compare
outcomes on the same populations for different automated scoring engines that are
becoming available. Findings for Coh-Metrix have been reported in a large number of
research publications but these are early days in this field, and the relative utility of the
various available scoring engines for improving educational outcomes remains to be
demonstrated.

Acknowledgments The writing samples analyzed in this study were collected under funding by the Bill &
Melinda Gates Foundation to the Community College Research Center, Teachers College, Columbia Univer-
sity for a project entitled "Analysis of Statewide Developmental Education Reform: Learning Assessment
Study." Special thanks to Jian-Ping Ye, Geremy Grant and Natalie Portillo for assistance with data entry.

Appendix

Persuasive Essay Rubric: Examples of Quality Score Points

Score = 3
Essay has topic and a few ideas but little elaboration. Ideas not explained well
or somewhat difficult to understand. Source text not mentioned or referred to
vaguely. Less important details rather than main ideas from the source text used
to support argument. Some ideas conveyed inaccurately. If personal experience
mentioned, largely irrelevant or mostly unclear. Organization may be weak.
Essay may lack introduction and transitions among ideas. Word choice may be
repetitive or vague. Sentences may be simple or lack variety. Errors in grammar
and usage.
Score = 5
Clear topic with related ideas supported with details and some elaboration.
Source text mentioned explicitly. Some main ideas from text used to support the
argument. Most of ideas from source text conveyed accurately. If personal
experience mentioned, mostly relevant and clear. Essay well organized, with
introduction, sequence of ideas with some transitions, and conclusion. Word
choice generally appropriate and some variety of sentences. Occasional minor errors
in grammar or usage.

Summary Rubric: Examples of Quality Score Points

Topic/key sentence, main idea
0 (None): Statements do not link to central topic.
1 (Little/few/some): Ideas link to central topic, but no topic/key sentence brings ideas
together.
2 (Many/most): Topic/key sentence states some aspect of the content, but statement is
vague.
3 (All): Topic/key sentence is minimal but clear.
4 (Best): Topic/key sentence is comprehensive.
Gist, content (quantity, accuracy, and relevance)
0 (None): Statements are not related to the passage or do not communicate information
from the passage.
1 (Little/few/some): Some information from the passage is included, but some
important ideas are missing; some ideas may be irrelevant or inaccurate.
2 (Many/most): Most information from the passage is included; some ideas may be
irrelevant or inaccurate; some information/ideas are missing.
3 (All): All relevant information from the passage is included.
4 (Best): All relevant ideas from the passage are included, accurately represented, and
appropriately elaborated.

Summary Rubric: Example of One Analytic Score Point on Each Element

Score of 3 on each of four quality indicators

– Topic/key sentence, main idea: Topic/key sentence is minimal but clear.
– Text structure: Logical flow of information.
– Gist, content (quantity, accuracy, and relevance): All relevant information from
passage included.
– Sentence structure, grammar, and vocabulary: All sentences complete, some elab-
oration and/or dependent clauses, generally appropriate word choice, some variety
of sentence structure, occasional minor errors of grammar and/or usage.

References

Abdi, A., Idris, N., Alguliyev, R. M., & Alguliyev, R. M. (2016). An automated summarization assessment
algorithm for identifying summarizing strategies. PloS One, 11(1), 1–34. doi:10.1371/journal.
pone.0145809.
Acker, S. R. (2008). Preparing high school students for college-level writing: using an e-portfolio to support a
successful transition. The Journal of General Education, 57(1), 1–15. doi:10.1353/jge.0.0012.
Bailey, T. R., Jeong, D.-W., & Cho, S.-W. (2010). Referral, enrollment, and completion in developmental
education sequences in community colleges. Economics of Education Review, 29(2), 255–270.
Brace, N., Kemp, R., & Snelgar, R. (2012). SPSS for psychologists: a guide to data analysis using (5th ed.).
New York, NY: Routledge.
Bridgeman, B. (2013). Human ratings and automated essay evaluation. In M. D. Shermis & J. Burstein (Eds.),
Handbook of automated essay evaluation: current applications and new directions (pp. 221–232). New
York, NY: Routledge.
Bridgeman, B., & Carlson, S. B. (1984). Survey of academic writing tasks. Written Communication, 1(2),
247–280. doi:10.1177/0741088384001002004.
Brown, A. L., & Day, J. D. (1983). Macrorules for summarizing texts: the development of expertise. Journal
of Verbal Learning and Verbal Behavior, 22, 1–14.
Brown, J. I., Fishco, V. V., & Hanna, G. S. (1993). The Nelson-Denny reading test, forms G and H. Itasca, IL:
Riverside/Houghton-Mifflin.
Burstein, J., Tetreault, J., & Madnani, N. (2013). The e-rater® automated essay scoring system. In M. D.
Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: current applications and new
directions. New York, NY: Routledge.
Burstein, J., Holtzman, S., Lentini, J., Molloy, H., Shore, J., Steinberg, J., … Elliot, N. (2014). Genre research
and automated writing evaluation: using the lens of genre to understand exposure and readiness in
teaching and assessing school and workplace writing. Paper presented at the National Council on
Measurement in Education (NCME), April 2014, Philadelphia, PA.
Carretti, B., Motta, E., & Re, A. M. (2016). Oral and written expression in children with reading compre-
hension difficulties. Journal of Learning Disabilities, 49(1), 65–76. doi:10.1177/0022219414528539.
Cohen, A. M., Brawer, F. B., & Kisker, C. B. (2013). The American community college (6th ed.). Boston, MA:
Wiley.
Comer, D. K., & White, E. M. (2016). Adventuring into MOOC writing assessment: challenges, results, and
possibilities. College Composition and Communication, 67(3), 318–359.
Crossley, S. A., & McNamara, D. S. (2009). Computational assessment of lexical differences in L1 and L2
writing. Journal of Second Language Writing, 18(2), 119–135.
Crossley, S. A., & McNamara, D. S. (2011). Predicting second language writing proficiency: the roles of
cohesion and linguistic sophistication. Journal of Research in Reading. doi:10.1111/j.1467-
9817.2010.01449.x.
Crossley, S. A., Weston, J. L., Sullivan, S. T. M., & McNamara, D. S. (2011). The development of writing
proficiency as a function of grade level: a linguistic analysis. Written Communication, 28(3), 282–311.
doi:10.1177/0741088311410188.
Crossley, S. A., Salsbury, T., & McNamara, D. S. (2012). Predicting the proficiency level of language learners
using lexical indices. Language Testing, 29(2), 243–263. doi:10.1177/0265532211419331.
Crossley, S. A., Kyle, K., & McNamara, D. S. (2016). The development and use of cohesive devices in L2
writing and their relations to judgments of essay quality. Journal of Second Language Writing, 32, 1–16.
doi:10.1016/j.jslw.2016.01.003.
Danzak, R. L. (2011). The integration of lexical, syntactic, and discourse features in bilingual adolescents’
writing: an exploratory approach. Language, Speech, and Hearing Services in Schools, 42(4), 491–505.
De La Paz, S. (2005). Effects of historical reasoning instruction and writing strategy mastery in culturally and
academically diverse middle school classrooms. Journal of Educational Psychology, 97(2), 139–156.
doi:10.1037/0022-0663.97.2.139.
De La Paz, S., Ferretti, R., Wissinger, D., Yee, L., & MacArthur, C. A. (2012). Adolescents’ disciplinary use
of evidence, argumentative strategies, and organizational structure in writing about historical controver-
sies. Written Communication, 29(4), 412–454. doi:10.1177/0741088312461591.
Deane, P. (2013). On the relation between automated essay scoring and modern views of the writing construct.
Assessing Writing, 18(1), 7–24. doi:10.1016/j.asw.2012.10.002.
Deane, P., & Quinlan, T. (2010). What automated analyses of corpora can tell us about students’ writing skills.
Journal of Writing Research, 2(2), 151–177. doi:10.17239/jowr-2010.02.02.4.
Elliot, N., Deess, P., Rudniy, A., & Joshi, K. (2012). Placement of students into first-year writing courses.
Research in the Teaching of English, 46(3), 285–313.
Fallahi, C. R. (2012). Improving the writing skills of college students. In E. L. Grigorenko, E. Mambrino, & D.
D. Preiss (Eds.), Writing: a mosaic of new perspectives (pp. 209–219). New York, NY: Psychology Press.
Ferretti, R. P., MacArthur, C. A., & Dowdy, N. S. (2000). The effects of an elaborated goal on the persuasive
writing of students with learning disabilities and their normally achieving peers. Journal of Educational
Psychology, 92(4), 694–702. doi:10.1037/0022-0663.92.4.694.
Ferretti, R. P., Andrews-Weckerly, S., & Lewis, W. E. (2007). Improving the argumentative writing of students
with learning disabilities: descriptive and normative considerations. Reading and Writing Quarterly,
23(3), 267–285.
Ferretti, R. P., Lewis, W. E., & Andrews-Weckerly, S. (2009). Do goals affect the structure of students’
argumentative writing strategies? Journal of Educational Psychology, 101(3), 577–589. doi:10.1037
/a0014702.
Golder, C., & Coirier, P. (1994). Argumentative text writing: developmental trends. Discourse Processes,
18(2), 187–210. doi:10.1080/01638539409544891.
Graesser, A. C., McNamara, D. S., Louwerse, M. M., & Cai, Z. (2004). Coh-Metrix: analysis of text on
cohesion and language. Behavioral Research Methods, Instruments and Computers, 36(2), 193–202.
Graham, S. (1999). Handwriting and spelling instruction for students with learning disabilities: a review.
Learning Disability Quarterly, 22(2), 78–98. doi:10.2307/1511268.
Graham, S., Hebert, M., Sandbank, M. P., & Harris, K. R. (2014). Assessing the writing achievement of young
struggling writers: application of generalizability theory. Learning Disability Quarterly (online first).
doi:10.1177/0731948714555019.
Hale, G., Taylor, C., Bridgeman, B., Carson, J., Kroll, B., & Kantor, R. (1996). A study of writing tasks
assigned in academic degree programs (RR-95-44, TOEFL-RR-54). Princeton, NJ: Educational Testing Service.
Hillocks, G. (2011). Teaching argument writing, grades 6–12: supporting claims with relevant evidence and
clear reasoning. Portsmouth, NH: Heinemann.
Holtzman, J. M., Elliot, N., Biber, C. L., & Sanders, R. M. (2005). Computerized assessment of dental student
writing skills. Journal of Dental Education, 69(2), 285–295.
Hoxby, C. M., & Turner, S. (2015). What high-achieving low-income students know about college. The
American Economic Review, 105(5), 514–517. doi:10.1257/aer.p20151027.
Hughes, K. L., & Scott-Clayton, J. (2011). Assessing developmental assessment in community colleges (CCRC
Working Paper No. 19). New York, NY: Community College Research Center, Teachers College, Columbia University.
Kiuhara, S., O’Neill, R., Hawken, L., & Graham, S. (2012). The effectiveness of teaching 10th-grade students
STOP, AIMS, and DARE for planning and drafting persuasive text. Exceptional Children, 78(3), 335–
355.
Klobucar, A., Elliot, N., Deess, P., Rudniy, O., & Joshi, K. (2013). Automated scoring in context: rapid
assessment for placed students. Assessing Writing, 18(1), 62–84. doi:10.1016/j.asw.2012.10.001.
Li, M., & Kirby, J. R. (2016). The effects of vocabulary breadth and depth on English reading. Applied
Linguistics, 36(5), 611–634. doi:10.1093/applin/amu007.
MacArthur, C. A., & Lembo, L. (2009). Strategy instruction in writing for adult literacy learners. Reading and
Writing: An Interdisciplinary Journal, 22(9), 1021–1039.
MacArthur, C. A., & Philippakos, Z. (2010). Instruction in a strategy for compare-contrast writing.
Exceptional Children, 76(4), 438–456.
MacArthur, C. A., & Philippakos, Z. A. (2012). Strategy instruction with college basic writers: a design study.
In C. Gelati, B. Arfé, & L. Mason (Eds.), Issues in writing research (pp. 87–106). Padova: CLEUP.
MacArthur, C. A., & Philippakos, Z. A. (2013). Self-regulated strategy instruction in developmental writing: a
design research project. Community College Review, 41(2), 176–195. doi:10.1177/0091552113484580.
MacArthur, C. A., Philippakos, Z. A., & Ianetta, M. (2015). Self-regulated strategy instruction in college
developmental writing. Journal of Educational Psychology, 107(3), 855–867. doi:10.1037/edu0000011.
MacArthur, C. A., Philippakos, Z. A., & Graham, S. (2016). A multicomponent measure of writing motivation
with basic college writers. Learning Disability Quarterly, 39(1), 31–43. doi:10.1177/0731948715583115.
Magliano, J. P., & Graesser, A. C. (2012). Computer-based assessment of student-constructed responses.
Behavior Research Methods (Online), 44(3), 608–621. doi:10.3758/s13428-012-0211-3.
Mason, L. H., Davison, M. D., Hammer, C. S., Miller, C. A., & Glutting, J. J. (2013). Knowledge, writing, and
language outcomes for a reading comprehension and writing intervention. Reading and Writing: An
Interdisciplinary Journal, 26(7), 1133–1158. doi:10.1007/s11145-012-9409-0.
Mateos, M., Martin, E., Villalon, R., & Luna, M. (2008). Reading and writing to learn in secondary education:
online processing activity and written products in summarizing and synthesizing tasks. Reading and
Writing: An Interdisciplinary Journal, 21, 675–697.
McCarthy, P. M., & Jarvis, S. (2007). Vocd: a theoretical and empirical evaluation. Language Testing, 24(4),
459–488. doi:10.1177/0265532207080767.
McNamara, D. S., Crossley, S. A., & McCarthy, P. M. (2010). Linguistic features of writing quality. Written
Communication, 27(1), 57–86. doi:10.1177/0741088309351547.
McNamara, D. S., Crossley, S. A., & Roscoe, R. (2013). Natural language processing in an intelligent writing
strategy tutoring system. Behavior Research Methods (Online), 45(2), 499–515. doi:10.3758/s13428-012-
0258-1.
McNamara, D. S., Graesser, A. C., McCarthy, P. M., & Cai, Z. (2014). Automated evaluation of text and
discourse with Coh-Metrix. New York, NY: Cambridge University Press.
Miller, L. C., Russell, C. L., Cheng, A.-L., & Skarbek, A. J. (2015). Evaluating undergraduate nursing
students’ self-efficacy and competence in writing: effects of a writing intensive intervention. Nurse
Education in Practice, 15(3), 174–180. doi:10.1016/j.nepr.2014.12.002.
National Center for Education Statistics. (2012). The nation’s report card: writing 2011 (NCES 2012–470).
Washington, D.C.: Institute of Education Sciences, U.S. Department of Education. Available at http://nces.
ed.gov/nationsreportcard/pdf/main2011/2012470.pdf.
National Governors’ Association and Council of Chief State School Officers. (2010). Common core state
standards: English language arts and literacy in history/social studies, science, and technical subjects.
Washington, DC: Author. Available at http://www.corestandards.org/.
Newell, G. E., Beach, R., Smith, J., & VanDerHeide, J. (2011). Teaching and learning: argumentative reading
and writing: a review of research. Reading Research Quarterly, 46(3), 273–304. doi:10.1598/RRQ.46.3.4.
Nussbaum, E. M., & Schraw, G. (2007). Promoting argument-counterargument integration in students’
writing. The Journal of Experimental Education, 76(1), 59–92.
O’Neill, P., Adler-Kassner, L., Fleischer, C., & Hall, A. (2012). Creating the framework for success in
postsecondary writing. College English, 74(6), 520–533.
Olinghouse, N. G. (2008). Student- and instruction-level predictors of narrative writing in third-grade students.
Reading and Writing: An Interdisciplinary Journal, 21(1–2), 3–26.
Olinghouse, N. G., & Leaird, J. T. (2009). The relationship between measures of vocabulary and narrative
writing quality in second- and fourth-grade students. Reading and Writing: An Interdisciplinary Journal,
22, 545–565.
Olinghouse, N. G., & Wilson, J. (2013). The relationship between vocabulary and writing quality in three genres.
Reading and Writing: An Interdisciplinary Journal, 26(1), 45–65. doi:10.1007/s11145-012-9392-5.
Parsad, B., & Lewis, L. (2003). Remedial education at degree-granting postsecondary institutions in fall
2000: Statistical analysis report (NCES 2004–010). Washington D.C.: U.S. Department of Education,
National Center for Education Statistics. Retrieved from http://nces.ed.gov/pubs2004/2004010.pdf.
Perelman, L. (2013). Critique of Mark D. Shermis & Ben Hamner, "Contrasting state-of-the-art automated
scoring of essays: analysis." Journal of Writing Assessment, 6(1), not paginated. Available at http://www.
journalofwritingassessment.org/article.php?article=69.
Perin, D., & Greenberg, D. (1993). Relationship between literacy gains and length of stay in basic education
program for health care workers. Adult Basic Education, 3(3), 171–186.
Perin, D., Keselman, A., & Monopoli, M. (2003). The academic writing of community college remedial
students: text and learner variables. Higher Education, 45(1), 19–42.
Perin, D., Bork, R. H., Peverly, S. T., & Mason, L. H. (2013). A contextualized curricular supplement for
developmental reading and writing. Journal of College Reading and Learning, 43(2), 8–38.
Perin, D., Raufman, J. R., & Kalamkarian, H. S. (2015). Developmental reading and English assessment in a
researcher-practitioner partnership (CCRC Working Paper No. 85). New York, NY: Community College
Research Center, Teachers College, Columbia University. Available at http://ccrc.tc.columbia.
edu/publications/developmental-reading-english-assessment-researcher-practitioner-partnership.html.
Ramineni, C. (2013). Validating automated essay scoring for online writing placement. Assessing Writing,
18(1), 40–61. doi:10.1016/j.asw.2012.10.005.
Reilly, E. D., Stafford, R. E., Williams, K. M., & Corliss, S. B. (2014). Evaluating the validity and
applicability of automated essay scoring in two massive open online courses. International Review of
Research in Open and Distance Learning, 15(5), 83–99.
Sampson, V., Grooms, J., & Walker, J. P. (2011). Argument-driven inquiry as a way to help students learn how
to participate in scientific argumentation and craft written arguments: an exploratory study. Science
Education, 95(2), 217–257.
Shanahan, T. (2016). Relationships between reading and writing development. In C. A. MacArthur, S.
Graham, & J. Fitzgerald (Eds.), Handbook of writing research (2nd ed., pp. 194–207). New York, NY:
Guilford.
Shermis, M. D., & Hamner, B. (2013). Contrasting state-of-the-art automated scoring of essays. In M. D.
Shermis & J. Burstein (Eds.), Handbook of automated essay evaluation: current applications and new
directions (pp. 313–346). New York, NY: Routledge.
Shermis, M. D., Burstein, J., Elliot, N., Miel, S., & Foltz, P. W. (2016). Automated writing evaluation: an
expanding body of knowledge. In C. A. MacArthur, S. Graham, & J. Fitzgerald (Eds.), Handbook of
writing research (2nd ed., pp. 395–409). New York, NY: Guilford.
Weigle, S. C. (2013). English language learners and automated scoring of essays: critical considerations.
Assessing Writing, 18(2), 85–99. doi:10.1016/j.asw.2012.10.006.
Westby, C., Culatta, B., Lawrence, B., & Hall-Kenyon, K. (2010). Summarizing expository texts. Topics in
Language Disorders, 30(4), 275–287. doi:10.1097/TLD.0b013e3181ff5a88.
Williamson, G. (2008). A text readability continuum for postsecondary readiness. Journal of Advanced
Academics, 19(4), 602–632. doi:10.4219/jaa-2008-832.
Wilson, J., Olinghouse, N. G., McCoach, D. B., Santangelo, T., & Andrada, G. N. (2016). Comparing the
accuracy of different scoring methods for identifying sixth graders at risk of failing a state writing
assessment. Assessing Writing, 27(1), 11–23. doi:10.1016/j.asw.2015.06.003.
Winerip, M. (2012, April 22). Facing a robo-grader? Just keep obfuscating mellifluously. New York Times.
Retrieved from http://www.nytimes.com/2012/04/23/education/robo-readers-used-to-grade-test-essays.html
Wissinger, D. R., & De La Paz, S. (2015). Effects of critical discussions on middle school students’ written
historical arguments. Journal of Educational Psychology (online first). doi:10.1037/edu0000043.
Wolfe, C. R. (2011). Argumentation across the curriculum. Written Communication, 28(1), 193–219.
doi:10.1177/0741088311399236.
Wolfe, C. R., Britt, M. A., & Butler, J. A. (2009). Argumentation schema and the myside bias in written
argumentation. Written Communication, 26(2), 183–209. doi:10.1177/0741088309333019.
Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). Woodcock-Johnson III tests of achievement and tests
of cognitive abilities. Itasca, IL: Riverside Publishing.
Zhang, R. (2015). A Coh-Metrix study of writings by majors of mechanic engineering in the vocational
college. Theory and Practice in Language Studies, 5(9), 1929–1934. doi:10.17507/tpls.0509.23.
