
Open access, freely available online

Policy Forum

Data Cleaning: Detecting, Diagnosing, and Editing Data Abnormalities
Jan Van den Broeck*, Solveig Argeseanu Cunningham, Roger Eeckels, Kobus Herbst

The Policy Forum allows health policy makers around the world to discuss challenges and opportunities for improving health care in their societies.

Citation: Van den Broeck J, Argeseanu Cunningham S, Eeckels R, Herbst K (2005) Data cleaning: Detecting, diagnosing, and editing data abnormalities. PLoS Med 2(10): e267.

Copyright: © 2005 Van den Broeck et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work and source are properly cited.

Jan Van den Broeck is an epidemiologist, and Kobus Herbst is a public-health physician, at the Africa Centre for Health and Population Studies, Mtubatuba, South Africa. Solveig Argeseanu Cunningham is a demographer at the University of Pennsylvania, Philadelphia, Pennsylvania, United States of America. Roger Eeckels is Professor Emeritus of Pediatrics at the Catholic University of Leuven, Leuven, Belgium.

Competing Interests: The authors have declared that no competing interests exist.

*To whom correspondence should be addressed. E-mail: jan.broeck@africacentre.ac.za

DOI: 10.1371/journal.pmed.0020267

In clinical epidemiological research, errors occur in spite of careful study design, conduct, and implementation of error-prevention strategies. Data cleaning intends to identify and correct these errors, or at least to minimize their impact on study results. Little guidance is currently available in the peer-reviewed literature on how to set up and carry out cleaning efforts in an efficient and ethical way. With the growing importance of Good Clinical Practice guidelines and regulations, data cleaning and other aspects of data handling will emerge from being mainly gray-literature subjects to being the focus of comparative methodological studies and process evaluations. We present a brief summary of the scattered information, integrated into a conceptual framework aimed at assisting investigators with planning and implementation. We recommend that scientific reports describe data-cleaning methods, error types and rates, error deletion and correction rates, and differences in outcome with and without remaining outliers.

Box 1. Terms Related to Data Cleaning

Data cleaning: Process of detecting, diagnosing, and editing faulty data.

Data editing: Changing the value of data shown to be incorrect.

Data flow: Passage of recorded information through successive information carriers.

Inlier: Data value falling within the expected range.

Outlier: Data value falling outside the expected range.

Robust estimation: Estimation of statistical parameters, using methods that are less sensitive to the effect of outliers than more conventional methods.

The History of Data Cleaning

With Good Clinical Practice guidelines being adopted and regulated in more and more countries, some important shifts in clinical epidemiological research practice can be expected. One of the expected developments is an increased emphasis on standardization, documentation, and reporting of data handling and data quality. Indeed, in scientific tradition, especially in academia, study validity has been discussed predominantly with regard to study design, general protocol compliance, and the integrity and experience of the investigator. Data handling, although having an equal potential to affect the quality of study results, has received proportionally less attention. As a result, even though the importance of data-handling procedures is underlined in good clinical practice and data management guidelines [1–3], there are important gaps in knowledge about optimal data-handling methodologies and standards of data quality. The Society for Clinical Data Management, in their guidelines for good clinical data management practices, states: “Regulations and guidelines do not address minimum acceptable data quality levels for clinical trial data. In fact, there is limited published research investigating the distribution or characteristics of clinical trial data errors. Even less published information exists on methods of quantifying data quality” [4].

Data cleaning is emblematic of the historically lower status of data quality issues and has long been viewed as a suspect activity, bordering on data manipulation. Armitage and Berry [5] almost apologized for inserting a short chapter on data editing in their standard textbook on statistics in medical research. Nowadays, whenever data cleaning is discussed, it is still felt appropriate to begin by saying that data cleaning can never be a cure for poor study design or study conduct. Concerns about where to draw the line between data manipulation and responsible data editing are legitimate. Yet all studies, no matter how well designed and implemented, have to deal with errors from various sources and their effects on study results. This problem affects experimental and observational research alike, including clinical trials [6,7]. Statistical societies recommend that description of data cleaning be a standard part of reporting statistical methods [8]. Exactly what to report, and under what circumstances, remains mostly unanswered. In practice, it is rare to find any statements about data-cleaning methods or error rates in medical publications.

Although certain aspects of data cleaning, such as statistical outlier detection and the handling of missing data, have received separate attention [9–18], the data-cleaning process as a whole, with all its conceptual, organizational, logistical, managerial, and statistical-epidemiological aspects, has not been described or studied comprehensively.

PLoS Medicine | www.plosmedicine.org 0966 October 2005 | Volume 2 | Issue 10 | e267


In statistical textbooks and non-peer-reviewed literature there is scattered information, which we summarize in this paper, using the concepts and definitions shown in Box 1.

The complete process of quality assurance in research studies includes error prevention, data monitoring, data cleaning, and documentation. There are proposed models that describe total quality assurance as an integrated process [19]. However, we concentrate here on data cleaning and, as a second aim of the paper, separately describe a framework for this process. Our focus is primarily on medical research and on practical relevance for the medical investigator.

Data Cleaning as a Process

Data cleaning deals with data problems once they have occurred. Error-prevention strategies can reduce many problems but cannot eliminate them. We present data cleaning as a three-stage process, involving repeated cycles of screening, diagnosing, and editing of suspected data abnormalities. Figure 1 shows these three steps, which can be initiated at three different stages of a study. Many data errors are detected incidentally during study activities other than data cleaning. However, it is more efficient to detect errors by actively searching for them in a planned way. It is not always immediately clear whether a data point is erroneous. Many times, what is detected is a suspected data point or pattern that needs careful examination. Similarly, missing values require further examination. Missing values may be due to interruptions of the data flow or to the unavailability of the target information. Hence, predefined rules for dealing with errors and with true missing and extreme values are part of good practice. One can screen for suspect features in survey questionnaires, computer databases, or analysis datasets. In small studies, with the investigator closely involved at all stages, there may be little or no distinction between a database and an analysis dataset.

Figure 1. A Data-Cleaning Framework
(Illustration: Giovanni Maki)
DOI: 10.1371/journal.pmed.0020267.g001

The diagnostic and treatment phases of data cleaning require insight into the sources and types of errors at all stages of the study, during as well as after measurement. The concept of data flow is crucial in this respect. After measurement, research data undergo repeated steps of being entered into information carriers, extracted, transferred to other carriers, edited, selected, transformed, summarized, and presented. It is important to realize that errors can occur at any stage of the data flow, including during data cleaning itself. Table 1 illustrates some of the sources and types of errors possible in a large questionnaire survey. Most problems are due to human error. Inaccuracy of a single measurement and data point may be acceptable, and related to the inherent technical error of the measurement instrument. Hence, data cleaning should focus on those errors that are beyond small technical variations and that constitute a major shift within or beyond the population distribution. In turn, data cleaning must be based on knowledge of technical errors and of expected ranges of normal values.

Some errors deserve priority, but which ones are most important is highly study-specific. In most clinical epidemiological studies, errors that need to be cleaned at all costs include missing sex, sex misspecification, birth date or examination date errors, duplication or merging of records, and biologically impossible results. For example, in nutrition studies, date errors lead to age errors, which in turn lead to errors in weight-for-age scoring and, further, to misclassification of subjects as under- or overweight. Errors of sex and date are particularly important because they contaminate derived variables. Prioritization is essential if the study is under time pressure or if resources for data cleaning are limited.

Screening Phase

When screening data, it is convenient to distinguish four basic types of oddities: lack or excess of data; outliers, including inconsistencies; strange patterns in (joint) distributions; and unexpected analysis results and other types of inferences and abstractions (Table 1). Screening methods need not only be statistical. Many outliers are detected by perceived nonconformity with prior expectations, based on the investigator's experience, pilot studies, evidence in the literature, or common sense. Detection may even happen during article review or after publication.

What can be done to make screening objective and systematic? To allow the researcher to understand the data better, the data should be examined with simple descriptive tools. Standard statistical packages or even spreadsheets make this easy to do [20,21]. For identifying suspect data, one can first predefine expectations about normal ranges, distribution shapes, and strength of relationships [22]. Second, the application of these criteria can be planned beforehand, to be carried out during or shortly after data collection, during data entry, and regularly thereafter. Third, comparison of the data with the screening criteria can be partly automated and lead to the flagging of dubious data, patterns, or results.
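The automated comparison step can be sketched in a few lines of code. The following Python fragment is a minimal illustration of comparing records against predefined expectations and flagging dubious values; the field names, ranges, and record layout are hypothetical examples, not taken from the article:

```python
# Minimal sketch of automated screening: compare records against
# predefined expectations and flag dubious or missing values for
# later diagnosis. Field names and ranges are hypothetical.

EXPECTED_RANGES = {
    "weight_kg": (2.0, 150.0),   # soft screening limits
    "age_years": (0.0, 110.0),
}

def screen_record(record):
    """Return a list of (field, value, reason) flags for one record."""
    flags = []
    for field, (low, high) in EXPECTED_RANGES.items():
        value = record.get(field)
        if value is None:
            flags.append((field, None, "missing"))
        elif not (low <= value <= high):
            flags.append((field, value, "outside expected range"))
    return flags

records = [
    {"id": 1, "weight_kg": 63.5, "age_years": 34.0},
    {"id": 2, "weight_kg": 650.0, "age_years": 34.0},  # suspect outlier
    {"id": 3, "age_years": 41.0},                      # weight missing
]

for rec in records:
    for field, value, reason in screen_record(rec):
        print(f"record {rec['id']}: {field}={value} flagged ({reason})")
```

In practice the flagged values would be written to a query list for the diagnostic phase rather than printed, but the principle — predefined criteria applied automatically and repeatedly — is the same.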



Table 1. Issues to Be Considered during Data Collection, Management, and Analysis of a Questionnaire Study

Questionnaire stage
  Sources of problems, lack or excess of data: form missing; form double, collected repeatedly; answering box or options list left blank; more than one option selected when not allowed.
  Sources of problems, outliers and inconsistencies: correct value filled out in wrong box; not readable; writing error; answer given is out of expected (conditional) range.

Database stage
  Sources of problems, lack or excess of data: lack or excess of data carried over from questionnaire; form or field not entered; data erroneously entered twice; value entered in wrong field; inadvertent deletions and duplications during database handling.
  Sources of problems, outliers and inconsistencies: outliers and inconsistencies carried over from questionnaire; value incorrectly entered; value incorrectly changed during previous data cleaning; transformation (programming) error.

Analysis dataset stage
  Sources of problems, lack or excess of data: lack or excess of data carried over from database; data extraction or transfer error; deletions or duplications by analyst.
  Sources of problems, outliers and inconsistencies: outliers and inconsistencies carried over from database; data extraction or transfer error; sorting errors (spreadsheets); data-cleaning errors.

DOI: 10.1371/journal.pmed.0020267.t001
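Several of the database-stage problems in Table 1 — forms entered twice, fields left blank — lend themselves to automated detection. A possible sketch in Python follows; the record layout and identifiers are hypothetical:

```python
# Sketch: detect duplicated records and missing fields, two of the
# "lack or excess of data" problems listed in Table 1.
# Record layout and field names are hypothetical.
from collections import Counter

def find_duplicates(records, key="id"):
    """Return sorted IDs that occur more than once (possible double entry)."""
    counts = Counter(rec[key] for rec in records)
    return sorted(k for k, n in counts.items() if n > 1)

def find_missing(records, required):
    """Return (id, field) pairs where a required field is blank or absent."""
    return [(rec["id"], f) for rec in records
            for f in required if rec.get(f) in (None, "")]

records = [
    {"id": "A01", "sex": "F", "birth_date": "1998-03-02"},
    {"id": "A02", "sex": "",  "birth_date": "2001-11-15"},  # sex left blank
    {"id": "A01", "sex": "F", "birth_date": "1998-03-02"},  # entered twice
]

print(find_duplicates(records))
print(find_missing(records, ["sex", "birth_date"]))
```

Both checks can be rerun after every round of data entry or editing, which also helps catch errors introduced during data cleaning itself.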

A special problem is that of erroneous inliers, i.e., data points generated by error but falling within the expected range. Erroneous inliers will often escape detection. Sometimes, inliers are discovered to be suspect when viewed in relation to other variables, using scatter plots, regression analysis, or consistency checks [23]. One can also identify some by examining the history of each data point or by remeasurement, but such examination is rarely feasible. Instead, one can examine and/or remeasure a sample of inliers to estimate an error rate [24]. Useful screening methods are listed in Box 2.

Diagnostic Phase

In this phase, the purpose is to clarify the true nature of the worrisome data points, patterns, and statistics. Possible diagnoses for each data point are as follows: erroneous, true extreme, true normal (i.e., the prior expectation was incorrect), or idiopathic (i.e., no explanation found, but still suspect). Some data points are clearly logically or biologically impossible. Hence, one may predefine not only screening cutoffs as described above (soft cutoffs), but also cutoffs for immediate diagnosis of error (hard cutoffs) [10]. Figure 2 illustrates this method. Sometimes, suspected errors will fall in between the soft and hard cutoffs, and diagnosis will be less straightforward. In these cases, it is necessary to apply a combination of diagnostic procedures.

One procedure is to go back to previous stages of the data flow to see whether a value is consistently the same. This requires access to well-archived and documented data, with justifications for any changes made at any stage. A second procedure is to look for information that could confirm the true extreme status of an outlying data point. For example, a very low score for weight-for-age (e.g., −6 Z-scores) might be due to errors in the measurement of age or weight, or the subject may be extremely malnourished, in which case other nutritional variables should also have extremely low values. Individual patients' reports with accumulated information on related measurements are helpful for this purpose. This type of procedure requires insight into the coherence of variables in a biological or statistical sense. Again, such insight is usually available before the study and can be used to plan and program data cleaning. A third procedure is to collect additional information, e.g., question the interviewer/measurer about what may have happened and, if possible, repeat the measurement. Such procedures can only happen if data cleaning starts soon after data collection, and sometimes remeasuring is only valuable very shortly after the initial measurement. In longitudinal studies, variables are often measured at specific ages or follow-up times. With such designs, the possibility of remeasuring or of obtaining measurements for missing data will often be limited to predefined allowable intervals around the target times. Such intervals can be set wider if the analysis foresees using age or follow-up time as a continuous variable.

Finding an acceptable value does not always depend on measuring or remeasuring. For some input errors, the correct value is immediately obvious, e.g., if values of infant length are noted under head circumference and vice versa. This example again illustrates the usefulness of the investigator's subject-matter knowledge in the diagnostic phase. Substitute code values for missing data should be corrected before analysis.

During the diagnostic phase, one may have to reconsider prior expectations and/or review quality assurance procedures. The diagnostic phase is labor intensive, and the budgetary, logistical, and personnel requirements are typically underestimated or even neglected at the study design stage. How much effort must be spent? Cost-effectiveness studies are needed to answer this question. Costs may be lower if the data-cleaning process is planned and starts early in data collection. Automated query generation and automated comparison of successive datasets can be used to lower costs and speed up the necessary steps.

Treatment Phase

After identification of errors, missing values, and true (extreme or normal) values, the researcher must decide what to do with problematic observations. The options are limited to correcting, deleting, or leaving unchanged. There are some general rules for which option to choose. Impossible values are never left unchanged: they should be corrected if a correct value can be found, and otherwise deleted.
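The hard/soft-cutoff idea from the diagnostic phase is straightforward to program. Below is a minimal sketch: values beyond a hard cutoff are diagnosed as errors immediately, values between the soft and hard cutoffs are routed to further diagnosis, and the rest are accepted. The numeric cutoffs are hypothetical, not from the article:

```python
# Sketch of screening against hard and soft cutoffs (cf. Figure 2).
# Hard cutoffs bound what is logically or biologically possible;
# soft cutoffs bound what is expected. Values are hypothetical.

def classify(value, soft, hard):
    """Classify one data point against (low, high) soft and hard cutoffs."""
    s_lo, s_hi = soft
    h_lo, h_hi = hard
    if value < h_lo or value > h_hi:
        return "error (beyond hard cutoff)"
    if value < s_lo or value > s_hi:
        return "suspect (between soft and hard cutoffs)"
    return "acceptable"

# Hypothetical adult height in cm: soft range 140-200, hard range 100-230.
for height in (172.0, 205.0, 250.0):
    print(height, "->", classify(height, soft=(140.0, 200.0), hard=(100.0, 230.0)))
```

Records classified as "suspect" would then enter the combination of diagnostic procedures described above, rather than being edited automatically.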



For biological continuous variables, some within-subject variation and small measurement variation is present in every measurement. If a remeasurement is done very rapidly after the initial one, and the two values are close enough to be explained by these small variations alone, accuracy may be enhanced by taking the average of the two as the final value.

What should be done with true extreme values, and with values that are still suspect after the diagnostic phase? The investigator may wish to further examine the influence of such data points, individually and as a group, on analysis results before deciding whether or not to leave the data unchanged. Statistical methods exist to help evaluate the influence of such data points on regression parameters. Some authors have recommended that true extreme values should always stay in the analysis [25]. In practice, many exceptions are made to that rule. The investigator may not want to consider the effect of true extreme values if they result from an unanticipated extraneous process. This becomes an a posteriori exclusion criterion, and the data points should be reported as “excluded from analysis”. Alternatively, it may be that the protocol-prescribed exclusion criteria were inadvertently not applied in some cases [26].

Data cleaning often leads to insight into the nature and severity of error-generating processes. The researcher can then give methodological feedback to operational staff to improve study validity and the precision of outcomes. It may be necessary to amend the study protocol, regarding design, timing, observer training, data collection, and quality control procedures. In extreme cases, it may be necessary to restart the study. Programming of data capture, data transformations, and data extractions may need revision, and the analysis strategy should be adapted to include robust estimation or to do separate analyses with and without remaining outliers and/or with and without imputation.

Box 2. Screening Methods

• Checking of questionnaires using fixed algorithms.
• Validated data entry and double data entry.
• Browsing of data tables after sorting.
• Printouts of variables not passing range checks and of records not passing consistency checks.
• Graphical exploration of distributions: box plots, histograms, and scatter plots.
• Plots of repeated measurements on the same individual, e.g., growth curves.
• Frequency distributions and cross-tabulations.
• Summary statistics.
• Statistical outlier detection.

Data Cleaning as a Study-Specific Process

The sensitivity of the chosen statistical analysis method to outlying and missing values has consequences for the amount of effort the investigator wants to invest in detection and remeasurement. It also influences decisions about what to do with remaining outliers (leave unchanged, eliminate, or weight during analysis) and with missing data (impute or not) [27–31]. Study objectives codetermine the required precision of the outcome measures, the error rate that is acceptable, and, therefore, the necessary investment in data cleaning.

Longitudinal studies necessitate checking the temporal consistency of data. Plots of serial individual data, such as growth data or repeated measurements of categorical variables, often show a recognizable pattern from which a discordant data point clearly stands out. In clinical trials, there may be concerns about investigator bias resulting from the close data inspections that occur during cleaning, so that examination by an independent expert may be needed.

In small studies, a single outlier will have a greater distorting effect on the results. Some screening methods, such as examination of data tables, will be more effective, whereas others, such as statistical outlier detection, may become less valid with smaller samples. The volume of data will be smaller; hence, the diagnostic phase can be cheaper and the whole procedure more complete. Smaller studies usually involve fewer people, and the steps in the data flow may be fewer and more straightforward, allowing fewer opportunities for errors.

In intervention studies with interim evaluations of safety or efficacy, it is of particular importance to have reliable data available before the evaluations take place. There is a need to initiate and maintain an effective data-cleaning process from the start of the study.

Documentation and Reporting

Good practice guidelines for data management require transparency and proper documentation of all procedures [1–4,30]. Data cleaning, as an essential aspect of quality assurance and a determinant of study validity, should not be an exception. We suggest including a data-cleaning plan in study protocols. This plan should include budget and personnel requirements, prior expectations used to screen suspect data, screening tools, diagnostic procedures used to discern errors from true values, and the decision rules that will be applied in the editing phase.

Figure 2. Areas within the Range of a Continuous Variable Defined by Hard and Soft Cutoffs for Error Screening and Diagnosis, with Recommended Diagnostic Steps for Data Points Falling in Each Area
(Illustration: Giovanni Maki)
DOI: 10.1371/journal.pmed.0020267.g002
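The recommendation to use robust estimation, or to run separate analyses with and without remaining outliers, can be illustrated with a toy comparison of an outlier-sensitive estimate (the mean) and a robust one (the median). The sample values below are invented purely for illustration:

```python
# Sketch: compare an outlier-sensitive estimate (mean) with a robust
# one (median), and repeat the analysis with and without a remaining
# outlier. The values are invented purely for illustration.
from statistics import mean, median

values = [4.1, 3.8, 4.4, 4.0, 3.9, 11.2]   # 11.2 is a remaining outlier
kept = [v for v in values if v <= 10.0]     # analysis set without the outlier

# The mean shifts noticeably when the outlier is dropped; the median barely moves.
print(round(mean(values), 2), round(median(values), 2))  # with the outlier
print(round(mean(kept), 2), round(median(kept), 2))      # without the outlier
```

Reporting both sets of results, as recommended above, lets readers judge how much the conclusions depend on the treatment of the outlying values.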



Proper documentation should exist for each data point, including differential flagging of types of suspected features, diagnostic information, and information on the type of editing, dates, and personnel involved.

In large studies, data-monitoring and safety committees should receive detailed reports on data cleaning, and procedural feedback on study design and conduct should be submitted to a study's steering and ethics committees. Guidelines on statistical reporting of errors and their effect on outcomes in large surveys have been published [31]. We recommend that medical scientific reports include data-cleaning methods, including error types and rates, at least for the primary outcome variables, with the associated deletion and correction rates, justification for imputations, and differences in outcome with and without remaining outliers [25].

Acknowledgments

This work was generously supported by the Wellcome Trust (grants 063009/B/00/Z and GR065377).

References
1. International Conference on Harmonization (1997) Guideline for good clinical practice: ICH harmonized tripartite guideline. Geneva: International Conference on Harmonization. Available: http://www.ich.org/MediaServer.jser?@_ID=482&@_MODE=GLB. Accessed 29 July 2005.
2. Association for Clinical Data Management (2003) ACDM guidelines to facilitate production of a data handling protocol. St. Albans (United Kingdom): Association for Clinical Data Management. Available: http://www.acdm.org.uk/files/pubs/DHP%20Guidelines.doc. Accessed 28 July 2005.
3. Food and Drug Administration (1999) Guidance for industry: Computerized systems used in clinical trials. Washington (D.C.): Food and Drug Administration. Available: http://www.fda.gov/ora/compliance_ref/bimo/ffinalcct.htm. Accessed 28 July 2005.
4. Society for Clinical Data Management (2003) Good clinical data management practices, version 3.0. Milwaukee (Wisconsin): Society for Clinical Data Management. Available: http://www.scdm.org/GCDMP. Accessed 28 July 2005.
5. Armitage P, Berry G (1987) Statistical methods in medical research, 2nd ed. Oxford: Blackwell Scientific Publications. 559 p.
6. Ki FY, Liu JP, Wang W, Chow SC (1995) The impact of outlying subjects on decision of bioequivalence. J Biopharm Stat 5: 71–94.
7. Horn PS, Feng L, Li Y, Pesce AJ (2001) Effect of outliers and non-healthy individuals on reference interval estimation. Clin Chem 47: 2137–2145.
8. American Statistical Association (1999) Ethical guidelines for statistical practice. Alexandria (Virginia): American Statistical Association. Available: http://www.amstat.org/profession/index.cfm?fuseaction=ethicalstatistics. Accessed 13 July 2005.
9. Hadi AS (1992) Identifying multiple outliers in multivariate data. J R Stat Soc Ser B 54: 761–771.
10. Altman DG (1991) Practical statistics in medical research. London: Chapman and Hall. 611 p.
11. Snedecor GW, Cochran WG (1980) Statistical methods, 7th ed. Ames (Iowa): Iowa State University Press. 507 p.
12. Iglewicz B, Hoaglin DC (1993) How to detect and handle outliers. Milwaukee (Wisconsin): ASQC Quality Press. 87 p.
13. Hartigan JA, Hartigan PM (1985) The dip test of unimodality. Ann Stat 13: 70–84.
14. Welsch RE (1982) Influence functions and regression diagnostics. In: Launer RL, Siegel AF, editors. Modern data analysis. New York: Academic Press. pp. 149–169.
15. Haykin S (1994) Neural networks: A comprehensive foundation. New York: Macmillan College Publishing. 696 p.
16. SAS Institute (2002) Enterprise Miner, release 4.1 [computer program]. Cary (North Carolina): SAS Institute.
17. Myers RH (1990) Classical and modern regression with applications, 2nd ed. Boston: PWS-KENT. 488 p.
18. Wainer H, Schachts S (1978) Gapping. Psychometrika 43: 203–212.
19. Wang RY (1998) A product perspective on total data quality management. Commun ACM 41: 58–63.
20. Centers for Disease Control and Prevention (2002) Epi Info, revision 1st ed. [computer program]. Washington (D.C.): Centers for Disease Control and Prevention. Available: http://www.cdc.gov/epiinfo. Accessed 14 July 2005.
21. Lauritsen JM, Bruus M, Myatt MA (2001) EpiData, version 2 [computer program]. Odense (Denmark): EpiData Association. Available: http://www.epidata.dk. Accessed 14 July 2005.
22. Bauer UE, Johnson TM (2000) Editing data: What difference do consistency checks make? Am J Epidemiol 151: 921–926.
23. Winkler WE (1998) Problems with inliers. Washington (D.C.): Census Bureau. Research Reports Series RR98/05. Available: http://www.census.gov/srd/papers/pdf/rr9805.pdf. Accessed 14 July 2005.
24. West M, Winkler RL (1991) Database error trapping and prediction. J Am Stat Assoc 86: 987–996.
25. Gardner MJ, Altman DG (1994) Statistics with confidence. London: BMJ. 140 p.
26. Fergusson D, Aaron SD, Guyatt G, Hebert P (2002) Post-randomization exclusions: The intention to treat principle and excluding patients from analysis. BMJ 325: 652–654.
27. Allison PD (2001) Missing data. Thousand Oaks (California): Sage Publications. 93 p.
28. Twisk J, de Vente W (2002) Attrition in longitudinal studies: How to deal with missing data. J Clin Epidemiol 55: 329–337.
29. Schafer JL (1997) Analysis of incomplete multivariate data. London: Chapman and Hall. 448 p.
30. South African Medical Research Council (2000) Guidelines for good practice in the conduct of clinical trials in human participants in South Africa. Pretoria: Department of Health. 77 p.
31. Gonzalez ME, Ogus JL, Shapiro G, Tepping BJ (1975) Standards for discussion and presentation of errors in survey and census data. J Am Stat Assoc 70: 6–23.

