Policy Forum
The Policy Forum allows health policy makers around the world to discuss challenges and opportunities for improving health care in their societies.

In clinical epidemiological research, errors occur in spite of careful study design, conduct, and implementation of error-prevention strategies. Data cleaning intends to identify and correct these errors or at least to minimize their impact on study results. Little guidance is currently available in the peer-reviewed literature on how to set up and carry out cleaning efforts in an efficient and ethical way. With the growing importance of Good Clinical Practice guidelines and regulations, data cleaning and other aspects of data handling will emerge from being mainly gray-literature subjects to being the focus of comparative methodological studies and process evaluations. We present a brief summary of the scattered information, integrated into a conceptual framework aimed at assisting investigators with planning and implementation. We recommend that scientific reports describe data-cleaning methods, error types and rates, error deletion and correction rates, and differences in outcome with and without remaining outliers.

Box 1. Terms Related to Data Cleaning

Data cleaning: Process of detecting, diagnosing, and editing faulty data.

Data editing: Changing the value of data shown to be incorrect.

Data flow: Passage of recorded information through successive information carriers.

Inlier: Data value falling within the expected range.

Outlier: Data value falling outside the expected range.

Robust estimation: Estimation of statistical parameters, using methods that are less sensitive to the effect of outliers than more conventional methods.

DOI: 10.1371/journal.pmed.0020267.t001

The History of Data Cleaning

With Good Clinical Practice guidelines being adopted and regulated in more and more countries, some important shifts in clinical epidemiological research practice can be expected. One of the expected developments is an increased emphasis on standardization, documentation, and reporting of data handling and data quality. Indeed, in scientific tradition, especially in academia, study validity has been discussed predominantly with regard to study design, general protocol compliance, and the integrity and experience of the investigator. Data handling, although having an equal potential to affect the quality of study results, has received proportionally less attention. As a result, even though the importance of data-handling procedures is being underlined in good clinical practice and data management guidelines [1–3], there are important gaps in knowledge about optimal data-handling methodologies and standards of data quality. The Society for Clinical Data Management, in their guidelines for good clinical data management practices, states: “Regulations and guidelines do not address minimum acceptable data quality levels for clinical trial data. In fact, there is limited published research investigating the distribution or characteristics of clinical trial data errors. Even less published information exists on methods of quantifying data quality” [4].

Data cleaning is emblematic of the historical lower status of data quality issues and has long been viewed as a suspect activity, bordering on data manipulation. Armitage and Berry [5] almost apologized for inserting a short chapter on data editing in their standard textbook on statistics in medical research. Nowadays, whenever discussing data cleaning, it is still felt to be appropriate to start by saying that data cleaning can never be a cure for poor study design or study conduct. Concerns about where to draw the line between data manipulation and responsible data editing are legitimate. Yet all studies, no matter how well designed and implemented, have to deal with errors from various sources and their effects on study results. This problem occurs as much in experimental as in observational research and clinical trials [6,7]. Statistical societies recommend that description of data cleaning be a standard part of reporting statistical methods [8]. Exactly what to report and under what circumstances remains mostly unanswered. In practice, it is rare to find any statements about data-cleaning methods or error rates in medical publications.

Although certain aspects of data cleaning, such as statistical outlier detection and handling of missing data, have received separate attention [9–18], the data-cleaning process as a whole, with all its conceptual, organizational, logistical, managerial, and statistical-epidemiological aspects, has not been described or studied comprehensively. In statistical textbooks and non-peer-

Citation: Van den Broeck J, Argeseanu Cunningham S, Eeckels R, Herbst K (2005) Data cleaning: Detecting, diagnosing, and editing data abnormalities. PLoS Med 2(10): e267.

Copyright: © 2005 Van den Broeck et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work and source are properly cited.

Jan Van den Broeck is an epidemiologist, and Kobus Herbst is a public-health physician at the Africa Centre for Health and Population Studies, Mtubatuba, South Africa. Solveig Argeseanu Cunningham is a demographer at the University of Pennsylvania, Philadelphia, Pennsylvania, United States of America. Roger Eeckels is Professor Emeritus of Pediatrics at the Catholic University of Leuven, Leuven, Belgium.

Competing Interests: The authors have declared that no competing interests exist.

*To whom correspondence should be addressed. E-mail: jan.broeck@africacentre.ac.za

DOI: 10.1371/journal.pmed.0020267
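Box 1's contrast between conventional and robust estimation, and the recommendation that reports describe outcomes with and without remaining outliers, can be illustrated with a minimal sketch. The measurements and the screening cutoff below are invented for illustration and do not come from the article:

```python
import statistics

# Hypothetical infant weights in kg; 181.0 is an erroneous outlier
# (e.g., a unit or transcription error).
weights_kg = [3.2, 3.4, 3.1, 3.6, 3.3, 181.0]
cleaned = [w for w in weights_kg if w < 60]  # assumed screening cutoff

# Conventional estimate (mean): strongly affected by the outlier.
mean_raw = statistics.mean(weights_kg)
mean_cleaned = statistics.mean(cleaned)

# Robust estimate (median): much less sensitive to the outlier,
# so the raw and cleaned values stay close to each other.
median_raw = statistics.median(weights_kg)
median_cleaned = statistics.median(cleaned)

print(mean_raw, mean_cleaned, median_raw, median_cleaned)
```

Reporting both the raw and cleaned estimates, as the article recommends, makes the impact of remaining outliers on study results visible to the reader.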
of relationships [22]. Second, the application of these criteria can be planned beforehand, to be carried out during or shortly after data collection, during data entry, and regularly thereafter. Third, comparison of the data with the screening criteria can be partly automated and lead to flagging of dubious data, patterns, or results.

A special problem is that of erroneous inliers, i.e., data points generated by error but falling within the expected range. Erroneous inliers will often escape detection. Sometimes, inliers are discovered to be suspect if viewed in relation to other variables, using scatter plots, regression analysis, or consistency checks [23]. One can also identify some by examining the history of each data point or by remeasurement, but such examination is rarely feasible. Instead, one can examine and/or remeasure a sample of inliers to estimate an error rate [24]. Useful screening methods are listed in Box 2.

Diagnostic Phase

In this phase, the purpose is to clarify the true nature of the worrisome data points, patterns, and statistics. Possible diagnoses for each data point are as follows: erroneous, true extreme, true normal (i.e., the prior expectation was incorrect), or idiopathic (i.e., no explanation found, but still suspect). Some data points are clearly logically or biologically impossible. Hence, one may predefine not only screening cutoffs as described above (soft cutoffs), but also cutoffs for immediate diagnosis of error (hard cutoffs) [10]. Figure 2 illustrates this method. Sometimes, suspected errors will fall in between the soft and hard cutoffs, and the diagnosis will be less straightforward. In these cases, it is necessary to apply a combination of diagnostic procedures.

One procedure is to go to previous stages of the data flow to see whether a value is consistently the same. This requires access to well-archived and documented data, with justifications for any changes made at any stage. A second procedure is to look for information that could confirm the true extreme status of an outlying data point. For example, a very low score for weight-for-age (e.g., −6 Z-scores) might be due to errors in the measurement of age or weight, or the subject may be extremely malnourished, in which case other nutritional variables should also have extremely low values. Individual patients’ reports with accumulated information on related measurements are helpful for this purpose. This type of procedure requires insight into the coherence of variables in a biological or statistical sense. Again, such insight is usually available before the study and can be used to plan and program data cleaning. A third procedure is to collect additional information, e.g., question the interviewer/measurer about what may have happened and, if possible, repeat the measurement. Such procedures can only happen if data cleaning starts soon after data collection, and sometimes remeasuring is only valuable very shortly after the initial measurement. In longitudinal studies, variables are often measured at specific ages or follow-up times. With such designs, the possibility of remeasuring or obtaining measurements for missing data will often be limited to predefined allowable intervals around the target times. Such intervals can be set wider if the analysis foresees using age or follow-up time as a continuous variable.

Finding an acceptable value does not always depend on measuring or remeasuring. For some input errors, the correct value is immediately obvious, e.g., if values of infant length are noted under head circumference and vice versa. This example again illustrates the usefulness of the investigator’s subject-matter knowledge in the diagnostic phase. Substitute code values for missing data should be corrected before analysis.

During the diagnostic phase, one may have to reconsider prior expectations and/or review quality-assurance procedures. The diagnostic phase is labor intensive, and the budgetary, logistical, and personnel requirements are typically underestimated or even neglected at the study design stage. How much effort must be spent? Cost-effectiveness studies are needed to answer this question. Costs may be lower if the data-cleaning process is planned and starts early in data collection. Automated query generation and automated comparison of successive datasets can be used to lower costs and speed up the necessary steps.

Treatment Phase

After identification of errors, missing values, and true (extreme or normal) values, the researcher must decide what to do with problematic observations. The options are limited to correcting, deleting, or leaving unchanged. There are some general rules for which option to choose. Impossible values are never left unchanged, but should be corrected if a correct value can
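The two-tier screening described in the Diagnostic Phase, with soft cutoffs that flag suspect values and hard cutoffs that diagnose impossible ones outright, can be sketched as follows. The variable (adult height in cm), the cutoff values, and the `screen` function are hypothetical illustrations, not taken from the article:

```python
# Sketch of soft/hard cutoff screening: values beyond the hard cutoffs
# are logically or biologically impossible and diagnosed as errors;
# values beyond the soft cutoffs are flagged as suspect and require
# further diagnostic procedures (data-flow checks, related variables,
# remeasurement). All cutoff values here are assumed for illustration.

def screen(value, soft=(140.0, 200.0), hard=(100.0, 250.0)):
    lo_hard, hi_hard = hard
    lo_soft, hi_soft = soft
    if value < lo_hard or value > hi_hard:
        return "error"    # hard cutoff: immediate diagnosis of error
    if value < lo_soft or value > hi_soft:
        return "suspect"  # soft cutoff: diagnosis is less straightforward
    return "ok"           # inlier; may still be an erroneous inlier

heights_cm = [172.5, 199.0, 205.0, 98.0, 160.4]
flags = [(h, screen(h)) for h in heights_cm]
```

Flagged values would then move to the diagnostic and treatment phases; the "ok" category is deliberately weak, since erroneous inliers pass any single-variable screen and need consistency checks against related variables.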