CHAPTER 30 Data Management

An important part of the research planning process is the development of a data management plan that specifies how data will be recorded, organized, reduced and analyzed. This plan begins with the research proposal, specifying the research question, hypotheses and design. Before any data are collected, the researcher must be able to identify what variables will be measured, using what instruments and units of measurement. Those who will collect data may need to be trained and reliability assessments done. Undoubtedly, some of these plans will change once the project has begun, but nothing should begin without a firm plan in place. This planning requires knowledge of data coding and format requirements, statistics and computers. The purpose of this chapter is to describe procedures for setting up data to be entered into a computer and analyzed with statistical programs.

CONFIDENTIALITY AND SECURITY OF DATA

The research proposal will include a plan for handling data, including maintaining confidentiality of participant information. All subjects should be assigned a unique ID number that is not related to their name, medical unit number, Social Security number or other personal identifier. Documents for data collection should include the subject ID only. A list of subject names, addresses or phone numbers and corresponding ID codes can be kept separate and secured from other files in case participants need to be contacted.

As part of informed consent, subjects should be assured that their personal information, data from medical records and data collected as part of the project will only be accessed as necessary for research. The institutional review board (IRB) that approves the project will want to know the type of data to be collected, the purposes for which the data will be used, who will have access to records, and what safeguards have been put in place for security and confidentiality (see Chapter 3).
Many countries have regulations in place that define these standards. In the United States, these are part of the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA).¹ In Canada, they are incorporated into the Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans.²

MONITORING SUBJECT PARTICIPATION

Throughout the project, researchers should have procedures in place to keep accurate and complete records of subject involvement. Records should indicate how many subjects were recruited and why some were not eligible, how many agreed to participate, and how many eventually did participate. Attrition should be monitored, and reasons noted if possible. Changes to the research protocol must be described. Initial group assignments and deviations from these assignments should be documented. This information is relevant to the validity of the project and will be important if the researcher wants to complete an intention to treat analysis (see Chapter 9).

STATISTICAL PROGRAMS

The number of statistical packages for use on a microcomputer has grown dramatically, many at reasonable prices. The two most commonly used programs are SPSS (Statistical Package for the Social Sciences)* and SAS (Statistical Analysis System).* SPSS has traditionally been used more for the social and behavioral sciences, although its use has increased in health care research. SAS is most useful in medicine and epidemiology as a biostatistics program. These packages, once available only on mainframes, have been adapted for use on personal computers. Many other programs are also on the market, and it would be useless to name them here as we are sure more will be published by the time you read this.

Even though these packages are all slightly different, they adhere to certain standards that are important for data management. Most programs provide a format for data entry similar to a spreadsheet.
Data may also be imported into a statistical program from a spreadsheet such as Microsoft Excel®.

DATA COLLECTION FORMS

A data recording system must be carefully developed. Typically, data are collected from each subject and recorded on a separate sheet or directly into a computer program. The subject's identification code is listed, as well as other relevant information such as the date, the individual collecting the data (if there is more than one investigator), the subject's group assignment and demographic information such as age, gender and diagnosis. If possible, all data should be listed in the order they will be included in the data file, to facilitate data entry. Figure 30.1 illustrates a data collection form for a study of two diet regimens in patients with diabetes.

The researcher must make decisions about how data will be recorded for each subject. Is there a level of precision in measurements that should be used, such as measuring to the nearest millimeter or half inch? The format for recording open-ended responses or qualitative data should be specified. If data are missing, the reason should be included. The importance of a well organized data collection scheme becomes most evident when the researcher begins to enter data into a computer. If data are not clearly recorded and in a consistent format, data entry will be a difficult and potentially error ridden process.

*SPSS Inc., 233 S. Wacker Drive, Chicago, IL 60606; SAS Institute Inc., 100 SAS Campus Drive, Cary, NC 27513.

FIGURE 30.1 Sample data collection form.

DATA CODING

An essential part of the data collection plan is the development of a scheme for recording data. Some measurements produce quantitative data, such as range of motion and blood pressure. Variables such as gender, group and race produce categorical data. Surveys and qualitative studies may produce open-ended responses that must be coded.
Types of Variables

Data can be entered as numerals or characters. Quantitative data are numeric, having values of single or multiple digits, sometimes including decimal points, and composed of only numbers. Numeric values can be preceded by a plus or minus sign, although plus signs are assumed and not entered. Character variables, called alphanumeric or string variables, are composed of letters or characters and may include digits. String variables may be letters or words that represent variable values, such as male/female or the names of states or cities. Money values can be coded for different monetary units, with or without decimal places. Variables can also be entered as dates, using one of many acceptable forms, such as MM-DD-YYYY. Date fields can be added or subtracted to determine length of time in days, weeks, months or years.

Codes for Categorical Variables

Data for categorical variables are entered as labels. For instance, if gender is a variable, we enter either male or female as the data value. Although we can enter the full label as the data, it is much easier to code these values. Using character codes, for instance, gender could be coded F for female and M for male. It is generally recommended, however, that codes be entered as numeric data to facilitate statistical analysis, such as coding 1 for female and 0 for male. For dichotomous variables it is conventional to use 1 and 0 as codes, usually signifying the absence of a trait as zero. As a pure label it does not matter whether we code gender as 1 and 0, as 1 and 2, or any other number; however, many statistical procedures will only manipulate categorical data with 1 and 0 as the category codes (see discussion of dummy variables in Chapter 24). When the research design includes group comparisons, each subject's group assignment must be identified by a code for the grouping variable.
Decisions about coding categorical variables should be made before data are collected. Codes should be used on data collection forms to expedite transfer of data to the computer.

Missing Data

It is not unusual for some pieces of data to be missing from a subject's record because of errors in recording, unavailability of information, nonresponses on surveys, or problems in data collection. To identify missing values, blanks are used as the default in most computer programs. Others have specific rules for identifying missing values, such as the use of a period in place of a missing datum. It is not advisable to use zeros to represent missing values, as zeros will be read as a number and there may be true zeros in the data. It is often useful to assign specific codes for missing values, to identify the reason for the missing information. For instance, separate codes might be used to distinguish a refusal to answer a question, a response of "Don't know," a question that was not asked, investigator error and so on. Such distinctions can be helpful for interpretation of results, especially when there are many missing data points. Missing data should be coded using numeric values that are out of range of any actual data values. For example, the code of -99 is commonly used.

DATA ENTRY

The standard structure for data entry requires that each variable is entered in a separate column, and each row represents an individual subject. Data may be typed directly into a statistical program, or they may be entered in a spreadsheet first and later imported into the statistical program. No matter how information is entered, the wise researcher will save the data often and back up the data file regularly. We have suffered with too many colleagues who have lost hours of work to take this advice lightly!
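The missing-value coding described above can be sketched in a few lines. The example below uses Python rather than SPSS or SAS purely for illustration; the scores, the extra codes -98 and -97, and the function names are all hypothetical. It shows why an out-of-range code such as -99 must be declared as missing before any statistic is computed.

```python
from statistics import mean

# Hypothetical missing-data codes. Each is out of range for a real
# blood sugar score, which can never be negative.
MISSING_CODES = {-99: "refused", -98: "don't know", -97: "not asked"}

# Hypothetical fasting blood sugar scores, one per subject.
scores = [162.0, -99, 180.0, -98, 175.0]


def valid_values(values, missing_codes):
    """Return only the values that are not missing-data codes."""
    return [v for v in values if v not in missing_codes]


def missing_reasons(values, missing_codes):
    """Count how often each reason for missing data occurs."""
    counts = {}
    for v in values:
        if v in missing_codes:
            reason = missing_codes[v]
            counts[reason] = counts.get(reason, 0) + 1
    return counts


# Treating the codes as real scores badly distorts the mean.
naive_mean = mean(scores)
# Excluding the declared missing codes gives the mean of the real scores.
clean_mean = mean(valid_values(scores, MISSING_CODES))
reasons = missing_reasons(scores, MISSING_CODES)
```

Statistical packages typically provide their own way to declare such codes as missing; the point of the sketch is only that the declaration has to happen before analysis, and that distinct codes preserve the reason each value is absent.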
When data originate as a spreadsheet, the first row in each column should contain the variable name, which will then be read by the statistical program. To facilitate this transfer, the variable names should conform to the restrictions of the statistical program. Other than this first row, all other rows in the spreadsheet should contain only data. No embedded formulas or charts should be included. If formulas are used in specific cells, they should be converted to the actual data values before transferring to a statistical program.

Variable Names

Variable names identify each data point in a file. Every variable in the file must have a unique name. When variable names are long, abbreviations can be used. As much as possible, variable names should be readily identifiable. Certain rules apply to variable names, depending on the statistical package being used. Many programs require that a variable name be no more than eight characters (numbers or letters), although more recent versions of some packages allow for longer variable names. Variable names typically must begin with a letter and have no spaces. Some programs allow hyphens, underscores, dollar signs or number signs within a variable name. Generally special characters such as |, ? and / cannot be used in variable names. For example, a pretest and posttest value for pain could be coded PAIN1 for the pretest and PAIN2 for the posttest. Researchers should be familiar with the requirements for the statistical package they use.

Variable Fields

Each row of data, representing a single subject's scores, is called a record or case. Each individual score, or variable value, is identified as a field. A case is composed of several fields of data. Fields are described according to their width, that is, the number of digits or spaces needed for the maximum possible value.
The field width is described according to the format Fw.d, where w is the total number of spaces (or field width), and d is the number of digits within the field that follow a decimal point (the F is for Format). For example, the value 7.85 takes up four spaces (including the decimal point), for a field width of F4.2. The value 3560 also takes up four spaces with no decimal places, for a field width of F4.0. The value 136.45 takes up six spaces, described as a field width of F6.2. Many programs set a default field width that can be changed by the researcher.

Labels

Because variable names must be kept short and categories are coded, it is sometimes confusing to read a printout of an analysis with many abbreviations. To facilitate reading the output, most programs will allow the researcher to specify labels for variable names and for category value codes. These labels can usually extend to 40-60 characters or longer. They allow the researcher to customize the printout in a way that will be convenient for interpretation. To make this happen, however, the researcher must take the time to type in all the labels. But it is worth the effort when reams of paper are sitting in front of you and you can't remember whether males are coded 1 or 0! Labels are not required, but with large data sets they are extremely useful. Labels may be listed in the data collection form.

Code Books

Code books are used to organize data and to catalog the order of entry of all variables. Variable names are listed with their abbreviations. Codes are listed to identify their values. Figure 30.2 shows a sample page from a code book for the study examining the effectiveness of two diet regimens on fasting blood sugar in patients with diabetes. Data were collected on the subjects' age, gender, and baseline and follow-up blood sugar levels. Two trials were performed for each test. Codes were developed for gender and group assignment.
Subjects were also asked how often they exercised and if they were compliant with their medications. The code book is a necessary reference for all those who are involved with the study, most especially those who will analyze the data. SPSS provides this information in the Variable View of the data file.

FIGURE 30.2 Sample page from a code book.

DATA CLEANING

Once data are entered into the computer, and before analyses are run, the data should be checked against the raw data to be sure there are no discrepancies or coding errors. This process is called data cleaning, and although it may be time consuming and tedious, it is essential to ensure validity of the data analysis. The data file can be printed out or displayed on a computer screen and visually checked for accuracy against the original data.

Running descriptive statistics on the data will allow the researcher to see if there are obvious discrepancies. Frequency counts should be checked for all categorical variables. The output will list all the codes for each variable and the number of times that code appears in the data. It will also indicate how many subjects are counted, and if there are missing data for that variable. This allows the researcher to determine if there are mistakes in codes, or if the variable has too few entries to be useful. For continuous variables, descriptive statistics and graphs, such as histograms or plots, should be run to analyze means, minimums and maximums, to be sure that the range of scores is appropriate. In this way, the researcher can ascertain if values out of the possible range have been entered. For instance, if the maximum blood sugar score is printed as 560, the researcher knows there is an error and can go back and correct that entry. Sometimes it is useful to sort data, reordering the subjects according to the value of a particular variable, to determine if appropriate numbers have been entered.
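The cleaning checks described above (frequency counts for categorical codes, and minimum and maximum checks for continuous scores) can be sketched as follows. This is an illustrative Python example, not the procedure of any particular statistical package; the records, the set of valid gender codes and the plausible range for blood sugar are all assumptions made up for the sketch.

```python
from collections import Counter

# Hypothetical data file: one dictionary per subject (one row per case).
records = [
    {"ID": 1, "GENDER": 1, "BASE1": 162.0},
    {"ID": 2, "GENDER": 0, "BASE1": 180.0},
    {"ID": 3, "GENDER": 2, "BASE1": 175.0},  # 2 is not a defined code
    {"ID": 4, "GENDER": 1, "BASE1": 560.0},  # implausibly high score
    {"ID": 5, "GENDER": 0, "BASE1": None},   # missing value
]

VALID_GENDER_CODES = {0, 1}   # assumed entry in the code book
BASE1_RANGE = (50.0, 400.0)   # assumed plausible range, for illustration


def frequency_counts(rows, variable):
    """Count how many times each code appears for a variable."""
    return Counter(row[variable] for row in rows)


def invalid_codes(rows, variable, valid):
    """Return the IDs of subjects whose code is not in the code book."""
    return [row["ID"] for row in rows if row[variable] not in valid]


def out_of_range(rows, variable, low, high):
    """Return the IDs of subjects whose non-missing score is out of range."""
    return [
        row["ID"]
        for row in rows
        if row[variable] is not None and not (low <= row[variable] <= high)
    ]


gender_counts = frequency_counts(records, "GENDER")
bad_gender_ids = invalid_codes(records, "GENDER", VALID_GENDER_CODES)
bad_base1_ids = out_of_range(records, "BASE1", *BASE1_RANGE)

# Minimum and maximum of the scores that are present.
present = [r["BASE1"] for r in records if r["BASE1"] is not None]
base1_min, base1_max = min(present), max(present)
n_missing = len(records) - len(present)
```

The flagged IDs tell the researcher which raw data forms to pull and recheck; the sketch deliberately reports problems rather than correcting them, since the correct value can only come from the original record.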
DATA MODIFICATION

Statistical programs include processes for data modification or transformation to create new variables or to assign new codes to existing variables. For example, we might want to compute the mean of several trials to use for data analysis. Or we might have scores for several items on a scale and want to get the sum. Perhaps a continuous variable will be converted to categories. When these types of transformations are performed, a new variable is created, and must be given a new and unique variable name.

Computing New Variables

Computing a new variable requires that some arithmetic operation be performed on the existing data. All programs use the same symbols to represent logical operations. These symbols, known as operators, are used to create expressions that are instructions to the computer. The following symbols are used for arithmetic operations: + (addition), - (subtraction), * (multiplication), / (division) and ** (exponentiation).

An expression that contains only one operator is considered a simple expression. When more than one operator is used, a compound expression is created. For instance, A**2*B/(C + 1.0) is a compound expression. This expression is equal to

\[ \frac{A^{2}B}{C + 1.0} \]

When compound expressions are used, specific rules apply to the order in which operations take place. First, all expressions within parentheses are carried out. Second, adjacent operations are carried out in the following order: (1) exponentiation, (2) division and multiplication and (3) addition and subtraction. Within each of these levels, operations proceed from left to right. Therefore, in the preceding expression, the first operation will be to complete the addition (C + 1.0) within the parentheses. Next, the value of A will be squared. This value will then be multiplied by B. Lastly, this product will be divided by the sum (C + 1.0). If the parentheses had been left out, the expression would be read differently.
Using A**2*B/C + 1.0 the expression would read

\[ \frac{A^{2}B}{C} + 1.0 \]

To illustrate the application of these arithmetic operators, we might want to compute a mean baseline and follow-up score to use for analysis for the data in Figure 30.3. To do this, we tell the computer we want to create two new variables called BASEMEAN and FOLLMEAN using the following expressions:

    BASEMEAN = (BASE1 + BASE2)/2
    FOLLMEAN = (FOLLOW1 + FOLLOW2)/2

FIGURE 30.3 Data file for a pretest-posttest design, showing original data collected as part of the study and transformed data created through computing and recoding variables. Subjects are identified by ID number. All data for each subject appear on one row in the file.

Note the importance of the parentheses, so that the sum of the two items is divided by 2, and not just the value for BASE2 or FOLLOW2. When these computations are done, the values for the new variables will appear as new columns in the data file, as shown in Figure 30.3. These new variables can now be used in statistical procedures. We could, for instance, get a difference score between BASEMEAN and FOLLMEAN, and subject these values to a t-test. This type of data modification can also be done within spreadsheet programs.

Recoding Variables

We can also use comparison operators to recode variables by specifying relationships between them. Comparison operators may be specified as symbols or letter combinations, such as < or LT (less than) and >= or GE (greater than or equal to). Comparison operators are usually used with an IF statement, which specifies a specific operation to be carried out if a given relationship exists. For instance, we have a variable called AGE in our data set (see Figure 30.2).
We can create two age groups for a comparison analysis, as follows:

    IF AGE < 30, AGEGRP = 1
    IF AGE >= 30, AGEGRP = 2

These statements illustrate how we specify values for a new variable called AGEGRP (shown in the last column in Figure 30.3). The actual method for setting recode values will depend on the statistical program.

When assigning values to a new variable, the researcher must be careful not to overlap any categories, or the computer will not be able to perform the desired functions. In addition, groupings should reflect the full range of values that is present in the data.

Statistical Procedures

Many statistical procedures also provide a mechanism for creating and saving new variables. For example, when running a factor analysis, factor scores are created for each subject on each factor. These values can be saved and used as variables in future analyses. When regression procedures are run, residual scores can be calculated and saved. Most programs require specific instructions for these options.

DATA ANALYSIS

Data collection is complete, all the data are entered and saved (and backed up!), and you are set to begin data analysis. If the research proposal was done well, you are ready to approach this phase of the research process in an organized way. It is a good idea to start by becoming familiar with the data by looking at descriptive statistics: frequencies for categorical variables and means for continuous variables. Histograms, line plots, stem-and-leaf plots or box plots are helpful to visually assess the shape of a distribution, and to identify gaps or outliers. For correlational data, scatterplots should be created to get a sense of the linearity and degree of relationship in the data. These initial steps are necessary to understand the scope of the data, and may suggest alternative statistical approaches. For example, transformations may be needed for nonlinear variables (see Appendix D).
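The computing and recoding steps described earlier in this chapter can be sketched together in Python, used here only for illustration. The variable names BASE1, BASE2, FOLLOW1, FOLLOW2, BASEMEAN, FOLLMEAN, AGE and AGEGRP come from the text; the data values, the values of A, B and C, and the function name are hypothetical.

```python
# Hypothetical records, one row per subject.
subjects = [
    {"ID": 1, "AGE": 24, "BASE1": 160.0, "BASE2": 164.0,
     "FOLLOW1": 120.0, "FOLLOW2": 125.0},
    {"ID": 2, "AGE": 30, "BASE1": 180.0, "BASE2": 190.0,
     "FOLLOW1": 150.0, "FOLLOW2": 157.0},
]


def age_group(age):
    """Recode AGE into two groups that do not overlap and leave no gap."""
    return 1 if age < 30 else 2


for s in subjects:
    # Parentheses force the sum to be formed before dividing by 2.
    s["BASEMEAN"] = (s["BASE1"] + s["BASE2"]) / 2
    s["FOLLMEAN"] = (s["FOLLOW1"] + s["FOLLOW2"]) / 2
    s["AGEGRP"] = age_group(s["AGE"])

# Without parentheses only BASE2 is divided by 2.
first = subjects[0]
wrong_mean = first["BASE1"] + first["BASE2"] / 2

# Order of operations for the compound expression discussed in the text.
A, B, C = 3.0, 4.0, 1.0
with_parens = A**2 * B / (C + 1.0)   # (A squared times B) over (C + 1.0)
without_parens = A**2 * B / C + 1.0  # (A squared times B over C), plus 1.0
```

Because the recode uses a single cut point with `<` on one side and everything else on the other, an age of exactly 30 falls into one group only, which is the non-overlap requirement the text describes.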
The next step is the culmination of all the research efforts: to apply statistical procedures to answer the research question. This is the fun part. Some helpful hints:

To make this process efficient, prepare a list of specific hypotheses, variables and appropriate statistical procedures to guide your time at the computer. Be specific. For instance, if you intend to compare two groups, specify the t-test, paired or unpaired, and which variables will be used. If you run several regressions, list which are the independent and dependent variables for each one. Then you won't have to sit at the computer, faced with columns and columns of data, and wonder where to start.

Look at the output as you generate it. Examine your findings. Often, additional questions emerge and you may choose to run further tests. For instance, you may find relationships among some variables that you did not anticipate. Groups may end up having different characteristics than planned. It may be of interest to perform certain analyses on subgroups within the data. Statistical programs provide different filtering options to select subjects according to a specified criterion. You might specify that an analysis be done only on those coded 1 for gender, or only those coded for group 1.

Finally, most statistical programs include choices for creating tables or charts directly from the data. Many of these programs provide fairly sophisticated options, with a variety of fonts and colors to customize your presentation. These charts and tables can be imported into word processing or presentation programs. Many different types of charts are usually available, and it is often helpful to try out different formats to see which presents the data best.

Be sure you save your data and output so you can play with options and prepare your project for the final phase of the process: dissemination as a journal article or presentation as a platform or poster.
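The filtering option mentioned in the hints above, selecting only those subjects who meet a specified criterion, might look like the following in Python. This is a sketch under assumed data: the cases, the GROUP variable name and the `select_cases` helper are hypothetical, and real statistical packages offer their own case-selection commands.

```python
from statistics import mean

# Hypothetical cases, with gender coded 1 for female and 0 for male.
cases = [
    {"ID": 1, "GENDER": 1, "GROUP": 1, "BASEMEAN": 162.0, "FOLLMEAN": 122.5},
    {"ID": 2, "GENDER": 0, "GROUP": 1, "BASEMEAN": 185.0, "FOLLMEAN": 153.5},
    {"ID": 3, "GENDER": 1, "GROUP": 2, "BASEMEAN": 170.0, "FOLLMEAN": 164.0},
    {"ID": 4, "GENDER": 1, "GROUP": 1, "BASEMEAN": 200.0, "FOLLMEAN": 179.0},
]


def select_cases(rows, **criteria):
    """Keep only the rows whose coded values match every criterion."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


# Analysis restricted to subjects coded 1 for gender.
females = select_cases(cases, GENDER=1)

# Analysis restricted further to group 1.
group1_females = select_cases(cases, GENDER=1, GROUP=1)

# Mean change from baseline to follow-up within the selected subgroup.
mean_change = mean(r["BASEMEAN"] - r["FOLLMEAN"] for r in group1_females)
```

The filter returns a new list and leaves the full data file untouched, so the same data can be reanalyzed with a different criterion without re-entering anything.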
Because of the seemingly overwhelming power of computers for statistical analysis, it may seem unnecessary to become proficient in statistics. The computer seems to be able to handle the job of running statistical procedures with infinite ease, and can provide answers to statistical questions without the researcher ever having to crack a formula. The days of writing out a program and searching for the misplaced semicolon are gone. Today you need a mouse and a keyboard, and once you have entered your data and variable names you have very little else to do. Most programs will guide you through analyses by clicking on the appropriate button.

This is an oversimplification of the situation, however, for two reasons. First, the researcher must know the conceptual foundations for the statistical tests that will be used to make the appropriate choices in the first place. The computer can only carry out the instructions it is given. Programs require that the researcher sort through different options that will dictate how the procedures will be carried out. Most run at default settings, that is, parameters that are set at a certain level unless they are specifically changed. For instance, to run a stepwise regression procedure, variables will be included in the equation if partial correlations reach a specific level of significance. The default setting may be .05 or .15. The analysis will run at that level unless the researcher specifies a different level in the program. In addition, there are several approaches to stepwise analysis, and these may have to be specified. Some programs print summary statistics by default, such as mean and standard deviation. Other programs may require additional options to request different information. The researcher must know how the data should be analyzed, and then must be able to instruct the computer to do so.

Second, given the enormous amount of information generated by a computer, interpretation of that output must be based on an understanding of the procedures that were run. If data are entered incorrectly, the output will be incorrect. If the data are inappropriate for a particular procedure, the computer may be able to run an analysis, but the output won't be meaningful. This situation is summed up in an important computer principle: GIGO, which means "garbage in, garbage out." The wise researcher will have sufficient knowledge of both computers and statistics to be able to make the appropriate choices and assure statistical conclusion validity for the study. When this knowledge is not sufficient, advice should be obtained from a statistical consultant.

REFERENCES

1. National Institutes of Health. HIPAA Privacy Rule: Information for researchers. Accessed August 17, 2006.
2. Canadian Institutes of Health Research, Natural Sciences and Engineering Research Council of Canada, Social Sciences and Humanities Research Council of Canada. Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans, 1998 (with 2000, 2002 and 2005 amendments). Accessed August 17, 2006.
