
DATA ANALYSIS I: DESCRIPTIVE STATISTICS

Danaida B. Marcelo, MS

What is Data Analysis?

The process of organizing and interpreting collected data to answer the research question.

Data analysis is not just computing for means, t-test values or chi-square values … it also involves the process of organizing data – preparatory work

“GIGO (Garbage In, Garbage Out)”

GIGO
▪ “Garbage In, Garbage Out”
▪ Errors are eliminated
▪ As much as possible, when data is encoded, the data should already be “clean” or error-free
▪ This is to prevent inaccuracy of result. The goal is to detect and correct errors

Steps in Data Analysis

Note: Generally, there are 3 steps:
✓ Process Data
✓ Describe Data
✓ If the study is ANALYTIC, proceed to Estimation and Hypothesis Testing / If the study is DESCRIPTIVE/QUALITATIVE, do not proceed to Estimation and Hypothesis Testing
✓ Descriptive statistics are used to describe data
✓ Inferential statistics are used in estimating or testing the hypothesis

DATA PROCESSING
• Group of activities to convert raw data to a form suitable for statistical analysis
• The goal in processing the data is to check for completeness, consistency, and accuracy of the raw data
• “It is essential that researchers develop and implement procedures to minimize data loss, identify concerns soon after data are collected, and detect and correct errors.”
• “No study is better than the quality of the data”

Steps in Data Processing

EDITING
▪ Browse questionnaire for missing answers
▪ Visual inspection of data collection forms after the interview (detect and correct errors)

CODING
▪ Assigning codes to qualitative data
▪ Using numbers to represent categories of variables
o Ex. Gender – “1” for males, “2” for females
o Ex. Exposure factor – “+” or “1” for exposed, “0” for unexposed
▪ The reason for doing this is that some statistical software cannot process data that have characters

ENCODING DATA
▪ Creation of data file – soft copy of the raw data
▪ Entering data into a computer
▪ MS Excel and database form can be used
▪ The goal is to have data in a spreadsheet

FINAL EDIT
▪ Range checks (of each variable)
▪ Consistency checks
▪ As a final step in data processing, edit after encoding; generate frequency tables for range and consistency checks
1. Detect and correct invalid values
2. Note and investigate unusual values
3. Note outliers (even if correct, their presence may have a bearing on which statistical methods to use)
4. Check reasonableness of distributions and also note their form, since that will also affect choice of statistical procedures
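
The coding and final-edit steps above can be sketched in a few lines of Python. This is only an illustration: the records, the field names, and the valid age range of 0 to 120 are assumptions made up for the example, not part of the lecture.

```python
from collections import Counter

# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"id": 1, "gender": "male", "age": 34},
    {"id": 2, "gender": "female", "age": 29},
    {"id": 3, "gender": "female", "age": 290},  # likely a typing error
    {"id": 4, "gender": "male", "age": None},   # missing answer
]

# CODING: numbers represent categories (1 = male, 2 = female).
GENDER_CODES = {"male": 1, "female": 2}

# Assumed valid range for age in this hypothetical study.
AGE_RANGE = (0, 120)


def code_record(record):
    """Return a copy of the record with gender replaced by its numeric code."""
    coded = dict(record)
    coded["gender"] = GENDER_CODES.get(record["gender"])
    return coded


def range_check(records, field, low, high):
    """Return ids of records whose value is missing or outside [low, high]."""
    flagged = []
    for rec in records:
        value = rec[field]
        if value is None or not (low <= value <= high):
            flagged.append(rec["id"])
    return flagged


coded_records = [code_record(r) for r in raw_records]

# FINAL EDIT: a frequency table of the coded variable plus a range check.
gender_freq = Counter(r["gender"] for r in coded_records)
flagged_ids = range_check(coded_records, "age", *AGE_RANGE)

print(dict(gender_freq))
print(flagged_ids)
```

Flagged records are only candidates for review; whether a value is an error or a true outlier still has to be checked against the original data collection form.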
Page 1 of 13
Steps in data processing: EPI-INFO Data Entry Form

*This is how we encode data using this form

EXAMPLE OF A DATA FILE

• Use Microsoft Excel to create a spreadsheet format of the data file, where the rows represent the respondents and the columns represent the variables
• The first column is usually used for the identity of the respondents
• Codes are used here rather than names; observe in this example how the numbers are used in certain variables (e.g. sex, school, subtype of ADHD/ODCD)

Data Analysis Part 1: DESCRIBING DATA (VARIABLES)

VARIABLE
• A characteristic that changes from subject to subject, or a characteristic that changes within a subject from time to time
• Variable = “varying” or “variation”
Examples:
• demographic characteristics: age, gender, religion, civil status, socioeconomic status, occupation
• clinical characteristics: weight, height, disease condition (positive or negative), blood pressure, parity
• epidemiological characteristics: illness onset, residence, risk factor exposures (cigarette smoking, alcohol consumption, low vegetable intake, use of oral contraceptives), presence of genetic mutation
• The opposite of a variable is a CONSTANT
✓ Typically this is not a concern in research
• Study participants should always be described according to their sociodemographic characteristics

Classification of Variables according to place in relationship

*To analyze a relationship, we also have to take note of the extraneous variables

❖ INDEPENDENT VARIABLE
✓ The variable under investigation which is hypothesized to have an effect on the outcome
✓ The variable assumed to have an effect on the dependent variable

❖ DEPENDENT VARIABLE
✓ The outcome of interest in the investigation (outcome variable)

❖ EXTRANEOUS VARIABLES
✓ Also called confounding variable, confounder, mediating variable, or intervening variable
✓ Variables that are related to both the independent and the dependent variable
✓ This may affect the relationship between independent and dependent variables
✓ This warrants separate analyses for the different categories created by the confounders, such as young age group vs. middle age group vs. elderly age group
✓ Effect modifiers modify the effect of IV according to the categories of the modifier; magnitude of effect is dependent on the value or level of the 3rd variable
o Example: gender – among females, those who are stressed are twice as likely to get depressed; while in males, there is no association

confounders – variables that are related to both the independent and the dependent variable. This may affect the relationship between independent and dependent variables

modifiers – variables that modify the effect of the independent variable; magnitude of the effect is dependent on the level of the third variable

Classification of Variables according to scales of measurement

Note: Aside from identifying the variables according to their role in the hypothesized relationship, it is also important to define how you are going to measure these variables. It is imperative to define the variables that will be included in a research or study according to their scale of measurement. The scale of measurement is one factor considered when choosing appropriate statistical techniques when analyzing study results. The 4 basic scales of measurement are the nominal, ordinal, interval and ratio scale. In some books, these four are termed levels of measurement instead of scales of measurement. That is because from nominal to ratio there is an increasing refinement of measurement.

❖ NOMINAL
✓ Uses names, numbers, or other symbols to assign each measurement to one of a limited number of categories
✓ The data cannot be arranged in an ordering scheme (i.e. low to high; minimum to maximum) because the values are just labels
✓ These values are categories that should be exhaustive and mutually exclusive
▪ Exhaustive – there should be enough categories for all observations
▪ Mutually exclusive – each measurement will fall into only one category
✓ Most of the sociodemographic variables that we collect to describe the patients or subjects included in the study are nominal variables
✓ In clinical research, it is common to see dependent variables in nominal scale with 2 categories
o Examples: Name, gender, civil status, blood type, religion, degree program, dichotomous or binary variables (e.g. absence/presence of disease), smoking status (e.g. smoker, non-smoker)
✓ EPI is concerned with binary or dichotomous variables: with or without the factor, with or without the disease, with or without the outcome, yes or no
*Values are only categorical and cannot be ordered

❖ ORDINAL
✓ Categories can be ordered but differences between data values either cannot be determined or are meaningless
✓ Example: Patient status (0 = worse, 1 = stable, 2 = improved); values cannot be subtracted because 0, 1, 2 are just labels and their differences are meaningless
✓ Includes the characteristics of the nominal scale (categories)
✓ Assigns each measurement to one of a limited number of categories and ranks them in graded order
✓ Difference between “worse” and “stable” is not measurable and at the same time is not equal to the difference between “stable” and “improved”
✓ The stage of cancer is also an example of an ordinal variable – stage 1 is the mildest stage while stage 4 is the most severe – but the differences between stages are not measurable and are not equal
o Examples: Disease severity (mild, moderate, severe), Degree of difficulty of exam questions (easy, intermediate, difficult), APGAR score, Pain score and satisfaction score (0-10)
*Values are categorical and can be ordered because they have an inherent order, but differences between the values cannot be computed

❖ INTERVAL
✓ Like an ordinal scale, with the additional property of having equal intervals; meaningful amounts of differences between data can be determined
✓ Numbers can be arranged in order (minimum to maximum, maximum to minimum)
✓ Unlimited number of values that are equally spaced
✓ There is no inherent (natural or true) zero starting point; the zero is just an arbitrary measure – therefore, cannot multiply nor divide
✓ Temperature measured in degrees Celsius: 0 degrees Celsius does not mean absence of temperature; it is actually the freezing point of water
✓ Examples: IQ score, Temperature (in ˚C): 30˚C − 20˚C = 10˚C and 40˚C − 30˚C = 10˚C (can subtract). But we cannot say that 40˚C is twice as hot as 20˚C because of the absence of a true zero point (40 is hot, 20 is cold)

❖ RATIO
✓ Like the interval level modified to include the inherent zero starting point
✓ 0 = absence of the characteristic, e.g. 0 BP = dead patient
✓ For values at this level, differences and ratios are meaningful
✓ Examples: Weight measured in kg (if a person is 80 kg in weight, then we can say he is twice as heavy as a person of 40 kg weight), Height in cm, Number of patients seen in a day, Diastolic and systolic blood pressure (in mmHg), Hemoglobin level (g/dl), Length of survival time

- It is important to consider the highest form of measurement when collecting data
- Nominal attributes are only named – WEAKEST SCALE OF MEASUREMENT
- Ordinal attributes can be ordered
- Interval attributes are like ordinal but can compute for differences
- Ratio attributes are like interval but can compute for ratios because there is an absolute zero

- At lower levels of measurement, assumptions tend to be less restrictive and data analyses tend to be less sensitive. At each level up the hierarchy, the current level includes all of the qualities of the one below it and adds something new. In general, it is desirable to have a higher level of measurement (e.g., interval or ratio) rather than a lower one (nominal or ordinal).
• Certain variables can be measured in various scales:
▪ Example 1: Age
❖ Young or old = NOMINAL
❖ In years = RATIO
▪ Example 2: Hemoglobin level
o Normal/abnormal = NOMINAL
o In g/dl = RATIO
• In an ideal research, data is collected in its highest form of scale of measurement

REVIEW: (from Professor’s presentation notes)
In nominal measurement the numerical values just "name" the attribute uniquely. No ordering of the cases is implied.
(For example, jersey numbers in basketball are measures at the nominal level. A player with number 30 is not more of anything than a player with number 15, and is certainly not twice whatever number 15 is.)

In ordinal measurement the attributes can be rank-ordered. Here, distances between attributes do not have any meaning. (For example, on a survey you might code Educational Attainment as 0=less than H.S.; 1=some H.S.; 2=H.S. degree; 3=some college; 4=college degree; 5=post college.) In this measure, higher numbers mean more education. But is distance from 0 to 1 same as 3 to 4? Of course not. The interval between values is not interpretable in an ordinal measure.

In interval measurement the distance between attributes does have meaning. For example, when we measure temperature (in Fahrenheit), the distance from 30-40 is same as distance from 70-80. The interval between values is interpretable. Because of this, it makes sense to compute an average of an interval variable, where it doesn't make sense to do so for ordinal scales. But note that in interval measurement ratios don't make any sense - 80 degrees is not twice as hot as 40 degrees (although the attribute value is twice as large).

Finally, in ratio measurement there is always an absolute zero that is meaningful. This means that you can construct a meaningful fraction (or ratio) with a ratio variable.
(Weight is a ratio variable. In applied social research most "count" variables are ratio, for example, the number of clients in past six months. Why? Because you can have zero clients and because it is meaningful to say that "...we had twice as many clients in the past six months as we did in the previous six months.")

Qualitative and Quantitative Variables

QUALITATIVE DATA
✓ Variables that are categorized (categorical values) simply to label or distinguish one group from another
▪ Nominal
▪ Ordinal

QUANTITATIVE DATA
✓ Variables that can be expressed numerically
▪ Interval
▪ Ratio

QUANTITATIVE DATA: DISCRETE OR CONTINUOUS

❖ DISCRETE VARIABLES
✓ Have integer values or whole numbers
✓ Number of cigarettes smoked, Number of live births
❖ CONTINUOUS VARIABLES
✓ Have values on a continuum (with decimal digits)
✓ Weight in kg, Hemoglobin level in g/dl

REVIEW: STEPS IN DATA ANALYSIS
• Descriptive statistics
➢ Describing the nature and characteristics of the data
➢ “Aspects of organization, presentation, and summarization of data”
➢ “To organize, to summarize observations so that they are easier to comprehend”

• Inferential statistics
➢ Concerned with analysis of data from a sample leading to predictions or associations (inferences) about the target population
➢ Hypothesis testing

SUMMARIZING THE DATA

Frequency Distribution

• Display the values a variable can take and the number of persons or records with each value
*list then count
*listing the values of gender, then counting how many of each, then computing the percentage

Frequency Distributions: Nominal Scales

• When creating a frequency distribution for a nominal scale, describe the participants according to sociodemographic characteristics
• It is also important to arrange the categories in order of frequency, usually from the highest to lowest

Frequency Distributions: Ordinal Scales

Note: arrange categories according to logical order of the categories

• When creating a frequency distribution for an ordinal scale, do not arrange the categories according to frequency. Instead, arrange them according to the order of the variable
• If the variable is at least ordinal, you could add another column for easy interpretation, such as a cumulative percentage column.
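
The "list then count" idea for nominal and ordinal variables can be sketched as below. The function name `frequency_table` and both data sets are hypothetical, chosen only to illustrate the two ordering rules (by frequency for nominal, by logical order with a cumulative percentage for ordinal).

```python
from collections import Counter


def frequency_table(values, order=None):
    """Build (category, count, percent, cumulative percent) rows.

    With order=None the variable is treated as nominal: categories are
    sorted from most to least frequent and no cumulative percent is given.
    With an explicit order the variable is treated as ordinal.
    """
    counts = Counter(values)
    total = len(values)
    if order is None:
        categories = [category for category, _ in counts.most_common()]
    else:
        categories = list(order)

    rows = []
    cumulative = 0.0
    for category in categories:
        count = counts.get(category, 0)
        percent = 100.0 * count / total
        cumulative += percent
        cum = round(cumulative, 1) if order is not None else None
        rows.append((category, count, round(percent, 1), cum))
    return rows


# Hypothetical data, for illustration only.
blood_types = ["O", "A", "O", "B", "O", "A", "AB", "O"]
severity = ["mild", "severe", "mild", "moderate",
            "mild", "moderate", "mild", "mild"]

nominal_rows = frequency_table(blood_types)
ordinal_rows = frequency_table(severity, order=["mild", "moderate", "severe"])

for row in nominal_rows + ordinal_rows:
    print(row)
```

The cumulative column is left empty for the nominal variable on purpose, since "less than or equal to" has no meaning when the categories have no order.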

NOTE: How to compute for Cumulative %?
Just add the succeeding percentages, like:
25.0 + 65.2 = 90.2
90.2 + 8.9 = 99.1
99.1 + 0.9 = 100.0

90.2% of the respondents have good quality of life
When summarizing the table, you don’t need to enumerate the percentages; just highlight the important one.

• A relative cumulative frequency can be added only if the variable is at least in ordinal scale
• The RELATIVE CUMULATIVE FREQUENCY (i.e. “accumulate”) for a value of a variable is the proportion/percentage of elements in the sample with values less than or equal to that value
• Cumulative percentage is calculated by addition of the succeeding percentages of the categories

Frequency Distributions: Quantitative Continuous

- There is an unlimited number of values in a continuum

NOTE: What to do?
Instead of displaying each value of the quantitative continuous variable, we create class intervals and then construct the frequency distribution table

• CLASS INTERVALS
✓ Categories made from quantitative data
✓ Sets defined by a lower limit and an upper limit

GROUPED Frequency Distributions - Quantitative Continuous

Quantitative Continuous Variables

HISTOGRAM

• A histogram is needed in quantitative continuous data to determine the shapes of distribution; it displays the values of the variable on the x-axis, and frequencies on the y-axis
• With a frequency distribution table of quantitative continuous data, a histogram or a frequency polygon can be created to describe the quantitative continuous variables
• This is translating summarized data into graphs
• In data analysis, it is very important to describe the shapes of distribution that the graphs create

Shapes of Distribution

• Refers to the skewness of the distribution
▪ A is a positively skewed distribution, B is a normal distribution, C is a negatively skewed distribution
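
As a rough sketch of the class intervals and histogram described above, the snippet below groups hypothetical ages into equal-width intervals and prints a text histogram. The helper `grouped_frequency`, the ages, and the choice of 10-year classes starting at 20 are all assumptions for illustration.

```python
def grouped_frequency(values, start, width, n_classes):
    """Count values falling in each class interval [lower, lower + width).

    Values outside the covered range are ignored in this sketch.
    """
    counts = [0] * n_classes
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_classes:
            counts[index] += 1
    table = []
    for i, count in enumerate(counts):
        lower = start + i * width
        table.append((lower, lower + width, count))
    return table


# Hypothetical ages in years, for illustration only.
ages = [21, 25, 33, 38, 39, 41, 44, 47, 52, 58, 23, 36]

table = grouped_frequency(ages, start=20, width=10, n_classes=4)

# Text histogram: class intervals on one axis, frequencies as bar lengths.
for lower, upper, count in table:
    print(f"{lower}-{upper - 1}: {'*' * count} ({count})")
```

The shape of the bars is what gets described as symmetric or skewed; the number and width of the classes is a judgment call that changes how that shape looks.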
Symmetrical shape
✓ Expect a “bell-shaped” curve
Skewness
✓ Measure of lack of symmetry or the lopsidedness of a distribution
✓ One “tail” of the distribution is longer than the other
✓ May be positively skewed or negatively skewed
✓ Important consideration in choosing a statistical tool

Normal distribution
✓ Shows a “bell-shaped” curve
Negatively skewed distribution
✓ Tail is on the left
✓ Happens when extremely low values are present
Positively skewed distribution
✓ Tail is on the right
✓ Happens when extremely high values are present

Histograms show 3 characteristics of the distribution:
➢ SHAPE - whether it is more or less symmetrically distributed on the two sides of the peak
➢ PEAK - where the distribution has its peak; central location of the distribution
➢ DISPERSION - how widely dispersed it is on both sides of the peak (first and last occurrence)

This graph reveals 3 features:
• Where the distribution has its peak (central location)
• How widely dispersed it is on both sides of the peak (spread)
• Whether it is more or less symmetrically distributed on the two sides of the peak

Measures of Central location/tendency

• A measure of central tendency is the one that best represents an entire group of values or distribution
• Typical value or a value which is somewhere near the center of the distribution
• We compute the: Mean, Median, Mode

MEAN
• Arithmetic mean (average)
• Sum of all values divided by the total number of observations

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Consider the following data on the duration of sleep (in hours) of 15 first year medical students the night before a long exam:

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Mean = (2+4+6+3.5+2.5+6.5+1.5+5.5+3+1.5+4.5+3.7+4.2+1.6+3.4)/15
Mean = 53.4/15 = 3.56

• The mean is very sensitive to extreme values – mean will be pulled towards the outlier
- If there are extremely high or low values, do not use the mean; use the median

MEDIAN

• The midpoint value (meaning the middle) in a set of ordered values or measurements
• A measure which can divide data into two
• Point at which one-half, or 50%, of the values fall above and one-half, or 50%, fall below
✓ A positional measure

Steps in determining the median:
o Arrange the values or observations from smallest to largest (or vice versa)
o Find the middle value
✓ If the number of observations is odd, the middle value is the median
✓ If the number of observations is even, the median is the average of the two middle values
✓ Middle position = (n + 1)/2

Since the number of observations is odd, the middle value is the median = 3.5

MODE

• Category or value that occurs most frequently in a distribution
• The mode does not always exist; and if it does, it may not be unique
➢ Unimodal
➢ Bimodal
➢ Trimodal
• The mode is not affected by extreme values
• The mode can be used to summarize variables measured in all scales of measurement. However, it is seldom used as a measure of central tendency
• It is used to answer the question: “What is the most frequently occurring value?”
• It is the only measure of central tendency that can be used for nominal values

SUMMARY:

Mean
■ When variable is in interval or ratio scale and without extreme values (too high or too low)
■ When the distribution is normal or is not greatly skewed

Median
■ When data is ordinal
■ When the distribution is badly skewed (with extremely low or high scores)
■ When interested in determining whether cases fall within the upper or lower halves of the distribution

Mode
■ When the data are in nominal scale or categorical in nature
■ When interested in knowing the typical case
■ When the quickest estimate of central value is wanted
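
The three measures can be checked on the sleep-duration data with Python's standard `statistics` module. The second data set, where the longest sleeper is replaced by a 24-hour value, is a made-up outlier added only to show how the mean moves while the median stays put.

```python
import statistics

# Duration of sleep (hours) of the 15 students, from the mean computation.
sleep_hours = [2, 4, 6, 3.5, 2.5, 6.5, 1.5, 5.5, 3, 1.5,
               4.5, 3.7, 4.2, 1.6, 3.4]

mean = statistics.mean(sleep_hours)
median = statistics.median(sleep_hours)
modes = statistics.multimode(sleep_hours)  # list, since the mode may not be unique

# Hypothetical outlier: replace the 6.5-hour value with 24 hours.
with_outlier = [24 if x == 6.5 else x for x in sleep_hours]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print(round(mean, 2), median, modes)
print(round(mean_outlier, 2), median_outlier)
```

`multimode` returns every value tied for the highest frequency, which matches the point that a distribution can be unimodal, bimodal, or trimodal.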

PROPERTIES OF FREQUENCY DISTRIBUTION

• If the mean and the median are equal, the distribution of observations is symmetric (bell-shaped curve) [B]
• If the mean is larger than the median, the distribution is skewed to the right (positively skewed) [C]
• In positively skewed distribution, the mean is pulled towards the extremely high value
• If the mean is smaller than the median, the distribution is skewed to the left (negatively skewed) [A]
• In negatively skewed distribution, the mean is pulled towards the extremely low value
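
The mean-versus-median rule above can be written as a small check. The function `skew_direction`, its `tolerance` parameter, and the three data sets are hypothetical; comparing the two measures is only a rough indicator and does not replace looking at the histogram.

```python
import statistics


def skew_direction(values, tolerance=0.0):
    """Classify a distribution by comparing its mean with its median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if abs(mean - median) <= tolerance:
        return "approximately symmetric"
    if mean > median:
        return "positively skewed"
    return "negatively skewed"


# Hypothetical data sets, for illustration only.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
high_outlier = [1, 2, 2, 3, 3, 3, 4, 4, 20]   # extremely high value
low_outlier = [-12, 2, 2, 3, 3, 3, 4, 4, 5]   # extremely low value

print(skew_direction(symmetric))
print(skew_direction(high_outlier))
print(skew_direction(low_outlier))
```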

Measures of Dispersion / Variability (Spread)

• The mean and median age of Clinic 1 are the same as the mean and median age of Clinic 2
• However, the “ages” in Clinic 1 are more or less concentrated around the mean and quite close to each other, while in Clinic 2, the “ages” are more different from each other – they are more widely spread; this is important to “complete” the summary of quantitative data
• It is important to describe the continuous quantitative data completely by computing for the mean and standard deviation
NOTE: Do the patients in Clinic 1 have the same characteristics as those in Clinic 2 in terms of age distribution? NO.
What do you compute to say that they are different? You need to show the dispersion. Compute for the measures of dispersion: range, standard deviation, coefficient of variation

• Measures of absolute dispersion are expressed in the units of the original observations.
• They cannot be used to compare variations of two data sets when the observations differ in unit of measurements, or when the average values of the observations greatly differ, while the measure of relative dispersion is unit-less

Measures of Absolute Variability:

RANGE

• The simplest measure of variability
• Difference between the highest and lowest value
• Range = highest observation – lowest observation

Range = 6.5 – 1.5 = 5

• Characteristics:
➢ Range has a tendency to increase as the number of observations increases. It uses only the extreme values and does not provide information on clustering or lack of clustering of the values between the extremes
➢ An outlier can greatly affect its value
➢ The range can be computed for variables measured in quantitative scale and on an ordinal scale
• If there is an outlier, compute for the interquartile range

INTERQUARTILE RANGE = Q3 - Q1

Inter-quartile Range = upper quartile – lower quartile
= Q3 - Q1
= 75th percentile – 25th percentile

Q3 is the upper quartile; 75th percentile
Q1 is the lower quartile; 25th percentile
Q2 is the median; 50th percentile
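
A minimal sketch of the range and interquartile range on the sleep-duration data follows. The quartiles are an illustration and were not given in the lecture; quartile conventions differ between textbooks and software, and this sketch assumes the default ("exclusive") method of Python's `statistics.quantiles`.

```python
import statistics

# Duration of sleep (hours) of the 15 students.
sleep_hours = [2, 4, 6, 3.5, 2.5, 6.5, 1.5, 5.5, 3, 1.5,
               4.5, 3.7, 4.2, 1.6, 3.4]

# Range = highest observation - lowest observation.
data_range = max(sleep_hours) - min(sleep_hours)

# Quartiles: n=4 returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(sleep_hours, n=4)
iqr = q3 - q1

print(data_range)
print(q1, q2, q3, iqr)
```

Because the interquartile range ignores the lowest and highest quarters of the data, a single extreme value changes the range but usually leaves the interquartile range untouched.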

Measures of Absolute Variability:

STANDARD DEVIATION

• Average amount of variability in a set of observations
• Average distance from the mean
• The larger the SD, the larger the average distance each data point is from the mean
• Square root of the variance

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

• Each score in this distribution differs from the mean by an average of 1.61 hours
• Each student’s duration of sleep differs by 1.61 hours from the mean, on average
✓ Subtract the mean from each observation (duration of sleep)
✓ Square each difference, then add all the squared values
✓ Divide by n – 1, then get the square root

• Characteristics:
➢ Sensitive to extreme values. Therefore, it is advisable to report the minimum and maximum value together with the SD, especially in the presence of extreme values
➢ It has the same unit as the original observations
➢ Most frequently used measure of variation in medical literature. { Mean (SD), Mean ± SD }
➢ It can be used to summarize quantitative data sets. It should not be used to summarize data sets in ordinal or nominal scale

Measure of Relative Variability

Coefficient of Variation (CV)

• Example: Which is more variable, systolic BP or diastolic BP? Creatinine or potassium?
• Answer: The means are a result of different measures (i.e. SBP has a different measure from DBP; creatinine uses a different unit from potassium), thus we cannot really compare. Then, because the standard deviation is relative to the mean, these two CANNOT by themselves determine which is more variable.
• Instead of using the standard deviation, we use relative variability (coefficient of variation) by simply multiplying the quotient of the SD and the mean by 100%
• CV expresses the SD as a percentage of the mean

$$CV = \frac{SD}{\bar{X}} \times 100\%$$

• Example: Which is more variable?
• Answer: They are more or less the same

• Characteristics
➢ Useful in comparing the results obtained by different persons who are conducting measurements involving the same variable but different scales of measurement (QOL using two different indices)
➢ When there is no variability, SD = 0, CV = 0
✓ When SD = mean, CV = 100%
✓ When SD > mean, CV > 100%
✓ The higher the CV, the higher the variability
➢ CV = (SD/mean) x 100%
✓ It can be used only to summarize quantitative data
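
The standard deviation steps and the CV formula can be traced on the sleep-duration data as follows. The variable names are illustrative, and the CV value is computed here for illustration only; the lecture reports just the SD of 1.61 hours.

```python
import math
import statistics

# Duration of sleep (hours) of the 15 students.
sleep_hours = [2, 4, 6, 3.5, 2.5, 6.5, 1.5, 5.5, 3, 1.5,
               4.5, 3.7, 4.2, 1.6, 3.4]

n = len(sleep_hours)
mean = sum(sleep_hours) / n

# Step 1: subtract the mean from each observation, then square.
squared_deviations = [(x - mean) ** 2 for x in sleep_hours]

# Step 2: add the squared values and divide by n - 1 (sample variance).
variance = sum(squared_deviations) / (n - 1)

# Step 3: take the square root to get the standard deviation.
sd = math.sqrt(variance)

# Coefficient of variation: SD as a percentage of the mean.
cv = sd / mean * 100

print(round(sd, 2), round(cv, 1))

# Cross-check against the library's sample standard deviation.
print(round(statistics.stdev(sleep_hours), 2))
```

Since the CV has no unit, it can be compared across variables measured in different units, which the SD alone cannot do.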

SUMMARY: DATA ANALYSIS PART I

Processing Data - GIGO

Describing data
◦ Type of variables
◦ Independent, dependent, extraneous variables
◦ Nominal, ordinal, interval, ratio
◦ Qualitative, quantitative discrete, quantitative continuous
◦ Frequency Distributions
◦ Shape of the distribution (normal distribution, skewed)
◦ Measures of central tendency (mean, median, mode)
◦ Measures of dispersion (range, std. deviation, coefficient of variation)

END OF TRANSCRIPTION

Transcription Team 2019
Transcribed by: Heidi Cruz
Edited by: …
References: Professor’s PPT, recordings, Batch 2018 transcription
Remarks: Good luck and God bless :)

