Data Analysis
Prepared by: Dr. Essam Morkos
Many people who readily acknowledge variation in their private lives fail to acknowledge the presence of variation in the workplace. Why does a 2.3% drop in hospital admissions from April to May send shock waves through the hospital? Why do hospitals get excited when the volume of procedures drops for three consecutive months? The answer to these questions is actually quite simple: people really do not understand variation!
Wheeler (1993) offers a potential reason for this situation: Managers and workers, educators and students, doctors and nurses, all have one thing in common. They come out of their educational experience knowing how to add, subtract, multiply, and divide;
yet they have no understanding of how to digest numbers to extract knowledge that may be locked up inside the data. In fact, this shortcoming is also seen, to a lesser extent, among engineers and scientists.
Data Management
DATA
Variable
A characteristic or condition that changes or has different values for different individuals, e.g., weight, height, gender, net income.
Data
Ungrouped: data sets
Grouped: frequency tables
Ungrouped Data
Data set (mg/dL): 30.0, 32.0, 33.5, 32.0, 33.0, 29.0, 31.0, 32.5, 34.5, 33.5, 30.5, 30.0, 34.0, 32.0, 35.0, 32.5, 31.5, 29.5, 31.5, 32.0
Grouped Data
Suppose that in thirty shots at a target, a marksman makes the following scores:
52234 13155 43203 24004 03215 54455
Score:          0    1    2    3    4    5
Frequency:      4    3    5    5    6    7
Frequency (%):  13%  10%  17%  17%  20%  23%
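The frequency table above can be reproduced in a few lines. This is a minimal sketch using only Python's standard library, tabulating the marksman's 30 scores:

```python
from collections import Counter

# The marksman's 30 scores, as reported on the slide above.
scores = [int(d) for d in "52234 13155 43203 24004 03215 54455".replace(" ", "")]

counts = Counter(scores)
total = len(scores)

# Frequency and rounded percentage for each possible score 0..5.
table = {score: (counts[score], round(100 * counts[score] / total))
         for score in range(6)}

for score, (freq, pct) in table.items():
    print(f"score {score}: frequency {freq} ({pct}%)")
```

The counts match the table: for example, score 5 occurs 7 times (23%).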
Symmetry
Symmetry is implied when data values are distributed in the same way above and below the middle of the sample.
[Histogram: frequency of sample values, illustrating a symmetric distribution]
Symmetric distributions: are easily interpreted; allow a balanced attitude to outliers, that is, those above and below the middle value (median) can be considered by the same criteria; and allow comparisons of spread or dispersion with similar data sets.
Many standard statistical techniques are appropriate only for a symmetric distributional form.
For this reason, attempts are often made to transform skewed data so that they become roughly symmetric.
Skewness
Skewness is defined as asymmetry in the distribution of the sample data values.
Values on one side of the distribution tend to be further from the 'middle' than values on the other side.
For skewed data, the usual measures of location will give different values;
for example, mode < median < mean would indicate positive (or right) skewness.
Positive (or right) skewness is more common than negative (or left) skewness. If there is evidence of skewness in the data, we can apply transformations,
for example, taking logarithms of positively skewed data.
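The effect of a log transform on skewness can be demonstrated numerically. This sketch uses hypothetical right-skewed data and a simple moment-based skewness measure (not a specific example from the slides):

```python
import math

def sample_skewness(xs):
    """Moment-based skewness: m3 / m2**1.5 (positive = right-skewed)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Hypothetical positively skewed data (a long right tail, e.g., lab values).
raw = [1, 1, 2, 2, 3, 5, 8, 20, 100]
logged = [math.log(x) for x in raw]

# Taking logarithms pulls in the right tail and reduces the skewness.
print(sample_skewness(raw), sample_skewness(logged))
```

The raw data have strong positive skewness; the log-transformed data are much closer to symmetric.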
Transformations of ALT
[Figure: distribution of ALT under transformations — raw data (lambda = 1), logarithmic (lambda = 0), over-log (lambda = -0.5)]
Population
A population is any entire collection of people, animals, plants or things from which we may collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about.
Sample
A sample is a group of units selected from a larger group (the population). By studying the sample it is hoped to draw valid conclusions about the larger group.
Parameter
A parameter is a value, usually unknown (and which therefore has to be estimated), used to represent a certain population characteristic.
For example, the population mean is a parameter that is often used to indicate the average value of a quantity.
Estimate
An estimate is an indication of the value of an unknown quantity based on observed data. More formally, an estimate is the particular value of an estimator that is obtained from a particular sample of data and used to indicate the value of a parameter.
Statistic
A statistic is a quantity that is calculated from a sample of data. It is used to give information about unknown values in the corresponding population.
For example, the average of the data in a sample is used to give information about the overall average in the population from which that sample was drawn.
Sampling Distribution
The sampling distribution describes probabilities associated with a statistic when a random sample is drawn from a population.
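A sampling distribution can be illustrated by simulation. This is a sketch under assumed conditions: a hypothetical population of 10,000 uniform values, from which repeated random samples are drawn and the sample mean computed each time:

```python
import random
import statistics

random.seed(42)

# Hypothetical population: 10,000 uniform values between 0 and 100.
population = [random.uniform(0, 100) for _ in range(10_000)]

def sampling_distribution_of_mean(n, reps=2000):
    """Draw `reps` random samples of size n and return the sample means."""
    return [statistics.mean(random.sample(population, n)) for _ in range(reps)]

means_small = sampling_distribution_of_mean(n=5)
means_large = sampling_distribution_of_mean(n=100)

# Larger samples give a tighter sampling distribution around the population mean.
print(statistics.stdev(means_small), statistics.stdev(means_large))
```

The spread of the simulated sample means shrinks as the sample size grows, which is the key property of the sampling distribution of the mean.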
Normal Distribution
All values are symmetrically distributed around the mean. Characteristic bell-shaped curve. Assumed by many quality statistics.
[Histogram: frequency of values from 29 to 35 mg/dL]
[Figure: two histograms, N = 50 and N = 2000 — which one is Normal? Both!]
Student's t-Distribution
A continuous distribution whose shape is similar to the Normal distribution and that is characterized by its degrees of freedom. It is particularly useful for inferences about the mean.
F test
(ANOVA: Analysis of Variance)
A right-skewed continuous distribution characterized by the degrees of freedom of the numerator and denominator of the ratio that defines it. It is useful for comparing two variances and, via the analysis of variance, for comparing more than two means.
Categorical Data
Categorical data are very common in medical research, arising when individuals are categorized into one of two or more mutually exclusive groups. In a sample of individuals, the number falling into a particular group is called a frequency, so the analysis of categorical data is the analysis of frequencies. When two or more groups are compared, the data are often shown in the form of a frequency table.
If a random variable can take only a finite number of distinct values, then it must be discrete. Examples of discrete random variables include :
The number of children in a family, The number of patients in a doctor's surgery, The number of defective light bulbs in a box of ten.
Binomial Distribution
A binomial random variable is a discrete random variable that is defined when the conditions of a binomial experiment are satisfied. The conditions of a binomial experiment are given in Table below :
1. The total number of trials is fixed in advance; 2. There are just two outcomes of each trial; success and failure; 3. The outcomes of all the trials are statistically independent; 4. All the trials have the same probability of success.
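When those four conditions hold, the probability of each outcome count follows the binomial formula P(X = k) = C(n, k) p^k (1 − p)^(n − k). A minimal sketch with hypothetical values (10 trials, success probability 0.5):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable: n trials, success prob p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical example: 10 independent trials, each with success probability 0.5.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(pmf[5])    # probability of exactly 5 successes: 0.24609375
print(sum(pmf))  # probabilities over all outcomes sum to 1
```

Because the trials are independent with constant success probability, the probabilities over k = 0..n sum to exactly 1.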
Inferential statistics draw conclusions about a population from SAMPLES. Assumptions are made that the sample reflects the population in an unbiased form. Roman notation is used for sample quantities: x̄ (mean), n (sample size), Σ (sum).
Descriptive statistics are a way of summarizing the complexity of the data with a single number.
Descriptive Statistics
Tables Graphic representation Summary Statistics Central Tendency Dispersion
Depicting Variation
There are basically two options for depicting variation: 1. Static displays, and 2. Dynamic displays. Historically, the dominant approach to depicting variation has been to use static displays.
Static Display
A static display of variation occurs when data are presented in :
Tabular fashion, Graphical fashion;
e.g., bar charts, histograms, etc.
Measures of dispersion :
the range, standard deviation
Dynamic Display
A dynamic display occurs when the data are plotted on a run or control chart. A dynamic display allows you to see how the data vary over time.
Quality Control monitors both precision and the accuracy of the data in order to provide reliable results.
Random variation, which is inversely related to precision in measurement, and nonrandom or systematic error, which is related to a distortion in measurement and inversely related to validity in measurement.
Descriptive Statistics
Tables Graphic representation Measures of central tendency Measures of dispersion
Each of these measures describes a group of values in a simple and concise manner, providing a mechanism to quickly see and evaluate the data.
Median :
The central value of a data set arranged in order
Mean :
The calculated average of all the values in a given data set
Mode :
The most frequently occurring value in a data set. If two or more values are repeated at the same frequency, then each of those observations is a mode. In a normal or symmetrical distribution of data, the mean, median, and mode have the same values (or very close).
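These measures of central tendency can be computed directly with Python's standard library, here applied to the 20 readings (mg/dL) from the ungrouped-data example earlier:

```python
import statistics

# The 20 readings (mg/dL) from the ungrouped-data example.
data = [30.0, 32.0, 33.5, 32.0, 33.0, 29.0, 31.0, 32.5, 34.5, 33.5,
        30.5, 30.0, 34.0, 32.0, 35.0, 32.5, 31.5, 29.5, 31.5, 32.0]

print(statistics.mean(data))    # calculated average: 31.975
print(statistics.median(data))  # central value of the ordered data: 32.0
print(statistics.mode(data))    # most frequently occurring value: 32.0
```

For this roughly symmetric data set, the mean, median, and mode are very close, as the slide notes.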
Range
Range is the difference or spread between the highest and lowest observations. It is the simplest measure of dispersion. It makes no assumption about the central tendency of the data.
What is the Range? The range is calculated as the difference between the smallest and the largest values in a set of data. It is heavily influenced by the two most extreme values and ignores the rest of the distribution.
When is the Range Useful? The range is an adequate measure of variation for a small set of data, like class scores for a test. Think of other measures where range might be useful: Salaries for a particular job category; or Indoor versus outdoor temperatures?
Definition of Quartiles :
First quartile: P25
Second quartile: the median, P50
Third quartile: P75
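The quartiles can be computed with the standard library's `statistics.quantiles`, again using the 20 readings (mg/dL) from the ungrouped-data example earlier:

```python
import statistics

# The 20 readings (mg/dL) from the ungrouped-data example.
data = [30.0, 32.0, 33.5, 32.0, 33.0, 29.0, 31.0, 32.5, 34.5, 33.5,
        30.5, 30.0, 34.0, 32.0, 35.0, 32.5, 31.5, 29.5, 31.5, 32.0]

# With n=4, quantiles() returns the three cut points Q1 (P25), Q2 (P50), Q3 (P75).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
print(q1, q2, q3)
```

Note that the second quartile equals the median, as defined above; the exact Q1 and Q3 values depend on the interpolation method (here the "inclusive" method).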
Calculation of Variance
Variance is the measure of variability about the mean. It is calculated as the average squared deviation from the mean.
s² = Σ(xᵢ − x̄)² / (n − 1)
the sum of the squared deviations from the mean, divided by the number of observations minus one (corrected for degrees of freedom)
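The formula above translates directly into code. This sketch computes s² by hand for the 20 readings from the ungrouped-data example and checks it against the library routine:

```python
import statistics

# The 20 readings (mg/dL) from the ungrouped-data example.
data = [30.0, 32.0, 33.5, 32.0, 33.0, 29.0, 31.0, 32.5, 34.5, 33.5,
        30.5, 30.0, 34.0, 32.0, 35.0, 32.5, 31.5, 29.5, 31.5, 32.0]

n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean, divided by (n - 1).
s2 = sum((x - mean) ** 2 for x in data) / (n - 1)

print(s2)
print(statistics.variance(data))  # the library's sample variance agrees
```

Dividing by n − 1 rather than n is the degrees-of-freedom correction mentioned above.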
Summary
In practice, descriptive statistics play a major role
Descriptive statistics are always the first 1-2 tables/figures in a paper. The statistician needs to know about each variable before deciding how to analyze it to answer the research questions.
In any analysis, 90% of the effort goes into setting up the data
Descriptive statistics are part of that 90%
TYPES OF DATA
Independent: each observation from a different subject.
Paired: two observations (e.g., before and after some intervention, or left and right eyes) in the same subject, or in closely related subjects (e.g., siblings for genetics studies).
Clustered: multiple observations on each subject.
When designing a study and conducting analyses, we need to use methods appropriate to the data type.
2. Statistical Significance Tests (Hypothesis Testing): allow us to test a claim about a population parameter. More specifically, we test whether the observed sample statistic differs from some specified value.
Confidence Interval
Confidence intervals contain two parts:
1. An interval within which the population parameter is estimated to fall (estimate ± margin of error).
2. A confidence level, which states the probability that the method used to calculate the interval will contain the population parameter.
If we use a 95% confidence interval, we have used a method that would give the correct answer 95% of the time when using random sampling. In other words, 95% of samples from the sampling distribution would give a confidence interval that contains the population parameter. This does not mean that the estimate is 95% correct! Typically confidence levels of 95% or 99% are used, but sometimes 90% is used as well.
Confidence Limits
Confidence limits are the lower and upper boundaries / values of a confidence interval, i.e., the values which define the range of a confidence interval. The upper and lower bounds of a 95% confidence interval are the 95% confidence limits.
These limits may be taken for other confidence levels, for example, 90%, 99%, 99.9%.
Confidence Level
The confidence level is the probability value associated with a confidence interval. It is often expressed as a percentage. For example, if α = 0.05, then the confidence level is equal to (1 - 0.05) = 0.95, i.e., a 95% confidence level.
Example: Suppose a football poll predicted that, if the match were held today, team X would win 60% of the vote. The pollster might attach a 95% confidence level to the interval 60% plus or minus 3%; i.e., he thinks it very likely that team X would get between 57% and 63% of the total vote.
Confidence Interval
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. The width of the confidence interval gives us some idea about how uncertain we are about the unknown parameter (see precision).
A very wide interval may indicate that more data should be collected before anything very definite can be said about the parameter.
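A confidence interval for a mean can be sketched as estimate ± margin of error. This example uses the 20 readings from the ungrouped-data example and, as a simplifying assumption, the Normal critical value 1.96 (a t critical value would give a slightly wider interval for n = 20):

```python
import math
import statistics

# The 20 readings (mg/dL) from the ungrouped-data example.
data = [30.0, 32.0, 33.5, 32.0, 33.0, 29.0, 31.0, 32.5, 34.5, 33.5,
        30.5, 30.0, 34.0, 32.0, 35.0, 32.5, 31.5, 29.5, 31.5, 32.0]

n = len(data)
mean = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(n)  # standard error of the mean

# Approximate 95% CI: estimate ± (critical value × standard error).
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```

The interval width, 2 × 1.96 × SE, shrinks as the sample size grows, reflecting the precision point made above.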
Hypothesis Test
Setting up and testing hypotheses is an essential part of statistical inference. In order to formulate such a test, usually some theory has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved, for example, claiming that a new drug is better than the current drug for treatment of the same symptoms. In each problem considered, the question of interest is simplified into two competing claims / hypotheses between which we have a choice; the null hypothesis, denoted H0, against the alternative hypothesis, denoted HA.
These two competing claims / hypotheses are not however treated on an equal basis: special consideration is given to the null hypothesis.
Null Hypothesis
The null hypothesis, H0, represents a theory that has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug.
We would write H0: there is no difference between the two drugs on average.
Alternative Hypothesis
The alternative hypothesis, HA, is a statement of what a statistical hypothesis test is set up to establish. For example, in a clinical trial of a new drug, the alternative hypothesis might be that the new drug has a different effect, on average, compared to that of the current drug.
We would write
HA: the two drugs have different effects, on average.
The alternative hypothesis might also be that the new drug is better, on average, than the current drug.
In this case we would write
HA: the new drug is better than the current drug, on average.
The final conclusion once the test has been carried out is always given in terms of the null hypothesis.
We either "Reject H0 in favor of HA" or "Do not reject H0". We never conclude "Reject HA", or even "Accept HA".
Type I Error
In a hypothesis test, a type I error occurs when the null hypothesis is rejected when it is in fact true; that is, H0 is wrongly rejected.
For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug; i.e.
H0: there is no difference between the two drugs on average.
A type I error would occur if we concluded that the two drugs produced different effects when in fact there was no difference between them. The following table gives a summary of the possible results of any hypothesis test:

                     H0 true            H0 false
Reject H0            Type I error       Correct decision
Do not reject H0     Correct decision   Type II error
Type II Error
In a hypothesis test, a type II error occurs when the null hypothesis H0 is not rejected when it is in fact false.
For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug; i.e.
H0: there is no difference between the two drugs on average.
A type II error would occur if it was concluded that the two drugs produced the same effect, i.e., there is no difference between the two drugs on average, when in fact they produced different ones. A type II error is frequently due to sample sizes being too small.
Significance Level
The significance level of a statistical hypothesis test is a fixed probability of wrongly rejecting the null hypothesis H0 when it is in fact true. It is the probability of a type I error and is set by the investigator in relation to the consequences of such an error. That is, we want to make the significance level as small as possible in order to protect the null hypothesis and to prevent, as far as possible, the investigator from inadvertently making false claims. The significance level is usually denoted by α: Significance Level = P(type I error) = α. Usually, the significance level is chosen to be 0.05 (or equivalently, 5%).
P-Value
The probability of having observed your data, or more extreme data, by chance alone when the null hypothesis is true. The P value for a test may also be defined as the smallest value of α for which the null hypothesis can be rejected: if the P value is less than or equal to α, we reject the null hypothesis; if the P value is greater than α, we do not reject the null hypothesis.
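The decision rule can be sketched for the common case of a standard Normal (z) test statistic, using the identity 2·(1 − Φ(|z|)) = erfc(|z|/√2). The observed z value here is hypothetical:

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard Normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))  # == 2 * (1 - Phi(|z|))

alpha = 0.05
z = 2.4  # hypothetical observed test statistic
p = two_sided_p_from_z(z)

# Decision rule: reject H0 when p <= alpha.
print(p, "reject H0" if p <= alpha else "do not reject H0")
```

As a sanity check, z = 1.96 gives a two-sided p-value of about 0.05, matching the usual 5% significance level.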
P-Value Interpretation
Power
The power of a statistical hypothesis test measures the test's ability to reject the null hypothesis when it is actually false, i.e., to make a correct decision. In other words, the power of a hypothesis test is the probability of not committing a type II error. It is calculated by subtracting the probability of a type II error from 1, usually expressed as: Power = 1 - P(type II error) = 1 - β. The maximum power a test can have is 1, the minimum is 0. Ideally we want a test to have high power, close to 1.
ESTIMATION vs TESTING
Sometimes the primary goal is to describe data. Then we are interested in estimation: we estimate parameters such as means, variances, and correlations. When the primary goal is to draw a conclusion about a state of nature or the result of an experiment, we are interested in statistical testing.
Regression Equation
A regression equation allows us to express the relationship between two (or more) variables algebraically. It indicates the nature of the relationship between two (or more) variables. In particular, it indicates the extent to which you can predict some variables by knowing others, or the extent to which some are associated with others. A linear regression equation is usually written
Y = a + bX + e
where Y is the dependent variable, a is the intercept, b is the slope or regression coefficient, X is the independent variable (or covariate), and e is the error term.
The equation will specify the average magnitude of the expected change in Y given a change in X. The regression equation is often represented on a scatterplot by a regression line.
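The intercept a and slope b can be estimated by least squares: b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. A sketch with hypothetical (X, Y) data:

```python
# Least-squares estimates for Y = a + bX, using hypothetical (X, Y) data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.3, 5.9, 8.2, 9.9]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), a = y_bar - b * x_bar
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

print(f"Y = {a:.3f} + {b:.3f}X")
```

Here b estimates the average change in Y for a one-unit change in X, as described above.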
Correlation Coefficient
A correlation coefficient is a number between -1 and 1 which measures the degree to which two variables are linearly related. If there is perfect linear relationship with positive slope between the two variables, we have a correlation coefficient of 1; if there is positive correlation, whenever one variable has a high (low) value, so does the other. If there is a perfect linear relationship with negative slope between the two variables, we have a correlation coefficient of -1; if there is negative correlation, whenever one variable has a high (low) value, the other has a low (high) value. A correlation coefficient of 0 means that there is no linear relationship between the variables. There are a number of different correlation coefficients that might be appropriate depending on the kinds of variables being studied.
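The most common such coefficient, Pearson's r, can be computed directly from its definition. A sketch with hypothetical data illustrating the two extremes described above:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrations of the extremes described above.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfect positive line: r = 1
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # perfect negative line: r = -1
```

A perfect increasing line gives r = 1 and a perfect decreasing line gives r = -1, matching the definition.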
Parametric Statistics
Originally, statisticians introduced parameterised distributions to make calculations easier Statistics based on parameterised distributions, such as the Normal distribution, are termed
parametric statistics
Non-parametric statistics were subsequently introduced to deal with situations where the data do not follow any simple parametric form.
These statistics are based on the data themselves.
Parametric statistics
Parametric statistics can be used on parametric distributions, i.e., those which can be described by parameters. The Gaussian distribution is defined by 2 parameters:
Mean (average): an indication of the center
Standard deviation: an indication of scatter
It is a symmetrical distribution (not skewed):
68.3% of values fall within ±1 SD
95.4% within ±2 SD
99.7% within ±3 SD
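The 68.3% / 95.4% / 99.7% coverage figures follow from the Normal cumulative distribution: the fraction within ±k SD is erf(k/√2). A sketch verifying them with the standard library:

```python
import math

def coverage_within(k):
    """Fraction of a Normal distribution within ±k standard deviations."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k} SD: {coverage_within(k) * 100:.1f}%")
```

The computed fractions (about 0.683, 0.954, and 0.997) match the percentages quoted on the slide.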
Nonparametric Tests
FINAL COMMENTS
Statistics are only helpful if the approach taken is appropriate to the problem at hand. Most statistical procedures are based on some assumptions about the characteristics of the data; these need to be checked. Remember GIGO (garbage in, garbage out).
All of the following statements concerning SPC methodology are true, except:
1. SPC has great potential to help hospitals improving their performance. 2. It was developed by Deming. 3. Mainly deployed through using certain charts, graphs and diagrams. 4. Data collection and measurement are fundamental to SPC.
1. It is inherent in the process. 2. The process which exhibits (Common Cause Variation) is always functioning at an acceptable level. 3. It indicates that the process is stable and predictable within certain limits. 4. In some processes, (Special Cause Variation) may be better than (Common Cause Variation).
1. Height and weight 2. Community-acquired and nosocomial infection rates 3. Surgical or emergency department response time 4. Patient visits in the months of May and June
Questions?
essamwmw77@gmail.com