Note: This is a new version of the lecture and replaces the previous one that was given in class!
Today's agenda

- Central limit theorem
- Variance of sampling (error in sampling)
- Confidence interval around the mean
- Hypothesis tests: Fisher or F-test, Student or t-test, Chi-test
Formally: Consider a population with mean μ and standard deviation σ, which need not follow the normal distribution. Replicate samples of size n_s from the population have means (x̄_s) that show a central tendency to μ. The variance of the sample means (x̄_s) is given by:

var(x̄_s) = σ² / n_s

For example, with n_s = 100 the sampling variance is A = σ²/100; tripling the sample size gives B = σ²/(100 · 3) = σ²/300, i.e., one third of A.
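The lecture's examples use MATLAB, but the effect is easy to demonstrate with a short simulation. Here is a minimal sketch in Python/NumPy; the uniform population and the sample sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: uniform on [0, 1] (deliberately non-normal), variance sigma^2 = 1/12
population_var = 1.0 / 12.0
n_s = 100          # size of each replicate sample
n_reps = 20000     # number of replicate samples

# Means of replicate samples of size n_s
sample_means = rng.uniform(0.0, 1.0, size=(n_reps, n_s)).mean(axis=1)

observed = sample_means.var()
predicted = population_var / n_s  # var(x_bar_s) = sigma^2 / n_s

print(f"observed  var of sample means: {observed:.6f}")
print(f"predicted sigma^2 / n_s:       {predicted:.6f}")
```

The observed variance of the sample means comes out close to σ²/n_s even though the population itself is not normal, which is the point of the central limit theorem.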
Confidence interval
In general, whenever we make an estimate in statistics or geostatistics, we must also estimate the errors associated with it at a predefined level of confidence. This helps to assess/evaluate the quality of the estimate.
The standard error of the mean is

σ(x̄) = σ(x) / √n

so the confidence interval around the mean is x̄ ± z · σ(x̄), where z depends on the chosen confidence level (z ≈ 1.96 for 95%).
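As a sketch (Python/SciPy here, though the lecture uses MATLAB; the sample is made up), a 95% confidence interval around a sample mean can be computed from the standard error:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=50)  # hypothetical sample

n = x.size
mean = x.mean()
se = x.std(ddof=1) / np.sqrt(n)   # standard error: sigma(x) / sqrt(n)

z = stats.norm.ppf(0.975)         # ~1.96 for a 95% confidence level
ci = (mean - z * se, mean + z * se)
print(f"mean = {mean:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```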
Warm-up exercise
Let X1, X2, …, Xn be random variables sampled from a population with mean μ and standard deviation σ. If X̄ is the sample mean, prove that:

E(X̄) = μ

var(X̄) = σ² / n
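For reference, the derivation uses linearity of expectation and, for the variance step, the independence of the Xi:

```latex
E(\bar{X}) = E\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
           = \frac{1}{n}\sum_{i=1}^{n} E(X_i)
           = \frac{1}{n}\, n\mu = \mu
\qquad
\operatorname{var}(\bar{X}) = \operatorname{var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
           = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{var}(X_i)
           = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}
```

Note that the variance step requires the Xi to be independent (so the covariance terms vanish); the expectation step does not.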
Regression fits a line to the data, which can be expressed as a straight line, e.g., y = ax + b. MATLAB has a function called polyfit to estimate a and b, as well as higher polynomial orders. Read this on your own!
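NumPy's polyfit mirrors MATLAB's: it returns the coefficients highest power first. A minimal sketch with made-up data that roughly follows y = 2x + 1:

```python
import numpy as np

# Hypothetical data roughly following y = a*x + b with a = 2, b = 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 11.0])

# Degree-1 fit returns [a, b], highest power first (as in MATLAB's polyfit)
a, b = np.polyfit(x, y, deg=1)
print(f"a = {a:.3f}, b = {b:.3f}")
```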
Hypothesis tests
These tests are used to analyze whether, for example, there is a significant difference between two sets of data samples (independent variances), or whether two different approaches give similar results. Simply put: is there any systematic error in the measurements, or are all differences due to random error (chance)? The Fisher test compares two independent variances.
Degrees of freedom
What are degrees of freedom in statistics? The degrees of freedom are the number of data values in the final calculation of a statistic that are free to vary. When you learned to calculate the variance of a sample, you used the following formula:

s² = (1 / (n − 1)) · Σ_{i=1}^{n} (xᵢ − x̄)²

In this case, because the sample mean was used as an estimate of the population mean, we have one fewer degree of freedom (n − 1) than the number of independent observations. If both the mean and the variance are estimated from the data, then the degrees of freedom will be n − 2.
Fisher test
If Ft < Fc: the difference between the two series of datasets can be neglected (the null hypothesis may hold).

If Ft > Fc: the difference between the two series of datasets is significant and there may be a systematic error (the null hypothesis can be rejected).

Degrees of freedom: Qa = n1 − 1, Qb = n2 − 1
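A minimal sketch of the F-test in Python/SciPy (the variances and sample sizes below are hypothetical; the usual convention of dividing the larger variance by the smaller one is assumed, so Ft ≥ 1):

```python
from scipy import stats

# Hypothetical variances and sample sizes for two datasets
var1, n1 = 4.2, 25
var2, n2 = 2.8, 25

# Test statistic: larger variance over smaller variance
Ft = max(var1, var2) / min(var1, var2)

# Critical value at 95% confidence with Qa = n1-1 and Qb = n2-1 degrees of freedom
Fc = stats.f.ppf(0.95, n1 - 1, n2 - 1)

print(f"Ft = {Ft:.2f}, Fc = {Fc:.2f}")
if Ft < Fc:
    print("difference in variances can be neglected (null hypothesis may hold)")
else:
    print("difference is significant (null hypothesis can be rejected)")
```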
Student or T-test
This test applies when the number of samples is less than 30, two mean values are obtained from two samplings of the population, and the variances obtained are good proxies for the population variance (i.e., they are similar to each other). The degrees of freedom are often taken as n1 + n2 − 2, or, if n1 + n2 is less than 30, calculated from equation 4.13 on page 94.
Only when n1, n2 < 30 (if n1, n2 > 30, use eq. 4.12):

tt = (X̄1 − X̄2) / √( var(X1)/n1 + var(X2)/n2 )
Student or T-test
The t-test was first introduced by William Gosset (1876–1937) for small numbers of samples. The test compares the means of two distributions. Gosset was working for the Guinness Brewery in Dublin, where he was not allowed to publish his research results, so he published his t-distribution under the pseudonym "Student" (1908).
- In this test, the variances of the two sets of measurements should be similar.
- It can be used to test the hypothesis that both samples come from the same population, e.g., the same lithologic unit (null hypothesis), or from two different populations (alternative hypothesis).
Student test (to compare two sample means)

If tt < tc: the difference between the two series of datasets can be neglected (the null hypothesis may hold).

If tt > tc: the difference between the two series of datasets is significant and there may be a systematic error; the null hypothesis may be rejected.
For the t-test, MATLAB has a function called ttest2 that can estimate the parameters required for the hypothesis test:

[h, significance, ci] = ttest2(X, Y, 0.05);

If h = 0: the null hypothesis may hold. If h = 1: the null hypothesis can be rejected.
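The SciPy equivalent is ttest_ind; its p-value plays the role of MATLAB's "significance" output. A minimal sketch with two made-up samples drawn from the same population:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Two hypothetical samples drawn from the same population
X = rng.normal(loc=9.0, scale=2.0, size=28)
Y = rng.normal(loc=9.0, scale=2.0, size=28)

# Two-sample t-test; p_value plays the role of MATLAB's "significance"
t_stat, p_value = stats.ttest_ind(X, Y)

h = 1 if p_value < 0.05 else 0   # mimic MATLAB's h output at alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, h = {h}")
```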
Example
With 95% confidence, find out if there is a significant difference between the two following samplings:

                Series 1   Series 2
var             8.34       2.14
Mean            9.12       9.54
No. of samples  28         28
tt = (X̄1 − X̄2) / √( var(X1)/n1 + var(X2)/n2 )
From Table 4.5, tc can be interpolated for 95% confidence to be about 1.68.
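The arithmetic of this example can be checked with a short script (Python here, though the lecture uses MATLAB; the numbers are taken from the table above):

```python
import math

mean1, var1, n1 = 9.12, 8.34, 28
mean2, var2, n2 = 9.54, 2.14, 28

# tt = (X1_bar - X2_bar) / sqrt(var(X1)/n1 + var(X2)/n2)
tt = (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)
tc = 1.68  # critical value interpolated from Table 4.5 for 95% confidence

print(f"tt = {tt:.2f}")
if abs(tt) < tc:
    print("difference can be neglected (null hypothesis may hold)")
else:
    print("difference is significant (null hypothesis may be rejected)")
```

Since |tt| ≈ 0.69 is less than tc ≈ 1.68, the difference between the two series is not significant at the 95% confidence level.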
Chi-test
The chi-squared test was introduced by Karl Pearson (1900) and aims to find out whether two distributions are derived from the same population. The test is independent of the distributions involved.
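As a minimal goodness-of-fit sketch in Python/SciPy (the observed counts below are hypothetical, compared against a uniform expectation):

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts per class, e.g., grain counts in six classes
observed = np.array([18, 22, 21, 19, 20, 20])
expected = np.full(6, observed.sum() / 6)  # uniform expectation: 20 per class

# chi2 = sum over classes of (O - E)^2 / E
chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

Here chi2 = (4 + 4 + 1 + 1 + 0 + 0) / 20 = 0.5, and the large p-value means the observed counts are consistent with the expected distribution (the null hypothesis may hold).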