
Day 4 and part of Day 5

Note: This is a new version of the lecture and replaces the previous one that was given in class!

Central Limit theorem and Hypothesis tests


Instructor: Alireza Malehmir
Uppsala University, Dept. of Earth Sciences, Division of Geophysics, March 23 and 26, 2010

Today's agenda

Central Limit theorem
Variance of sampling (error in sampling)
Confidence interval around the mean
Hypothesis tests:
  Fisher or F-test
  Student or t-test
  Chi-test

Central Limit theorem


General statement:
Let X1, X2, X3, ..., XN be N independent random variables, each with a finite variance $\sigma_1^2, \sigma_2^2, \sigma_3^2, \ldots, \sigma_N^2$. If Y (the data variable) is defined as the arithmetic mean of the Xi, then, under a set of general conditions, as the number of variables N increases the distribution of Y approaches the normal distribution.

Formally: Consider a population with mean $\mu$ and standard deviation $\sigma$, which need not follow the normal distribution. Replicate samples of size $n_s$ drawn from the population have means $\bar{x}_s$ that show a central tendency toward $\mu$. The variance of the sample means $\bar{x}_s$ is given by:

$$\mathrm{var}(\bar{x}_s) = \frac{\sigma^2}{n_s}$$

Its square root, $\sigma/\sqrt{n_s}$, is also known as the standard error.
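A minimal MATLAB sketch (my own illustration, not part of the lecture; the population parameters and sample size are arbitrary choices) checking empirically that the variance of replicate sample means is close to $\sigma^2/n_s$:

% Minimal sketch (illustrative values): variance of sample means vs. sigma^2/n_s
mu = 10; sigma = 4;                      % population mean and standard deviation
ns = 25;                                 % sample size
nrep = 10000;                            % number of replicate samples
samples = mu + sigma*randn(nrep, ns);    % each row is one sample
xbar = mean(samples, 2);                 % one mean per replicate sample
fprintf('empirical var of means: %.3f, sigma^2/ns: %.3f\n', var(xbar), sigma^2/ns);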

Central Limit theorem


The practical importance of the theorem is that many physical phenomena in nature arise from a number of additive variations. The distributions of the individual variations are often unknown; however, the histogram of the summed variables is often observed to be approximately normal. In sampling theory this means that the variance of the mean of n repeated samplings is smaller than the variance of a single sampling.
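A minimal MATLAB sketch (my own illustration, not from the lecture) of the additive-variations point: the sum of several uniform (non-normal) variations already has an approximately bell-shaped histogram:

% Minimal sketch: histogram of summed uniform variations looks roughly normal
nvars = 12;                          % number of additive variations
nobs = 5000;                         % number of observations of the sum
sums = sum(rand(nobs, nvars), 2);    % each observation is a sum of 12 uniforms
hist(sums, 40);                      % histogram is close to a bell shape
title('Histogram of summed uniform variations');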

Central Limit theorem: Example


Sampling of trucks working in a mine gives the following: var(x) = $\sigma^2$ = variance of the ore grade in one truck; N = 100 trucks carry ore to the crusher per shift; one day = 3 shifts. Compare the variance of the shift mean with that of the daily mean:

$$\mathrm{var}_{\text{shift}}(\bar{x}_s) = \frac{\sigma^2}{n_s} = \frac{\sigma^2}{100} = A$$

$$\mathrm{var}_{\text{day}}(\bar{x}_s) = \frac{\sigma^2}{n_s} = \frac{\sigma^2}{100 \times 3} = \frac{\sigma^2}{300} = B$$

$$A > B$$

Confidence interval
In general, whenever we speak about an estimation in statistics or geostatistics, we must also estimate the errors related to that estimation within a predefined level of confidence. This helps to assess and evaluate the estimation.

$$\bar{X} - z\,\sigma_{\bar{X}} < \mu < \bar{X} + z\,\sigma_{\bar{X}}, \qquad \sigma_{\bar{X}} = \frac{\sigma(x)}{\sqrt{n}}$$

z is usually obtained from tables; for example, for a confidence level of 95% it is 1.96.
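A minimal MATLAB sketch (hypothetical data, not from the lecture) of the confidence interval around the mean; norminv returns z for the chosen confidence level:

% Minimal sketch (hypothetical measurements): 95% confidence interval for the mean
x = [2.1 2.4 1.9 2.8 2.2 2.5 2.0 2.6];
n = length(x);
xbar = mean(x);
se = std(x)/sqrt(n);                 % standard error of the mean
z = norminv(0.975);                  % two-sided 95% confidence -> 1.96
ci = [xbar - z*se, xbar + z*se];
fprintf('mean = %.2f, 95%% CI = [%.2f, %.2f]\n', xbar, ci(1), ci(2));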

Confidence interval: Example


In a thorium sampling campaign, within a confidence level of 95%, what is the minimum acceptable data value around the mean? N_sample = 169, mean = 2700 g/ton, standard deviation = 900 g/ton.
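One way to work this out (my own computation using the interval above; the slide itself does not give the answer):

$$\bar{X} - z\,\frac{\sigma}{\sqrt{n}} = 2700 - 1.96 \times \frac{900}{\sqrt{169}} = 2700 - 1.96 \times 69.2 \approx 2564 \ \text{g/ton}$$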

Warming-up exercise
Let X1, X2, ..., Xn be random variables drawn from a population with mean $\mu$ and standard deviation $\sigma$. If $\bar{X}$ is the sample mean, prove that:

$$E(\bar{X}) = \mu, \qquad \mathrm{var}(\bar{X}) = \frac{\sigma^2}{n}$$
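A sketch of the proof (my own working, assuming the $X_i$ are independent and identically distributed):

$$E(\bar{X}) = E\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{n\mu}{n} = \mu$$

$$\mathrm{var}(\bar{X}) = \mathrm{var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$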

Regression: a line fitted to the data, which can be expressed as a straight line, e.g., y = ax + b. MATLAB has a function called polyfit to estimate a and b (and higher polynomial orders). Read this on your own!
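A minimal MATLAB sketch (synthetic data, not from the lecture) of fitting y = ax + b with polyfit:

% Minimal sketch (synthetic data): fit a straight line y = a*x + b
x = 1:10;
y = 2.5*x + 1 + randn(1, 10);     % noisy data generated around a = 2.5, b = 1
p = polyfit(x, y, 1);             % p(1) is the slope a, p(2) the intercept b
yfit = polyval(p, x);             % evaluate the fitted line
fprintf('a = %.2f, b = %.2f\n', p(1), p(2));
plot(x, y, 'o', x, yfit, '-');    % data points and fitted line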

Let's take a break!

(see page 94 for more info)

Hypothesis tests

These tests are used to analyze whether, for example, there is a significant difference between two sets of data samples (independent variances), or whether two different approaches give similar results. Simply put: is there any systematic error in the measurements, or is everything due to random error or chance? Fisher test: used to compare two independent variances:

$$F_t = \frac{\mathrm{var}(X_1)}{\mathrm{var}(X_2)}, \qquad \mathrm{var}(X_1) > \mathrm{var}(X_2)$$


After calculating Ft, you need to obtain Fc (the critical value) for 95% confidence from Table 4.6.

Degree of freedom
What is the degree of freedom in statistics? It is the number of data values in the final calculation of a statistic that are free to vary. When you learned to calculate the variance of a sample population, you used the following formula, in which n - 1 is the degree of freedom:

$$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$

In this case, because the sample mean was used as an estimate of the population mean, we have one fewer degree of freedom than the number of independent observations. If both the mean and the variance are estimated in this way, the degrees of freedom become n - 2.
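A small MATLAB aside (my own illustration): var divides by n - 1 by default, i.e. it uses n - 1 degrees of freedom, while var(x, 1) divides by n:

% Minimal sketch: MATLAB's var uses n-1 degrees of freedom by default
x = [3 5 7 9 11];
v1 = var(x);          % divides by n-1 (sample variance)
v0 = var(x, 1);       % divides by n   (population formula)
fprintf('n-1 version: %.2f, n version: %.2f\n', v1, v0);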

Fisher test
If Ft < Fc: for example, the difference between the two series of datasets can be neglected (the null hypothesis may hold).

If Ft > Fc: for example, the difference between the two series of datasets is significant and there may be a systematic error (the null hypothesis can be rejected).

Degrees of freedom: Qa = n1 - 1, Qb = n2 - 1
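A minimal MATLAB sketch (hypothetical data; the lecture's own notes on finv follow below) of the Fisher test at 95% confidence:

% Minimal sketch (hypothetical data): Fisher test for two independent variances
X1 = 3*randn(1, 25);              % first dataset (larger spread)
X2 = 2*randn(1, 30);              % second dataset
if var(X1) < var(X2)              % put the larger variance in the numerator
    [X1, X2] = deal(X2, X1);
end
Ft = var(X1)/var(X2);             % test statistic
Qa = length(X1) - 1;              % degrees of freedom
Qb = length(X2) - 1;
Fc = finv(0.95, Qa, Qb);          % critical value at 95% confidence
if Ft < Fc
    disp('Difference in variances is not significant (null hypothesis may hold).');
else
    disp('Difference in variances is significant (null hypothesis rejected).');
end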

Student or T-test
This is used when the number of samples is less than 30, two mean values are obtained from two samplings of the population, and the variances obtained are good proxies for the population variance (i.e., they are similar to each other). The degrees of freedom are often taken as n1 + n2 - 1, or, if n1 + n2 is less than 30, calculated from equation 4.13 on page 94.

The formula below applies only when n1, n2 < 30; if n1, n2 > 30, use eq. 4.12:

$$t_t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\mathrm{var}(X_1)}{n_1} + \dfrac{\mathrm{var}(X_2)}{n_2}}}$$

Find tc (the critical value) from Table 4.5.

Student or T-test
The test was first introduced by William Gosset (1876-1937) for small numbers of samples; it compares the means of two distributions. W. Gosset was working for the Irish Guinness Brewery, where he was not allowed to publish his research results, so he published his t-distribution under the pseudonym Student (1908).
- In this test the variances of the two sets of measurements should be similar.
- It can be used to test the hypothesis that both samples come from the same population, e.g., the same lithologic unit (null hypothesis), or from two different populations (alternative hypothesis).

Student test (to compare two sample means)

If tt < tc: for example, the difference between the two series of datasets can be neglected (the null hypothesis may hold).

If tt > tc: for example, the difference between the two series of datasets is significant and there may be a systematic error (the null hypothesis may be rejected).

T and F tests in MATLAB


MATLAB does not provide a ready-to-use F-test function, but one can easily be implemented. The most useful function here is finv, which calculates F critical values from the degrees of freedom and the significance level. I will give you an example in class. E.g. (the Qs are degrees of freedom):

Qa = length(X) - 1; Qb = length(Y) - 1;
Fc = finv(0.95, Qa, Qb);

For the t-test MATLAB has a function called ttest2 that computes the quantities required for the hypothesis test:

[h, significance, ci] = ttest2(X, Y, 0.05);

if h = 0: the null hypothesis may hold; if h = 1: the null hypothesis can be rejected.

Example
With 95% confidence, find out if there is a significant difference between the following two samplings:

                 Series 1   Series 2
Variance         8.34       2.14
Mean             9.12       9.54
No. of samples   28         28

$$t_t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\mathrm{var}(X_1)}{n_1} + \dfrac{\mathrm{var}(X_2)}{n_2}}} = \frac{9.54 - 9.12}{\sqrt{\dfrac{8.34}{28} + \dfrac{2.14}{28}}} \approx 0.69$$

From Table 4.5, tc can be interpolated for 95% confidence to be about 1.68. Since tt < tc, the difference between the two series is not significant (the null hypothesis may hold).
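A minimal MATLAB sketch (my own working of this example from the summary statistics; the critical value comes from tinv rather than Table 4.5, and the degrees of freedom follow the lecture's n1 + n2 - 1 convention):

% Minimal sketch: t-test for the example above, from summary statistics
v1 = 8.34;  v2 = 2.14;                      % variances of series 1 and 2
m1 = 9.12;  m2 = 9.54;                      % means
n1 = 28;    n2 = 28;                        % numbers of samples
tt = abs(m1 - m2) / sqrt(v1/n1 + v2/n2);    % test statistic, ~0.69
df = n1 + n2 - 1;                           % degrees of freedom (lecture convention)
tc = tinv(0.95, df);                        % critical value, ~1.67
if tt < tc
    disp('Difference is not significant: null hypothesis may hold.');
else
    disp('Difference is significant: null hypothesis may be rejected.');
end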

Chi-test

(a topic for the presentation!)

It was introduced by Karl Pearson (1900) and aims to find out whether two distributions are derived from the same population. The test is independent of the distribution used.
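A minimal MATLAB sketch (hypothetical counts, my own illustration of the Pearson chi-square idea, not from the lecture):

% Minimal sketch (hypothetical counts): Pearson chi-square test at 95% confidence
observed = [18 22 30 25 15];                         % observed counts in 5 classes
expected = [20 20 28 24 18];                         % expected counts under the null hypothesis
chi2 = sum((observed - expected).^2 ./ expected);    % test statistic
df = length(observed) - 1;                           % degrees of freedom
chi2c = chi2inv(0.95, df);                           % critical value
if chi2 < chi2c
    disp('Consistent with the same population (null hypothesis may hold).');
else
    disp('Distributions differ significantly (null hypothesis rejected).');
end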

That is it for today!
