The Anova analysis tools provide different types of variance analysis. The tool that you should use depends on the number of factors and the number of samples that you have from the populations that you want to test.
Whether the heights of plants for the different fertilizer brands are drawn from the same underlying population. Temperatures are ignored for this analysis.
Whether the heights of plants for the different temperature levels are drawn from the same underlying population. Fertilizer brands are ignored for this analysis.
Whether, having accounted for the effects of differences between fertilizer brands found in the first bulleted point and differences in temperatures found in the second bulleted point, the six samples representing all pairs of {fertilizer, temperature} values are drawn from the same population. The alternative hypothesis is that there are effects due to specific {fertilizer, temperature} pairs over and above the differences that are based on fertilizer alone or on temperature alone.
Correlation
The CORREL and PEARSON worksheet functions both calculate the correlation coefficient between two measurement variables when measurements on each variable are observed for each of N subjects. (Any missing observation for any subject causes that subject to be ignored in the analysis.) The Correlation analysis tool is particularly useful when there are more than two measurement variables for each of N subjects. It provides an output table, a correlation matrix, that shows the value of CORREL (or PEARSON) applied to each possible pair of measurement variables.
The correlation coefficient, like the covariance, is a measure of the extent to which two measurement variables "vary together." Unlike the covariance, the correlation coefficient is scaled so that its value is independent of the units in which the two measurement variables are expressed. (For example, if the two measurement variables are weight and height, the value of the correlation coefficient is unchanged if weight is converted from pounds to kilograms.) The value of any correlation coefficient must be between -1 and +1 inclusive.
You can use the correlation analysis tool to examine each pair of measurement variables to determine whether the two measurement variables tend to move together, that is, whether large values of one variable tend to be associated with large values of the other (positive correlation), whether small values of one variable tend to be associated with large values of the other (negative correlation), or whether values of both variables tend to be unrelated (correlation near 0 (zero)).
Exponential Smoothing
The Exponential Smoothing analysis tool predicts a value that is based on the forecast for the prior period, adjusted for the error in that prior forecast. The tool uses the smoothing constant a, the magnitude of which determines how strongly the forecasts respond to errors in the prior forecast.
NOTE Values of 0.2 to 0.3 are reasonable smoothing constants. These values indicate that the current forecast should be adjusted 20 percent to 30 percent for error in the prior forecast. Larger constants yield a faster response but can produce erratic projections. Smaller constants can result in long lags for forecast values.
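The adjustment described above can be sketched in a few lines of Python. This is a minimal sketch, not the tool's own implementation: it assumes the update rule "new forecast = prior forecast + a × (prior actual - prior forecast)", and it seeds the first forecast with the first actual value, which is an assumption because the text does not say how the tool starts the series. The function name `exponential_smoothing` is hypothetical.

```python
def exponential_smoothing(actuals, a=0.3):
    """Return one forecast per period; forecasts[t] is the forecast for period t.

    Assumption: the first forecast is seeded with the first actual value.
    """
    if not actuals:
        return []
    forecasts = [actuals[0]]
    for actual in actuals[:-1]:
        prior_forecast = forecasts[-1]
        error = actual - prior_forecast          # error in the prior forecast
        forecasts.append(prior_forecast + a * error)
    return forecasts


sales = [10.0, 12.0, 11.0, 15.0]
forecasts = exponential_smoothing(sales, a=0.5)
```

A larger `a` moves each forecast further toward the most recent actual value, which is the faster but more erratic response mentioned in the note.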
Fourier Analysis
The Fourier Analysis tool solves problems in linear systems and analyzes periodic data by using the Fast Fourier Transform (FFT) method to transform data. This tool also supports inverse transformations, in which the inverse of transformed data returns the original data.
Histogram
The Histogram analysis tool calculates individual and cumulative frequencies for a cell range of data and data bins. This tool generates data for the number of occurrences of a value in a data set. For example, in a class of 20 students, you can determine the distribution of scores in letter-grade categories. A histogram table presents the letter-grade boundaries and the number of scores between the lowest bound and the current bound. The single most-frequent score is the mode of the data.
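As a rough illustration of the individual and cumulative frequencies described above, the following Python sketch counts values per bin. It assumes each bin boundary is an inclusive upper bound and that values above the highest boundary fall into one extra overflow slot; the function name `histogram` and the sample scores are made up for the example.

```python
from bisect import bisect_left
from itertools import accumulate


def histogram(values, bin_bounds):
    """Count values per bin, treating each bound as an inclusive upper limit.

    The final slot collects values above the highest bound (an assumption).
    """
    bounds = sorted(bin_bounds)
    counts = [0] * (len(bounds) + 1)
    for value in values:
        # Index of the first bound that is >= value.
        counts[bisect_left(bounds, value)] += 1
    cumulative = list(accumulate(counts))
    return counts, cumulative


scores = [55, 62, 68, 70, 71, 79, 80, 85, 91, 98]
counts, cumulative = histogram(scores, [59, 69, 79, 89])
```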
Moving Average
The Moving Average analysis tool projects values in the forecast period, based on the average value of the variable over a specific number of preceding periods. A moving average provides trend information that a simple average of all historical data would mask. Use this tool to forecast sales, inventory, or other trends. Each forecast value is based on the following formula.

F_{t+1} = (1/N) Σ_{j=1}^{N} A_{t-j+1}

where:
N is the number of prior periods to include in the moving average
A_j is the actual value at time j
F_j is the forecasted value at time j
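Using the definitions of N, the actual values and the forecast values given above, a moving-average forecast can be sketched as follows. The sketch assumes the forecast for the next period is the plain mean of the N most recent actual values; the function name `moving_average_forecast` and the demand figures are illustrative only.

```python
def moving_average_forecast(actuals, n):
    """Return forecasts where each entry is the mean of the n most recent actuals.

    Entry i is the forecast for period n + i (the period after the window ends).
    """
    if n <= 0:
        raise ValueError("n must be a positive number of periods")
    return [
        sum(actuals[t - n + 1 : t + 1]) / n
        for t in range(n - 1, len(actuals))
    ]


demand = [3.0, 5.0, 7.0, 9.0, 11.0]
forecasts = moving_average_forecast(demand, 3)
```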
Regression
The Regression analysis tool performs linear regression analysis by using the "least squares" method to fit a line through a set of observations. You can analyze how a single dependent variable is affected by the values of one or more independent variables. For example, you can analyze how an athlete's performance is affected by such factors as age, height, and weight. You can apportion shares in the performance measure to each of these three factors, based on a set of performance data, and then use the results to predict the performance of a new, untested athlete. The Regression tool uses the worksheet function LINEST.
Sampling
The Sampling analysis tool creates a sample from a population by treating the input range as a population. When the population is too large to process or chart, you can use a representative sample. You can also create a sample that contains only the values from a particular part of a cycle if you believe that the input data is periodic. For example, if the input range contains quarterly sales figures, sampling with a periodic rate of four places the values from the same quarter in the output range.
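The quarterly example can be sketched in Python with a slice. This assumes that periodic sampling with rate n takes the n-th value and every n-th value after it; the function name `periodic_sample` and the sales figures are hypothetical.

```python
def periodic_sample(values, period):
    """Return the period-th value and every period-th value after it."""
    if period <= 0:
        raise ValueError("period must be positive")
    return values[period - 1 :: period]


# Three years of quarterly sales, in order Q1, Q2, Q3, Q4, Q1, ...
quarterly_sales = [10, 20, 30, 40, 11, 21, 31, 41, 12, 22, 32, 42]
fourth_quarters = periodic_sample(quarterly_sales, 4)
```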
t-Test
The Two-Sample t-Test analysis tools test for equality of the population means that underlie each sample. The three tools employ different assumptions: that the population variances are equal, that the population variances are not equal, and that the two samples represent before-treatment and after-treatment observations on the same subjects.
For all three tools below, a t-Statistic value, t, is computed and shown as "t Stat" in the output tables. Depending on the data, this value, t, can be negative or nonnegative. Under the assumption of equal underlying population means, if t < 0, "P(T <= t) one-tail" gives the probability that a value of the t-Statistic would be observed that is more negative than t. If t >= 0, "P(T <= t) one-tail" gives the probability that a value of the t-Statistic would be observed that is more positive than t. "t Critical one-tail" gives the cutoff value, so that the probability of observing a value of the t-Statistic greater than or equal to "t Critical one-tail" is Alpha.
"P(T <= t) two-tail" gives the probability that a value of the t-Statistic would be observed that is larger in absolute value than t. "t Critical two-tail" gives the cutoff value, so that the probability of an observed t-Statistic larger in absolute value than "t Critical two-tail" is Alpha.
You can use a paired test when there is a natural pairing of observations in the samples, such as when a sample group is tested twice, before and after an experiment. This analysis tool and its formula perform a paired two-sample Student's t-Test to determine whether observations that are taken before a treatment and observations taken after a treatment are likely to have come from distributions with equal population means. This t-Test form does not assume that the variances of both populations are equal.
NOTE Among the results that are generated by this tool is pooled variance, an accumulated measure of the spread of data about the mean, which is derived from the following formula.
The following formula is used to calculate the degrees of freedom, df. Because the result of the calculation is usually not an integer, the value of df is rounded to the nearest integer to obtain a critical value from the t table. The Excel worksheet function TTEST uses the calculated df value without rounding, because it is possible to compute a value for TTEST with a noninteger df. Because of these different approaches to determining the degrees of freedom, the results of TTEST and this t-Test tool will differ in the Unequal Variances case.
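The degrees-of-freedom formula itself is not reproduced in the text. The sketch below assumes it is the standard Welch-Satterthwaite approximation for two samples with unequal variances, and shows why rounding matters: the computed value is usually not an integer. The function name `welch_df` and the sample variances and sizes are made up for illustration.

```python
def welch_df(var1, n1, var2, n2):
    """Welch-Satterthwaite degrees of freedom (assumed formula, see lead-in).

    var1 and var2 are sample variances; n1 and n2 are sample sizes.
    """
    a = var1 / n1
    b = var2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))


df_exact = welch_df(4.0, 10, 9.0, 15)   # non-integer value
df_rounded = round(df_exact)            # what a t-table lookup would use
```

Here the unrounded value is just under 23, so a function that uses it directly and a tool that rounds to 23 would evaluate the t distribution at slightly different degrees of freedom.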
z-Test
The z-Test: Two Sample for Means analysis tool performs a two sample z-Test for means with known variances. This tool is used to test the null hypothesis that there is no difference between two population means against either one-sided or two-sided alternative hypotheses. If variances are not known, the worksheet function ZTEST should be used instead. When you use the z-Test tool, be careful to understand the output. "P(Z <= z) one-tail" is really P(Z >= ABS(z)), the probability of a z-value further from 0 in the same direction as the observed z value when there is no difference between the population means. "P(Z <= z) two-tail" is really P(Z >= ABS(z) or Z <= -ABS(z)), the probability of a z-value further from 0 in either direction than the observed z-value when there is no difference between the population means. The two-tailed result is just the one-tailed result multiplied by 2. The z-Test tool can also be used for the case where the null hypothesis is that there is a specific nonzero value for the difference between the two population means. For example, you can use this test to determine differences between the performances of two car models.
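The reading of the one-tail and two-tail outputs described above can be checked with a short Python sketch. It assumes the usual two-sample z statistic with known variances and builds the standard normal CDF from `math.erf`; the function names and the input numbers are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def z_test_two_sample(mean1, mean2, var1, var2, n1, n2, hypothesized_diff=0.0):
    """Return (z, one-tail p, two-tail p) for known population variances."""
    z = (mean1 - mean2 - hypothesized_diff) / math.sqrt(var1 / n1 + var2 / n2)
    one_tail = 1.0 - normal_cdf(abs(z))   # P(Z >= ABS(z))
    two_tail = 2.0 * one_tail             # P(Z >= ABS(z) or Z <= -ABS(z))
    return z, one_tail, two_tail


z, p_one, p_two = z_test_two_sample(52.0, 50.0, 16.0, 9.0, 16, 9)
```

Passing a nonzero `hypothesized_diff` covers the case where the null hypothesis specifies a particular difference between the two population means.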
The covariance also describes how linear a relationship is between two variables. The main difference between covariance and correlation is the range of values that each can assume. The Correlation between two variables can assume values only between -1 and +1. The Covariance between two variables can assume a value outside of this range. The more positive a covariance is, the more closely the variables move in the same direction. Conversely, the more negative a covariance is, the more the variables move in opposite directions. Two independent variables will have a zero Covariance. A Covariance of zero does not mean that two variables are independent, though: the two variables may have a nonlinear relationship that the Covariance calculation does not pick up at all. The Covariance calculation between two variables depends heavily on the scale in which the two variables are measured. This is the main disadvantage of using Covariance instead of Correlation to compare two variables. The Correlation Coefficient is not dependent upon the scale used and provides the ability to compare different sets of data consistently.
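The scale dependence described above is easy to demonstrate. The following Python sketch uses the population form of the covariance (dividing by n, an assumption; the sample form divides by n - 1 and leads to the same conclusion) and converts weights from pounds to kilograms. The data and function names are invented for the example.

```python
import math


def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Correlation coefficient: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


weights_lb = [120.0, 150.0, 180.0, 200.0]
heights_in = [62.0, 66.0, 70.0, 71.0]
LB_TO_KG = 0.45359237
weights_kg = [w * LB_TO_KG for w in weights_lb]

cov_lb = covariance(weights_lb, heights_in)
cov_kg = covariance(weights_kg, heights_in)    # shrinks by the unit factor
corr_lb = correlation(weights_lb, heights_in)
corr_kg = correlation(weights_kg, heights_in)  # unchanged by the unit change
```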
Regression
If 95% of the t distribution is closer to the mean than the t-value on the coefficient you are looking at, then you have a P value of 5%. This is also referred to as a significance level of 5%. The P value is the probability of seeing a result as extreme as the one you are getting (a t value as large as yours) in a collection of random data in which the variable had no effect. A P of 5% or less is the generally accepted point at which to reject the null hypothesis. With a P value of 5% (or .05) there is only a 5% chance that the results you are seeing would have come up in a random distribution, so you can say with a 95% probability of being correct that the variable is having some effect, assuming your model is specified correctly. The 95% confidence interval for your coefficients shown by many regression packages gives you the same information. You can be 95% confident that the real, underlying value of the coefficient that you are estimating falls somewhere in that 95% confidence interval, so if the interval does not contain 0, your P value will be .05 or less.
Note that the size of the P value for a coefficient says nothing about the size of the effect that variable is having on your dependent variable - it is possible to have a highly significant result (very small P-value) for a minuscule effect.
Coefficients
In simple or multiple linear regression, the size of the coefficient for each independent variable gives you the size of the effect that variable is having on your dependent variable, and the sign on the coefficient (positive or negative) gives you the direction of the effect. In regression with a single independent variable, the coefficient tells you how much the dependent variable is expected to increase (if the coefficient is positive) or decrease (if the coefficient is negative) when that independent variable increases by one. In regression with multiple independent variables, the coefficient tells you how much the dependent variable is expected to increase when that independent variable increases by one, holding all the other independent variables constant. Remember to keep in mind the units which your variables are measured in. Note: in forms of regression other than linear regression, such as logistic or probit, the coefficients do not have this straightforward interpretation. Explaining how to deal with these is beyond the scope of an introductory guide.
We then create a new variable in cells C2:C6, cubed household size, as a regressor. Then in cell C1 give it the heading CUBED HH SIZE.
(It turns out that for these data squared HH SIZE has a coefficient of exactly 0.0, so the cube is used.) The spreadsheet cells A1:C6 should look like:
We have regression with an intercept and the regressors HH SIZE and CUBED HH SIZE.
The population regression model is: y = β1 + β2 x2 + β3 x3 + u
It is assumed that the error u is independent with constant variance (homoskedastic) - see EXCEL LIMITATIONS at the bottom.
We wish to estimate the regression line: y = b1 + b2 x2 + b3 x3
The only change over one-variable regression is to include more than one column in the Input X Range. Note, however, that the regressors need to be in contiguous columns (here columns B and C). If this is not the case in the original data, then columns need to be copied to get the regressors in contiguous columns. Hitting OK we obtain
Adjusted R Square   0.605016   Adjusted R2 used if more than one x variable
Standard Error      0.444401   This is the sample estimate of the standard deviation of the error u
Observations        5          Number of observations used in the regression (n)
The above gives the overall goodness-of-fit measures: R2 = 0.8025 Correlation between y and y-hat is 0.8958 (when squared gives 0.8025). Adjusted R2 = R2 - (1-R2 )*(k-1)/(n-k) = .8025 - .1975*2/2 = 0.6050.
The standard error here refers to the estimated standard deviation of the error term u. It is sometimes called the standard error of the regression. It equals sqrt(SSE/(n-k)). It is not to be confused with the standard error of y itself (from descriptive statistics) or with the standard errors of the regression coefficients given below. R2 = 0.8025 means that 80.25% of the variation of yi around ybar (its mean) is explained by the regressors x2i and x3i.
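The two formulas above (adjusted R2 and the standard error of the regression) can be checked with a few lines of Python. The inputs are the rounded figures quoted in the text (R2 = 0.8025, residual sum of squares 0.3950, n = 5, k = 3), so the results agree with the Excel output only to about four decimal places. The function names are hypothetical.

```python
import math


def adjusted_r_squared(r2, n, k):
    """Adjusted R2 = R2 - (1 - R2) * (k - 1) / (n - k); k counts the intercept."""
    return r2 - (1.0 - r2) * (k - 1) / (n - k)


def regression_standard_error(residual_ss, n, k):
    """Estimated standard deviation of the error term: sqrt(SSE / (n - k))."""
    return math.sqrt(residual_ss / (n - k))


adj_r2 = adjusted_r_squared(0.8025, n=5, k=3)
std_error = regression_standard_error(0.3950, n=5, k=3)
```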
             df   SS       MS       F        Significance F
Regression   2    1.6050   0.8025   4.0635   0.1975
Residual     2    0.3950   0.1975
Total        4    2.0
The ANOVA (analysis of variance) table splits the sum of squares into its components.
Total sum of squares = Residual (or error) sum of squares + Regression (or explained) sum of squares.
Thus Σi (yi - ybar)^2 = Σi (yi - yhati)^2 + Σi (yhati - ybar)^2
where yhati is the value of yi predicted from the regression line and ybar is the sample mean of y.
For example:
R2 = 1 - Residual SS / Total SS (general formula for R2)
= 1 - 0.3950 / 2.0 (from data in the ANOVA table)
= 0.8025 (which equals R2 given in the Regression Statistics table).
The column labeled F gives the overall F-test of H0: β2 = 0 and β3 = 0 versus Ha: at least one of β2 and β3 does not equal zero.
Aside: Excel computes this as F = [Regression SS/(k-1)] / [Residual SS/(n-k)] = [1.6050/2] / [.39498/2] = 4.0635.
The column labeled Significance F has the associated P-value. Since 0.1975 > 0.05, we do not reject H0 at significance level 0.05.
Note: Significance F in general = FDIST(F, k-1, n-k) where k is the number of regressors including the intercept. Here FDIST(4.0635,2,2) = 0.1975.
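The ANOVA arithmetic can be reproduced in Python from the two sums of squares. The sketch takes R2 as 1 minus the residual sum of squares over the total sum of squares. For the upper-tail probability of F it uses a closed form that holds only when the numerator degrees of freedom equal 2, as they do here (k - 1 = 2); a general solution would need a statistics library. The function names are hypothetical, and the inputs are the rounded table values, so F matches the Excel figure only to about three decimals.

```python
def anova_summary(regression_ss, residual_ss, n, k):
    """Return (total SS, R2, F) for a regression with k coefficients incl. intercept."""
    total_ss = regression_ss + residual_ss
    r2 = 1.0 - residual_ss / total_ss
    f_stat = (regression_ss / (k - 1)) / (residual_ss / (n - k))
    return total_ss, r2, f_stat


def f_upper_tail_numerator_df_2(f_stat, denominator_df):
    """P(F > f_stat) for an F distribution with 2 numerator degrees of freedom.

    This closed form is valid only for numerator df = 2.
    """
    return (1.0 + 2.0 * f_stat / denominator_df) ** (-denominator_df / 2.0)


total_ss, r2, f_stat = anova_summary(1.6050, 0.3950, n=5, k=3)
significance_f = f_upper_tail_numerator_df_2(f_stat, denominator_df=5 - 3)
```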
INTERPRET REGRESSION COEFFICIENTS TABLE
The regression output of most interest is the following table of coefficients and associated output:

                Standard Error   t Stat   P-value   Lower 95%
Intercept       0.76440          1.1729   0.3616    -2.3924
HH SIZE         0.42270          0.7960   0.5095    -1.4823
CUBED HH SIZE   0.01311          0.1594   0.8880    -0.0543
Let βj denote the population coefficient of the jth regressor (intercept, HH SIZE and CUBED HH SIZE). Then
Column "Coefficient" gives the least squares estimates of βj.
Column "Standard error" gives the standard errors (i.e. the estimated standard deviation) of the least squares estimates bj of βj.
Column "t Stat" gives the computed t-statistic for H0: βj = 0 against Ha: βj ≠ 0. This is the coefficient divided by the standard error. It is compared to a t with (n-k) degrees of freedom where here n = 5 and k = 3.
Column "P-value" gives the p-value for the test of H0: βj = 0 against Ha: βj ≠ 0. This equals Pr{|t| > |t-Stat|} where t is a t-distributed random variable with n-k degrees of freedom and t-Stat is the computed value of the t-statistic given in the previous column. Note that this p-value is for a two-sided test. For a one-sided test divide this p-value by 2 (also checking the sign of the t-Stat).
Columns "Lower 95%" and "Upper 95%" values define a 95% confidence interval for βj.
CONFIDENCE INTERVALS FOR SLOPE COEFFICIENTS
95% confidence interval for slope coefficient β2 is from Excel output (-1.4823, 2.1552).
Excel computes this as b2 ± t_.025(2) × se(b2)
= 0.33647 ± TINV(0.05, 2) × 0.42270
= 0.33647 ± 4.303 × 0.42270
= 0.33647 ± 1.8189
= (-1.4823, 2.1552).
Other confidence intervals can be obtained. For example, to find 99% confidence intervals: in the Regression dialog box (in the Data Analysis Add-in), check the Confidence Level box and set the level to 99%.
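The interval arithmetic can be reproduced without Excel. Because this example has exactly 2 degrees of freedom, the two-sided critical value of the t distribution has a simple closed form, which the sketch below uses in place of TINV; the closed form is valid only for 2 degrees of freedom and is not a general substitute. The function names are hypothetical; the coefficient and standard error are the values quoted in the text.

```python
import math


def t_critical_df2(alpha):
    """Two-sided critical value c with P(|T| > c) = alpha, for T ~ t with 2 df.

    Closed form valid only for 2 degrees of freedom.
    """
    u = 1.0 - alpha
    return u * math.sqrt(2.0 / (1.0 - u * u))


def confidence_interval(estimate, std_error, alpha):
    """Return (lower, upper) bounds of the (1 - alpha) confidence interval."""
    margin = t_critical_df2(alpha) * std_error
    return estimate - margin, estimate + margin


b2, se_b2 = 0.33647, 0.42270
lower_95, upper_95 = confidence_interval(b2, se_b2, alpha=0.05)
lower_99, upper_99 = confidence_interval(b2, se_b2, alpha=0.01)
```

The 99% interval is much wider than the 95% interval here because, with only 2 degrees of freedom, the critical value grows quickly as the confidence level rises.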
TEST HYPOTHESIS OF ZERO SLOPE COEFFICIENT ("TEST OF STATISTICAL SIGNIFICANCE")
The coefficient of HH SIZE has estimated standard error of 0.4227, t-statistic of 0.7960 and p-value of 0.5095. It is therefore statistically insignificant at significance level α = .05 as p > 0.05.
The coefficient of CUBED HH SIZE has estimated standard error of 0.0131, t-statistic of 0.1594 and p-value of 0.8880. It is therefore statistically insignificant at significance level α = .05 as p > 0.05.
There are 5 observations and 3 regressors (intercept, HH SIZE and CUBED HH SIZE) so we use t(5-3)=t(2). For example, for HH SIZE p = TDIST(0.796,2,2) = 0.5095.
TEST HYPOTHESIS ON A REGRESSION PARAMETER
Here we test whether HH SIZE has coefficient β2 = 1.0.
Example: H0: β2 = 1.0 against Ha: β2 ≠ 1.0 at significance level α = .05.
Then t = (b2 - H0 value of β2) / (standard error of b2) = (0.33647 - 1.0) / 0.42270 = -1.569.
Using the p-value approach
p-value = TDIST(1.569, 2, 2) = 0.257. [Here n=5 and k=3 so n-k=2].
Do not reject the null hypothesis at level .05 since the p-value is > 0.05.
Using the critical value approach
We computed t = -1.569.
The critical value is t_.025(2) = TINV(0.05,2) = 4.303. [Here n=5 and k=3 so n-k=2].
So do not reject the null hypothesis at level .05 since |t| = |-1.569| = 1.569 < 4.303.
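Both approaches to this test can be run side by side in Python. As a sketch under a stated restriction, the p-value and critical value below use closed forms of the t distribution that are valid only for 2 degrees of freedom (the case here, n - k = 2); they stand in for TDIST and TINV and are not general replacements. Function names are hypothetical.

```python
import math


def two_sided_p_value_df2(t_stat):
    """P(|T| > |t_stat|) for T ~ t with 2 degrees of freedom (closed form)."""
    return 1.0 - abs(t_stat) / math.sqrt(2.0 + t_stat ** 2)


def t_critical_df2(alpha):
    """Two-sided critical value for a t distribution with 2 degrees of freedom."""
    u = 1.0 - alpha
    return u * math.sqrt(2.0 / (1.0 - u * u))


b2, se_b2 = 0.33647, 0.42270
hypothesized_value = 1.0
alpha = 0.05

t_stat = (b2 - hypothesized_value) / se_b2
p_value = two_sided_p_value_df2(t_stat)

reject_by_p_value = p_value < alpha
reject_by_critical_value = abs(t_stat) > t_critical_df2(alpha)
```

The two decisions always agree, because the critical value is defined as the point at which the two-sided p-value equals alpha.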
We test H0: β2 = 0 and β3 = 0 versus Ha: at least one of β2 and β3 does not equal zero.
From the ANOVA table the F-test statistic is 4.0635 with p-value of 0.1975. Since the p-value is not less than 0.05 we do not reject the null hypothesis that the regression parameters are zero at significance level 0.05. Conclude that the parameters are jointly statistically insignificant at significance level 0.05.
Note: Significance F in general = FDIST(F, k-1, n-k) where k is the number of regressors including the intercept. Here FDIST(4.0635,2,2) = 0.1975.
PREDICTED VALUE OF Y GIVEN REGRESSORS
Consider case where x = 4 in which case CUBED HH SIZE = x^3 = 4^3 = 64.
yhat = b1 + b2 x2 + b3 x3 = 0.88966 + 0.3365×4 + 0.0021×64 = 2.37006
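The prediction is a direct evaluation of the fitted line. The sketch below assumes the rounded coefficients behind the arithmetic in the text, namely an intercept of 0.88966, a HH SIZE coefficient of 0.3365 and a CUBED HH SIZE coefficient of 0.0021; these are taken as printed rather than recomputed. The function name `predict` is hypothetical.

```python
def predict(b1, b2, b3, hh_size):
    """Evaluate yhat = b1 + b2 * x + b3 * x^3 for household size x."""
    cubed_hh_size = hh_size ** 3
    return b1 + b2 * hh_size + b3 * cubed_hh_size


# Coefficients as printed in the text (rounded), household size x = 4.
yhat = predict(0.88966, 0.3365, 0.0021, hh_size=4)
```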
EXCEL LIMITATIONS
Excel restricts the number of regressors (only up to 16 regressors ??).
Excel requires that all the regressor variables be in adjoining columns. You may need to move columns to ensure this. For example, if the regressors are in columns B and D you need to copy at least one of columns B and D so that they are adjacent to each other.
Excel standard errors and t-statistics and p-values are based on the assumption that the error is independent with constant variance (homoskedastic). Excel does not provide alternatives, such as heteroskedastic-robust or autocorrelation-robust standard errors and t-statistics and p-values. More specialized software such as STATA, EVIEWS, SAS, LIMDEP, PC-TSP, ... is needed.