
Abdullah Al Mahmud, MBA 2012
Application of Econometrics (ECON-212F1)

Abstract

In this paper, three data sets are investigated using econometric models (multiple regression, logit, and count models). The first data set concerns poverty rates, the second the determinants of computer use, and the third the determinants of TV hours.

1. Model 1: Poverty Rates

1.1 Preface

Poverty is the state of not having enough money to meet basic needs such as food, clothing, housing, emotional well-being, and education. The poverty rate is the proportion of a population living below the official poverty line. Many factors are related to poverty in the United States, including income, unemployment, family size and status, education, race and ethnicity, tax rates, and illegal immigration. This paper constructs and explains a model that links the factors available in the data to poverty rates.

1.2 Data Overview

We were provided with a data set containing information on poverty rates in 58 counties in California. The data set lets us examine how poverty rates relate to the percentage of urban population, family size, the unemployment rate, high school education, college education, and family income.

The poverty rate (the percentage of families with income below the poverty line) ranges from 3% to 21%, with a mean of 9.90% and a standard deviation of 4.00%. The percentage of urban population ranges from 2.7% to 94.3%, with a mean of 34% and a standard deviation of 19.5%. Family size, expressed in persons per household, ranges from 2 to 3 persons, with a mean of 3 and a standard deviation of 0.24. The unemployment rate ranges from 4% to 21.3%, with a mean of 10% and a standard deviation of 4%. The variable highschl, the percentage of the population aged 25 and over with only a high school education, ranges from 43% to 68.5%, with a mean of 57.6% and a standard deviation of 6.21%. The variable college, the percentage of the population aged 25 and over with four or more years of college education, ranges from 9% to 44%, with a mean of 19% and a standard deviation of 8%. Median family income ranges from $24,364 to $59,147, with a mean of $35,338 and a standard deviation of $8,264.

Before running a model, we check whether the correlations among the variables are significant. All the variables in our data set are continuous, so we use pairwise correlations to describe the relationship between poverty rates and the other variables. The relationship between poverty rates and the percentage of urban population is not significant (r = -0.0277; p > 0.05). There is a significant positive relationship between poverty rates and family size (r = 0.3662; p < 0.05). There is a highly significant positive correlation between poverty rates and the unemployment rate (r = 0.7260; p < 0.001). The relationship between poverty rates and the percentage of the population (25+) with only a high school education is not significant (r = -0.1976; p > 0.05). There is a highly significant negative correlation between poverty rates and the percentage of the population (25+) with 4+ years of college education (r = -0.6189; p < 0.001). There is also a highly significant negative relationship between poverty rates and median family income (r = -0.7820; p < 0.001).

1.3 The Model

All the variables in the data set have a possible relationship to poverty rates. We therefore take the poverty rate (the percentage of families with income below the poverty line) as the dependent variable. The general model, often referred to as the kitchen sink model (Ramanathan, chapter 4), is:

\[ \text{Poverty Rate} = \beta_0 + \beta_1\,\text{urb} + \beta_2\,\text{famsize} + \beta_3\,\text{unemp} + \beta_4\,\text{highschl} + \beta_5\,\text{college} + \beta_6\,\text{medinc} + \varepsilon \]

Before estimating the model, it is useful to discuss the signs I would expect a priori for the regression coefficients [1]. Poverty rates tend to be higher in urban areas, so a larger urban population should mean a higher poverty rate, and we expect $\beta_1$ to be positive. A larger family size increases poverty, so I would expect $\beta_2$ to be positive. When people become unemployed their income falls, which leads to poverty, so $\beta_3$ should be positive. Higher education enables people to earn more and thus reduces poverty, so we would expect $\beta_4$ and $\beta_5$ to be negative. The higher the median family income, the lower the poverty rate, so we would expect $\beta_6$ to be negative.
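As a rough check on the pairwise correlations reported in Section 1.2, the t-statistic for a Pearson correlation can be computed from r and the sample size alone. The sketch below assumes n = 58 counties and uses an approximate two-sided 5% critical value of 2.003 for 56 degrees of freedom; the function name and the hard-coded critical value are illustrative choices, not part of the original analysis.

```python
import math

N_COUNTIES = 58       # sample size reported in Section 1.2
T_CRIT_5PCT = 2.003   # approximate two-sided 5% critical value for df = 56 (assumed)

def correlation_t(r, n):
    """t-statistic for testing that a Pearson correlation is zero."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Correlations with the poverty rate, as reported in the text.
correlations = {
    "urb": -0.0277,
    "famsize": 0.3662,
    "unemp": 0.7260,
    "highschl": -0.1976,
    "college": -0.6189,
    "medinc": -0.7820,
}

significant = {}
for name, r in correlations.items():
    t = correlation_t(r, N_COUNTIES)
    significant[name] = abs(t) > T_CRIT_5PCT
    print(f"{name:9s} r = {r:+.4f}  t = {t:+.2f}  significant at 5%: {significant[name]}")
```

Under these assumptions, urb and highschl fall short of the critical value while the other four variables exceed it, which agrees with the significance pattern described in the text.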

[1] Ramu Ramanathan, Introductory Econometrics with Applications, p. 170.

Table 1: Initial Regression Model

Variable        Coefficient     Standard Error
urb             -0.0187         0.015
famsize          6.0918 *       1.881
unemp           -0.0118         0.119
highschl        -0.1186         0.068
college          0.1711         0.098
medinc          -0.5360 ***     0.070
AIC             232.18
BIC             246.60

Here, * = (p <= 0.05), ** = (p <= 0.01), *** = (p <= 0.001)

This table summarizes the first model with coefficients and significance levels. Urban population, unemployment, high school education, and college education are not individually significant (their p-values exceed 0.05), so they are candidates for removal. A Wald test shows whether these variables contribute to the initial model as a group. The test result is significant (p < 0.05), so we cannot drop all of them at once. Instead, we drop variables one at a time, starting with the largest p-value, because that variable contributes the least to the model. Unemployment has the highest p-value (0.922), so we drop it first and run the regression again.

Table 2: Comparing Models of Poverty Rates

Variable        Model 1       Model 2 (drop unemp)   Model 3 (drop urb)
                Coef.         Coef.                  Coef.
urb             -0.018        -0.018                  -
famsize          6.092 *       6.050 *                5.414 *
unemp           -0.012         -                      -
highschl        -0.119        -0.117                 -0.139 *
college          0.171         0.175                  0.195 *
medinc          -0.536 ***    -0.534 ***             -0.552 ***
R-Square         0.8362        0.8362                 0.8309
Adj R-Square     0.8169        0.8204                 0.8181
AIC             232.18        230.19                 230.03
BIC             246.60        242.55                 240.33

Here, * = (p <= 0.05), ** = (p <= 0.01), *** = (p <= 0.001)

After dropping unemployment, AIC and BIC fall to 230.19 and 242.55, compared with 232.18 and 246.60 for Model 1, so Model 2 is better than Model 1. Model 2 still contains non-significant variables, however. The variable urb is not significant (p = 0.201, above 0.05), so we drop it and run the model again. Model 3 is our final model: it contains only significant variables (p < 0.05) and has the lowest AIC and BIC, 230.03 and 240.33 respectively. The variables in the final model explain about 83.09% of the variability in poverty rates (R2 = 0.8309). The new model also narrows the gap between R-squared and adjusted R-squared (Adj-R2 = 0.8181).
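The adjusted R-squared values for the three models can be reproduced from R-squared, the number of counties, and the number of regressors. The following is a minimal sketch using the standard adjustment formula; the function name is hypothetical, and the R-squared inputs are the rounded figures reported for the models, so the results agree only to about four decimal places.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k regressors, an intercept, and n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

n = 58  # number of counties

# (R-squared, number of regressors) for each model
models = {
    "Model 1": (0.8362, 6),
    "Model 2 (drop unemp)": (0.8362, 5),
    "Model 3 (drop urb)": (0.8309, 4),
}

adj = {name: adjusted_r2(r2, n, k) for name, (r2, k) in models.items()}
for name, value in adj.items():
    r2 = models[name][0]
    print(f"{name:22s} R2 = {r2:.4f}  adj R2 = {value:.4f}  gap = {r2 - value:.4f}")
```

The gap between R-squared and adjusted R-squared shrinks as regressors are removed, because the penalty term depends on the number of regressors k.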

So, our final model becomes:

\[ \text{Poverty Rate} = \beta_0 + 5.414\,\text{famsize} - 0.139\,\text{highschl} + 0.195\,\text{college} - 0.552\,\text{medinc} + \varepsilon \]

Holding the other variables constant, if family size increases by 1 person per household, the poverty rate increases by 5.41 percentage points. If the percentage of the population (25+) with only a high school education increases by 1 percentage point, the poverty rate decreases by 0.139 percentage points. If the percentage of the population (25+) with 4+ years of college education increases by 1 percentage point, the poverty rate increases by 0.195 percentage points, which is the opposite of the sign expected a priori. If median family income increases by $1,000, the poverty rate decreases by 0.552 percentage points.

To check for collinearity among the explanatory variables, we use the Variance Inflation Factor (VIF) test [2]. The mean VIF is 5.56, below the common rule-of-thumb threshold of 10, so I conclude that multicollinearity is not a serious problem. We conclude that this set of variables fits the model best.

Lastly, we checked for heteroscedasticity. Heteroscedastic errors do not bias the OLS coefficient estimates, but they make the usual standard errors unreliable, so t-statistics can be misleading. Before correcting for this problem, we need to know whether it exists. The results show that our model does not have a heteroscedasticity problem (chi2 = 1.78; df = 1; p > 0.05). We conclude that Model 3 is the best model and that we need not adjust it for heteroscedasticity.
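The p-value behind the reported test statistic can be checked by hand. For a chi-square statistic with 1 degree of freedom, the upper-tail probability has a closed form in terms of the complementary error function. The sketch below assumes only that the reported statistic (chi2 = 1.78, df = 1) is referred to a chi-square distribution; the function name is hypothetical.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    # If Z is standard normal, Z**2 is chi-square(1), so
    # P(Z**2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

p_value = chi2_sf_df1(1.78)
print(f"chi2 = 1.78, df = 1, p = {p_value:.3f}")
print("reject homoscedasticity at 5%:", p_value < 0.05)
```

The resulting p-value is well above 0.05, consistent with the decision not to adjust the poverty model for heteroscedasticity.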

Table 3: Poverty Rates Model (Final Model)

Variable        Coefficient     Standard Error
famsize          5.414 *        1.814
highschl        -0.139 *        0.065
college          0.195 *        0.091
medinc          -0.552 ***      0.014
R-Square         0.8309
Adj R-Square     0.8181
AIC             230.03
BIC             240.33

Here, * = (p <= 0.05), ** = (p <= 0.01), *** = (p <= 0.001)

[2] William Crown, Statistical Models for the Social and Behavioral Sciences, chapter 5, p. 75.

2. Model 2: General Social Survey (GSS) from 2004 (Computer Usage)

2.1 Preface

In this study we focus on how factors such as education level, gender, marital status, age, race, and respondent income affect computer usage.

2.2 Data Overview

We were provided with a data set from the 2004 General Social Survey (GSS) covering 51020 respondents. The data set lets us examine how computer use relates to gender, marital status, age, race, highest year of school completed, and respondent's income.

The summary of computer use shows that 64.45 percent of the respondents use a computer (mean 0.64, standard deviation 0.48). The variable female01 shows that 56% of the respondents are female and 44% are male. The variable evermarried01 shows that 80.25% of the respondents have ever married and 19.75% have never married. Respondents' ages range from 16 to 121 years, with a mean of 59.5 years and a standard deviation of 20 years. The variable black01 shows that 18.14% of the respondents are black or other and 81.86% are white. The highest year of school completed ranges from 0 to 20 years, with a mean of 13 years and a standard deviation of 3 years. Lastly, respondents' income ranges from $1,000 to $13,000, with a mean of $9,172 and a standard deviation of $3,447.

Before proceeding with the model, we check whether the relationships among the variables are significant. The chi-square test shows no significant relationship between female01 and computer use (chi2 = 1.39, Pr >= 0.05). The chi-square test between computer use and ever married shows a highly significant relationship. The t-test for computer use and age shows a highly significant relationship (t = 27.61, p <= 0.001). The t-test for computer use and education shows a highly significant relationship (t = -34.81, p <= 0.001). The t-test for computer use and RINCOME shows a highly significant relationship (t = -9.27, p <= 0.001). The chi-square test shows a highly significant relationship between computer use and black01 (chi2 = 44.05, Pr <= 0.001).

2.3 The Model

A logit model is appropriate for this data set because the dependent variable is dichotomous (using or not using a computer), so we use logistic regression. The model is given as:

Use computer = f (female01, evermarri01, age, black01, EDUC, RINCOME)

We would expect the chances of using a computer to rise as respondents' income increases. In addition, we would predict that men spend more time using computers than women. People who have ever married may have children or a family and need to work for income. Nowadays, people also use computers to teach their children. Younger people are more interested than older people in games, the internet, chatting, and study, so younger people tend to use computers more than older people.

Table 1: Logit Model of Computer Usage

Variable        Coefficient     Std. Err.
Female01         0.37 ***       0.08
Evermarri01      0.41 ***       0.10
Age             -0.04 ***       0.00
Black01         -0.56 ***       0.09
EDUC             0.39 ***       0.02
RINCOME          0.08 ***       0.01
LR chi2(6)      867.42
Pseudo R2       0.1852
AIC             3830.20
BIC             3874.45

Here, * = (p <= 0.05), ** = (p <= 0.01), *** = (p <= 0.001)

According to the results, all the variables contribute significantly to our model. The variables female01, evermarri01, EDUC, and RINCOME have a positive effect on whether a person uses a computer, while age and black01 have a negative effect. So, our final model becomes:

\[ \log\left(\frac{p}{1-p}\right) = -3.49 + 0.37\,\text{female01} + 0.41\,\text{evermarri01} - 0.04\,\text{age} - 0.56\,\text{black01} + 0.39\,\text{EDUC} + 0.08\,\text{RINCOME} \]

All of the following effects hold the other independent variables constant. The coefficient of female01 tells us that going from male to female raises the log-odds of usecomputer01 by 0.37. The coefficient of evermarri01 tells us that going from never married to ever married raises the log-odds by 0.41. The age coefficient tells us that each additional year of age lowers the log-odds by 0.04. The coefficient of black01 tells us that being black or other instead of white lowers the log-odds by 0.56. Each additional level of education raises the log-odds by 0.39. Lastly, a one-unit increase in respondent's income raises the log-odds by 0.08.

To check for collinearity among the explanatory variables, we use the Variance Inflation Factor (VIF) test [3]. The test indicates no multicollinearity problem (mean VIF = 1.12), so I conclude that collinearity is not an issue and that this set of variables fits the model best.

Lastly, we checked for heteroscedasticity. With heteroscedastic errors the usual standard errors are unreliable, so test statistics can be misleading. Before correcting for this problem, we need to know whether it exists. The results show that our model has a heteroscedasticity problem, so we re-estimate it with robust standard errors.

Table 2: Logit Model of Computer Usage (Robust Standard Errors)

Variable        Coefficient     Robust Std. Err.
Female01         0.37 ***       0.08
Evermarri01      0.41 ***       0.10
Age             -0.04 ***       0.00
Black01         -0.56 ***       0.09
EDUC             0.39 ***       0.02
RINCOME          0.08 ***       0.01
LR chi2(6)      867.42
Pseudo R2       0.1852
AIC             3830.20
BIC             3874.45

Here, * = (p <= 0.05), ** = (p <= 0.01), *** = (p <= 0.001)

[3] William Crown, Statistical Models for the Social and Behavioral Sciences, chapter 5, p. 75.

2.4 Conclusion

In summary, we present some exemplars in the following table.

Table 3: Exemplars Using the Logit Model of Computer Usage

Sex      Marital status   Age     Race    Education   Respondent income   Probability of using computer
Female   Ever married     60 yr   Black   13 years    $9 thousand         53.7%
Female   Ever married     25 yr   Black   15 years    $12 thousand        93.9%
Female   Never married    65 yr   White    9 years    $6 thousand         15.3%
Male     Ever married     60 yr   Black   13 years    $10 thousand        44.6%
Male     Never married    25 yr   White   16 years    $6 thousand         90.5%
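The exemplar probabilities follow from the fitted logit equation by applying the inverse logit to the linear predictor. The sketch below is an illustration only: it uses the rounded coefficients printed in the equation in Section 2.3, assumes RINCOME is entered in thousands of dollars as in the exemplar table, and uses a hypothetical function name. Because the coefficients are rounded to two decimals, the results match the reported percentages only approximately.

```python
import math

# Rounded coefficients from the fitted logit equation in Section 2.3.
COEF = {
    "const": -3.49,
    "female01": 0.37,
    "evermarri01": 0.41,
    "age": -0.04,
    "black01": -0.56,
    "EDUC": 0.39,
    "RINCOME": 0.08,
}

def prob_use_computer(female, ever_married, age, black, educ, rincome):
    """Predicted probability of computer use from the logit linear predictor."""
    z = (COEF["const"]
         + COEF["female01"] * female
         + COEF["evermarri01"] * ever_married
         + COEF["age"] * age
         + COEF["black01"] * black
         + COEF["EDUC"] * educ
         + COEF["RINCOME"] * rincome)
    return 1.0 / (1.0 + math.exp(-z))  # inverse logit

# (female, ever married, age, black, years of education, income in $ thousands)
exemplars = [
    (1, 1, 60, 1, 13, 9),
    (1, 1, 25, 1, 15, 12),
    (1, 0, 65, 0, 9, 6),
    (0, 1, 60, 1, 13, 10),
    (0, 0, 25, 0, 16, 6),
]

probs = [prob_use_computer(*row) for row in exemplars]
for row, p in zip(exemplars, probs):
    print(row, f"{100 * p:.1f}%")
```

Each computed probability lands within about one to two percentage points of the corresponding exemplar value, the difference being due to coefficient rounding.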

Here, we can conclude that a younger black married female with more education and higher earnings has a higher probability of using a computer (93.9%) than an older black married female with less education and lower income (53.7%). As we expected, an older white unmarried female whose education and income are below the mean has a low probability of using a computer (15.3%). On the other hand, a younger unmarried white male with more education but low earnings has a higher probability of using a computer (90.5%) than an older married black male with average education and high earnings (44.6%).

3. Model 3: Determinants of TV Hours

3.1 Preface

In this study we focus on how factors such as education level, gender, marital status, age, race, and respondent income affect the number of hours a day that individuals watch television.

3.2 Data Overview

We were provided with a data set from the 2004 General Social Survey (GSS) covering 51020 respondents. The data set lets us examine how TVHOURS relates to sex, marital status, age, race, highest year of school completed, and respondent's income.

The number of hours a day that individuals watch television (TVHOURS) ranges from 0 to 24 hours, with a mean of 3 hours and a standard deviation of 2.3 hours. The variable female01 shows that 56% of the respondents are female and 44% are male. The variable evermarried01 shows that 80.25% of the respondents have ever married and 19.75% have never married. Respondents' ages range from 16 to 121 years, with a mean of 59.5 years and a standard deviation of 20 years. The variable black01 shows that 18.14% of the respondents are black or other and 81.86% are white. The highest year of school completed ranges from 0 to 20 years, with a mean of 13 years and a standard deviation of 3 years. Respondents' income ranges from $1,000 to $13,000, with a mean of $9,172 and a standard deviation of $3,447.

Before proceeding with the model, we check whether the relationships among the variables are significant. TVHOURS, age, education, and respondent's income are continuous, so we use pairwise correlations to describe the relationships among them. There is a significant positive relationship between TVHOURS and age (r = 0.0824; p <= 0.001). There is a significant negative correlation between TVHOURS and education (r = -0.2314; p <= 0.001). There is also a significant negative correlation between TVHOURS and respondent's income (r = -0.1634; p <= 0.001).

For the relationships between TVHOURS and female01, evermarried01, and black01, we use t-tests, because TVHOURS is continuous and the other three variables are categorical. Female respondents spend more hours watching TV than male respondents (t = -11.29, df = 29146.9, p <= 0.001). Respondents who have ever married report slightly less time watching television than those who have never married, but a t-statistic this small does not indicate a significant difference (t = 0.44, df = 8714.07). Respondents who are black or other spend more time watching television than white respondents (t = -20.24, df = 6439.09, p <= 0.001).

3.3 The Model

A count model is appropriate for these data because the dependent variable is a discrete, non-negative count (hours per day watching television) and is not normally distributed. The model is given as:

TVHOURS = f (female01, evermarried01, age, black01, education and RINCOME)

Television news and advertisements could serve those purposes. Historically, females have tended to watch television for drama and entertainment, so we would expect them to spend more time watching television per day. People who have ever married may have children or a family and need to work for income, so we expect them to spend less time watching television per day. Younger people are more interested than older people in games, the internet, chatting, and study, so hours of television per day should rise with age. Earning a higher income generally requires working more hours, so an increase in income should reduce the time spent watching television per day.

Table 1 summarizes four regression models: Poisson, Negative Binomial Regression (NBREG), Zero Inflated Poisson (ZIP), and Zero Inflated Negative Binomial Regression (ZINB). Among these four, the Negative Binomial Regression model has the lowest AIC and BIC, 72485.22 and 72548.25 respectively. (The gap between AIC and BIC depends only on the number of estimated parameters and the sample size, so it is not informative about fit.) We therefore expect the NBREG model to describe our dependent variable TVHOURS more accurately than the other models, and we continue with this model.
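The selection rule described above, choosing the model with the lowest information criteria, can be sketched as follows. The AIC and BIC values are the ones reported for the four candidate models; the dictionary layout and variable names are illustrative assumptions.

```python
# (AIC, BIC) for each candidate count model, as reported in Table 1.
fit = {
    "Poisson": (73034.08, 73089.23),
    "NBREG": (72485.22, 72548.25),
    "ZIP": (73010.08, 73120.38),
    "ZINB": (72493.22, 72587.77),
}

# Lower values indicate a better trade-off between fit and complexity.
best_by_aic = min(fit, key=lambda name: fit[name][0])
best_by_bic = min(fit, key=lambda name: fit[name][1])

for name, (aic, bic) in sorted(fit.items(), key=lambda item: item[1][0]):
    print(f"{name:8s} AIC = {aic:.2f}  BIC = {bic:.2f}")
print("best by AIC:", best_by_aic)
print("best by BIC:", best_by_bic)
```

Both criteria point to the same model here, so the choice does not depend on which of the two is preferred.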

Table 1: Comparison of Different Types of Models

Variable        Poisson       NBREG         ZIP           ZINB
                Coef.         Coef.         Coef.         Coef.
female01        -0.03 ***     -0.03 **      -0.03 ***     -0.03 **
evermarri~01    -0.01         -0.01         -0.01         -0.01 **
age              0.00 ***      0.00 ***      0.00 ***     -0.002 ***
black01          0.24 ***      0.24 ***      0.24 ***      0.23 ***
EDUC            -0.04 ***     -0.04 ***     -0.04 ***     -0.04 ***
RINCOME         -0.02 ***     -0.02 ***     -0.03 ***     -0.02 ***
INFLATE
female01         -             -            -0.35         -0.62
evermarri~01     -             -             0.79         13.60
age              -             -             0.00         -0.02
black01          -             -            12.42         14.34
EDUC             -             -            -0.22          0.62
RINCOME          -             -            -0.26        -27.88
lnalpha          -            -2.55          -            -2.56
alpha            -             0.08          -             0.07
AIC             73034.08      72485.22      73010.08      72493.22
BIC             73089.23      72548.25      73120.38      72587.77

Here, * = (p <= 0.05), ** = (p <= 0.01), *** = (p <= 0.001)

We now run the model again using the selected Negative Binomial Regression (NBREG). The results are summarized in the following table.

Table 2: Initial Model for TVHOURS (using NBREG)

Variable        Coefficient     Standard Error
female01        -0.03 **        0.0102
evermarri~01    -0.01           0.0128
age              0.00 ***       0.0003
black01          0.24 ***       0.0123
EDUC            -0.04 ***       0.0021
RINCOME         -0.02 ***       0.0015
lnalpha         -2.55
alpha            0.08
LR chi2(7)      1638.81
Pseudo R2        0.02
AIC             72485.22
BIC             72548.25

Here, * = (p <= 0.05), ** = (p <= 0.01), *** = (p <= 0.001)

The table above summarizes the first model with coefficients and significance levels. The variable evermarried01 is not significant (p = 0.339, above 0.05), so we drop it and run the model again. All the other variables contribute significantly to our model.

Table 3: Comparing Models of TVHOURS (using NBREG)

Variable        Model 1       Model 2 (drop evermarried01)
                Coef.         Coef.
female01        -0.03 **      -0.03 ***
evermarri~01    -0.01          -
age              0.00 ***      0.00 ***
black01          0.24 ***      0.24 ***
EDUC            -0.04 ***     -0.04 ***
RINCOME         -0.02 ***     -0.02 ***
lnalpha         -2.55         -2.55
alpha            0.08          0.08
LR chi2(7)      1664.49       1637.90
Pseudo R2        0.02          0.02
AIC             72485.22      72484.14
BIC             72548.25      72539.29

Here, * = (p <= 0.05), ** = (p <= 0.01), *** = (p <= 0.001)

After dropping evermarried01 and running the model again, AIC and BIC fall to 72484.14 and 72539.29, compared with 72485.22 and 72548.25 for Model 1, so Model 2 is the better model. As the table shows, all the variables in Model 2 are significant (p < 0.05), so we need not drop any more variables. Our final model becomes:

TVHOURS = f (female01, age, black01, education and RINCOME)

Because the negative binomial model is estimated on the log scale, each coefficient is a change in the log of expected TV hours per day, which for small coefficients is close to a proportional change. According to the results, females watch about 3% fewer hours of television per day than males, holding the other variables in the model constant (coefficient -0.03; z = -3.27, p <= 0.001). Each additional year of age raises expected hours by about 0.2%, holding the other variables constant (coefficient 0.002; z = 7.50, p <= 0.001). Black respondents have a log count of hours 0.24 higher than white respondents, holding the other variables constant (z = 19.30, p <= 0.001). A one-year increase in schooling lowers expected hours by about 4%, holding the other variables constant (coefficient -0.04; z = -23.66, p <= 0.01). As expected, a one-unit increase in respondent's income lowers expected hours by about 2%, holding the other variables constant (coefficient -0.02; z = -16.72, p <= 0.01).

Because the dispersion parameter, alpha, is significantly greater than zero, the data are over-dispersed and are better estimated with a negative binomial model than with a Poisson model. The large test statistic (chibar2(01) = 551) suggests that the response variable is over-dispersed and is not sufficiently described by the simpler Poisson distribution.
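Negative binomial coefficients are estimated on the log-count scale, so one common way to read them is to exponentiate them into incidence rate ratios. The sketch below assumes the usual log link for this kind of model and uses the rounded coefficients of the final model; the dictionary names are hypothetical.

```python
import math

# Rounded coefficients of the final negative binomial model (log-count scale).
coef = {
    "female01": -0.03,
    "age": 0.002,
    "black01": 0.24,
    "EDUC": -0.04,
    "RINCOME": -0.02,
}

# exp(b) is the multiplicative change in expected TV hours per one-unit increase.
irr = {name: math.exp(b) for name, b in coef.items()}
pct_change = {name: 100.0 * (ratio - 1.0) for name, ratio in irr.items()}

for name in coef:
    print(f"{name:9s} coef = {coef[name]:+.3f}  IRR = {irr[name]:.3f}  change = {pct_change[name]:+.1f}%")
```

For small coefficients the percentage change is close to 100 times the coefficient, while for a larger one such as black01 the exponentiated value is noticeably bigger than the raw coefficient suggests.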

To check for collinearity among the explanatory variables, we use the Variance Inflation Factor (VIF) test [4]. The test indicates no multicollinearity problem (mean VIF = 1.09), so I conclude that collinearity is not an issue and that this set of variables fits the model best.

Lastly, we checked for heteroscedasticity. With heteroscedastic errors the usual standard errors are unreliable, so test statistics can be misleading. Before correcting for this problem, we need to know whether it exists. The results show that our model may have a heteroscedasticity problem (chi2 = 1640.84; df = 1; p <= 0.001), so we re-estimate it with robust standard errors.

Table 4: The Final Model for TVHOURS Corrected for Heteroscedasticity (using NBREG)

Variable        Coefficient     Robust Standard Error
female01        -0.03 **        0.01
age              0.00 ***       0.00
black01          0.24 ***       0.01
EDUC            -0.04 ***       0.00
RINCOME         -0.02 ***       0.00
lnalpha         -2.55
alpha            0.08
Wald chi2(6)    1295.76
AIC             72484.14
BIC             72539.29

Here, * = (p <= 0.05), ** = (p <= 0.01), *** = (p <= 0.001)

3.4 Conclusion

In summary, we present exemplars using the formulas above. A 30-year-old black woman with 13 years of schooling and the 3rd level of income is predicted to spend less time watching TV. A 30-year-old white man with 15 years of schooling and the 3rd level of income is predicted to spend more time watching TV. Likewise, a 60-year-old black man who has never married, has 12 years of schooling, and has the 1st level of income is predicted to spend more time watching TV.

[4] William Crown, Statistical Models for the Social and Behavioral Sciences, chapter 5, p. 75.