
Tutorial 4

1. The CDI data set (dataE08.txt) provides selected county demographic information (CDI) for 440 of the most populous counties
in the United States. Each line of the data set has an identification number with a county name and state abbreviation
and provides information on 14 variables for a single county. Counties with missing data were deleted from the data set.
The information generally pertains to the years 1990 and 1992. The 17 variables are:

• Identification number: 1-440

• County: County name

• State: Two-letter state abbreviation

• Land area: Land area (square miles)

• Total population: Estimated 1990 population

• Percent of population aged 18-34: Percent of 1990 CDI population aged 18-34

• Percent of population 65 or older: Percent of 1990 CDI population aged 65 years old or older

• Number of active physicians: Number of professionally active nonfederal physicians during 1990

• Number of hospital beds: Total number of beds, cribs, and bassinets during 1990

• Total serious crimes: Total number of serious crimes in 1990, including murder, rape, robbery, aggravated assault, burglary,
larceny-theft, and motor vehicle theft, as reported by law enforcement agencies

• Percent high school graduates: Percent of adult population (persons 25 years old or older) who completed 12 or more
years of school

• Percent bachelor’s degrees: Percent of adult population (persons 25 years old or older) with bachelor’s degrees

• Percent below poverty level: Percent of 1990 CDI population with income below poverty level

• Percent unemployment: Percent of 1990 CDI labor force that is unemployed

• Per capita income: Per capita income of 1990 CDI population (dollars)

• Total personal income: Total personal income of 1990 CDI population (in millions of dollars)

• Geographic region: Geographic region classification is that used by the U.S. Bureau of the Census, where: 1=NE,
2=NC, 3=S, 4=W

The number of active physicians (Y) is to be regressed against total population (X1), total personal income (X2), and geographic region.

(a) Let D1 = 1 if NE and 0 otherwise, D2 = 1 if NC and 0 otherwise, and D3 = 1 if S and 0 otherwise. Fit a linear
regression model of the number of active physicians (Y) on total population (X1), total personal income (X2),
and D1, D2, D3.
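One possible way to set up part (a) is sketched below in Python with NumPy. The helper names `region_dummies` and `fit_ols` are hypothetical, and the data are synthetic stand-ins because the column layout of dataE08.txt is not specified here. The West region (code 4) acts as the baseline because it has no indicator variable.

```python
import numpy as np

def region_dummies(region):
    """Return columns D1 (NE=1), D2 (NC=2), D3 (S=3); West (4) is the baseline."""
    region = np.asarray(region)
    return np.column_stack([(region == k).astype(float) for k in (1, 2, 3)])

def fit_ols(X, y):
    """Least-squares fit; X must already contain the intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)

# Synthetic stand-in data (NOT the CDI data).
rng = np.random.default_rng(0)
n = 40
region = np.tile([1, 2, 3, 4], n // 4)          # region codes 1..4
x1 = rng.uniform(100, 1000, size=n)             # stand-in for total population
x2 = 20 * x1 + rng.normal(0, 500, size=n)       # stand-in for total personal income
y = 2.0 + 0.5 * x1 + 0.01 * x2 + rng.normal(0, 5, size=n)

D = region_dummies(region)
X = np.column_stack([np.ones(n), x1, x2, D])    # intercept, X1, X2, D1, D2, D3
beta, sse = fit_ols(X, y)
print(beta)
```

With this coding, the coefficient of each indicator is read as the difference in expected Y between that region and the West, holding X1 and X2 fixed.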

(b) Examine whether the effect for the northeastern region on number of active physicians differs from the effect for the
north central region.

(c) Test whether any geographic effects are present; use α = 0.01. State the alternatives, decision rule, and conclusion.
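For part (c), one standard approach is the general linear test, which compares the full model with the reduced model that drops D1, D2, D3. Writing β3, β4, β5 for the coefficients of D1, D2, D3 (labels assumed here, not given in the problem), the alternatives are H0: β3 = β4 = β5 = 0 versus Ha: not all of them are zero. H0 is rejected when F* exceeds the 0.99 quantile of the F distribution with 3 and n − 6 degrees of freedom. The sketch below computes F* on synthetic data; the function names are hypothetical.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def general_linear_f(X_full, X_reduced, y):
    """Return (F*, numerator df, denominator df) for two nested models."""
    n = len(y)
    df_full = n - X_full.shape[1]
    df_reduced = n - X_reduced.shape[1]
    sse_full = sse(X_full, y)
    sse_reduced = sse(X_reduced, y)
    f_stat = ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)
    return f_stat, df_reduced - df_full, df_full

# Synthetic stand-in data (NOT the CDI data), with a built-in effect for region 1.
rng = np.random.default_rng(0)
n = 60
region = np.tile([1, 2, 3, 4], n // 4)
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 10, size=n)
D = np.column_stack([(region == k).astype(float) for k in (1, 2, 3)])
y = 1.0 + 2.0 * x1 + 0.5 * x2 + 5.0 * D[:, 0] + rng.normal(0, 0.1, size=n)

X_reduced = np.column_stack([np.ones(n), x1, x2])   # model without region
X_full = np.column_stack([X_reduced, D])            # model with D1, D2, D3
f_stat, df_num, df_den = general_linear_f(X_full, X_reduced, y)
print(f_stat, df_num, df_den)
```

The critical value can be taken from an F table or, if SciPy is available, from `scipy.stats.f.ppf(0.99, df_num, df_den)`.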

(d) Fit a linear regression model for Y on all the other variables including D1 , D2 and D3 and refine your model.

2. For the number of passengers (AirlinePassengers.dat):

(a) Define dummy variables D1, ..., D11 to denote the month.

(b) Fit the model

$$Y_t = \beta_0 + \beta_1 t + \gamma_1 D_{t,1} + \dots + \gamma_{11} D_{t,11} + \varepsilon_t$$

where t is the month index counted from the first observation (i.e. t = 1, 2, ... over the sequence of observations).
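A minimal sketch of the design matrix for this model follows. It assumes that the first observation falls in January and that December is the baseline month (neither is stated in the problem), and `trend_month_design` is a hypothetical helper name.

```python
import numpy as np

def trend_month_design(n_obs, first_month=1):
    """Columns: intercept, t, D1..D11 (months 1..11; month 12 is the baseline)."""
    t = np.arange(1, n_obs + 1, dtype=float)
    month = (first_month - 1 + np.arange(n_obs)) % 12 + 1   # calendar month 1..12
    D = np.column_stack([(month == k).astype(float) for k in range(1, 12)])
    return np.column_stack([np.ones(n_obs), t, D])

X = trend_month_design(24)   # two years of monthly observations
print(X.shape)
```

Given the observed series `y`, the coefficients could then be estimated with `np.linalg.lstsq(X, y, rcond=None)`.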

(c) Fit another model

$$Y_t = \beta_0 + \beta_1 t + \gamma_1 D_{t,1} + \dots + \gamma_{11} D_{t,11} + \alpha_1 t D_{t,1} + \dots + \alpha_{11} t D_{t,11} + \varepsilon_t$$
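The model in (c) adds a separate trend slope for each month through the products of t with the month dummies. The sketch below builds both design matrices side by side, again assuming a January start and December as the baseline; `seasonal_designs` is a hypothetical helper name. Because the model in (b) is the special case with all α equal to zero, the two models are nested, so a partial F-test on the α coefficients is one possible basis for choosing between them.

```python
import numpy as np

def seasonal_designs(n_obs, first_month=1):
    """Return design matrices for the additive model and the interaction model."""
    t = np.arange(1, n_obs + 1, dtype=float)
    month = (first_month - 1 + np.arange(n_obs)) % 12 + 1
    D = np.column_stack([(month == k).astype(float) for k in range(1, 12)])
    X_b = np.column_stack([np.ones(n_obs), t, D])     # intercept, t, D1..D11
    X_c = np.column_stack([X_b, t[:, None] * D])      # plus t*D1..t*D11
    return X_b, X_c

X_b, X_c = seasonal_designs(36)   # three years of monthly observations
print(X_b.shape, X_c.shape)
```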

(d) Which of the two models above should be used? Interpret the chosen model.

(e) Refine the model you selected

3. For the baseball player’s salary data (BaseballplayerSalary.csv).

(a) Find the best transformation


$$y' = y^{\lambda}, \quad \lambda = 0, 0.1, 0.2, \dots, 1$$

to fit the model

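$$
\begin{aligned}
\text{salary}^{\lambda} = {} & \beta_0 + \beta_1\,\text{atbat} + \beta_2\,\text{hits} + \beta_3\,\text{homer} + \beta_4\,\text{runs} + \beta_5\,\text{rbi} + \beta_6\,\text{walks} + \beta_7\,\text{years} + \beta_8\,\text{atbatc} \\
& + \beta_9\,\text{hitsc} + \beta_{10}\,\text{homerc} + \beta_{11}\,\text{runsc} + \beta_{12}\,\text{rbic} + \beta_{13}\,\text{walksc} + \beta_{14}\,\text{putouts} \\
& + \beta_{15}\,\text{assists} + \beta_{16}\,\text{errors} + \varepsilon
\end{aligned}
$$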

Explain your criterion for selecting the transformation.
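One possible criterion, sketched below, is the Box-Cox approach: rescale each transformed response by the geometric mean of y so that error sums of squares are comparable across λ, then choose the λ with the smallest SSE. This is only one option and is not prescribed by the problem. The sketch assumes the usual convention that λ = 0 denotes the log transformation, uses synthetic data rather than the salary file, and the function names are hypothetical.

```python
import numpy as np

def standardized_power(y, lam):
    """Box-Cox standardized response; lam = 0 is treated as the log transform."""
    gm = np.exp(np.mean(np.log(y)))              # geometric mean of y
    if lam == 0:
        return gm * np.log(y)
    return (y ** lam - 1.0) / (lam * gm ** (lam - 1.0))

def sse_for_lambda(X, y, lam):
    """SSE from regressing the standardized transformed response on X."""
    A = np.column_stack([np.ones(len(y)), X])
    w = standardized_power(y, lam)
    beta, *_ = np.linalg.lstsq(A, w, rcond=None)
    resid = w - A @ beta
    return float(resid @ resid)

# Synthetic stand-in data (NOT the salary data): log(y) is linear in x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 3, size=200)
y = np.exp(1.0 + 2.0 * x + rng.normal(0, 0.1, size=200))

lambdas = [round(0.1 * k, 1) for k in range(11)]   # 0, 0.1, ..., 1
sse_by_lambda = {lam: sse_for_lambda(x, y, lam) for lam in lambdas}
best_lambda = min(sse_by_lambda, key=sse_by_lambda.get)
print(best_lambda)
```

Comparing the raw SSE of fits to y^λ directly would not be meaningful, because the scale of the response changes with λ; the geometric-mean rescaling is what makes the comparison fair.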

(b) Based on your fitted model above, plot the residuals against each of the predictors to see whether you can observe any
dependence between the residuals and the predictors.
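As a numeric complement to the plots, the sketch below illustrates why a residual plot can reveal curvature even though least-squares residuals are uncorrelated with each predictor by construction. It correlates the residuals with the squared centered predictors as a rough screen; this is an illustrative device rather than a formal test, the data are synthetic, and the function names are hypothetical.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals from a least-squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta

def curvature_screen(X, resid):
    """Correlation of the residuals with each squared centered predictor."""
    Xc = X - X.mean(axis=0)
    return np.array([np.corrcoef(resid, Xc[:, j] ** 2)[0, 1]
                     for j in range(X.shape[1])])

# Synthetic data: y is linear in the first predictor, quadratic in the second.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 2))
y = 1.0 + X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=200)

resid = ols_residuals(X, y)
screen = curvature_screen(X, resid)
print(screen)
```

A value far from zero for a predictor suggests that its residual plot would show a curved pattern, which is the kind of dependence a second-order term could absorb.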

(c) Add second-order polynomial terms for all the predictors, fit the model, and refine it.
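A minimal sketch of one reading of this step follows: each predictor is centered and its square is appended, with no cross-product terms. Both the centering and the omission of cross-products are assumptions, not requirements of the problem, and `second_order_design` is a hypothetical helper name. Centering is a common way to reduce the correlation between a predictor and its square.

```python
import numpy as np

def second_order_design(X):
    """Columns: intercept, centered predictors, squared centered predictors."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    return np.column_stack([np.ones(X.shape[0]), Xc, Xc ** 2])

# Tiny illustrative input with two predictors (made-up numbers).
X_small = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 60.0]])
design = second_order_design(X_small)
print(design)
```

Refinement could then proceed by dropping squared terms whose coefficients are not needed, for example with partial F-tests on nested models.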

(d) Do step (b) again for the model fitted in (c)
