
1

STAB22 Data Analysis Project Instructions


New Due Date: Tuesday July 30th
Submit on Quercus by 11:59 p.m.

Purpose:
The objective of this project is to give you the opportunity to use some of the statistical techniques that you
have learned in this course for exploring a real data set.

Submission Format of Data Analysis Report:


Based on your analysis of an open data set retrieved from OECD.Stat, you will work on the questions described on
pages 5 and 6 of this document and give your answers in a template posted on our Quercus page.

You may work individually or in groups of no more than three students. Your group members can be from
different tutorial sections in our class. If you are working in a group, please choose a team name,
and note that you will submit only one report on Quercus (filled-in template and StatCrunch outputs) on or
before the project’s due date. The ways in which your analysis of this data will be assessed are described on
page 7 of this document.

Context of Data:

The Organisation for Economic Co-operation and Development (OECD) gathers a variety of information
regarding OECD countries and the OECD’s partners in order to promote policies that aim to improve the economic
and social well-being of people around the world (http://www.oecd.org/about/).
This agency collects quantitative information on many domains and makes the collected data available for
public use (e.g., by researchers) so that interested individuals can further investigate relationships among a set
of variables. A domain named “Social Protection and Well-being” includes a yearly data collection, the
“Better Life Index”, from OECD countries. This information can be retrieved from: http://stats.oecd.org
2

From the “Better Life Index” (BLI), using the most recent data (published in 2019 but collected in 2018), we will
analyze a quantitative variable named “Life Satisfaction”. Information regarding this variable can be
retrieved from:
http://www.oecd.org/statistics/OECD-Better-Life-Index-definitions-2019.pdf#_ga=2.145820212.1027110605.1559482147-696144184.1473183978
(note that this definitions document is posted on our Quercus page in the module Data Analysis Project).
This variable “considers people's evaluation of their life as a whole. It is a weighted-sum of different
response categories based on people's rates of their current life relative to the best and worst possible lives
for them on a scale from 0 to 10, using the Cantril Ladder (known also as the "Self-Anchoring Striving
Scale")” (BLI, 2019). OECD indicates that they obtained this information based on a certain Poll.
• Let us recap the variables of interest in our data analysis:
1. Mean value of people’s life satisfaction
2. Gender of the respondents, identified as Male or Female
• I recommend that you read about this data here:
http://www.oecdbetterlifeindex.org/#/11111111111
Also, click on “Life Satisfaction” in the right-hand side menu to be directed to another web page:
http://www.oecdbetterlifeindex.org/topics/life-satisfaction/
Scroll down that page; you can click on each country to read about its life satisfaction score.

StatCrunch Activity:
1. Understanding and comparing distributions of life satisfaction scores for males and for females in the
36 OECD countries.
2. Describing the distribution of differences between females’ and males’ life satisfaction scores in the 36
OECD countries.
3. Examining the relationship between males’ and females’ life satisfaction scores in the 36 OECD
countries. We aim to predict males’ life satisfaction scores from females’ life satisfaction scores.

Overview of Steps:
1. Save the following two data files on your computer (e.g., in your My Documents folder):
• Data Set 1_Life Satisfaction_BLI2019.csv
• Data Set 2_Life Satisfaction_BLI2019.csv

2. Open each of the saved files (above). Add your last name or team name to the following variable names:
• In “Data Set 1_Life Satisfaction_BLI2019.csv”, modify the variable name “Life Satisfaction”
• In “Data Set 2_Life Satisfaction_BLI2019.csv”, modify the variable names:
“Life Satisfaction_Female”, “Life Satisfaction_Male”, “Diff Life Satisfaction”
3. Re-save the two files after adding your last name or your team name to the column headings listed
above.

4. Follow the steps below to produce your StatCrunch outputs for the analysis of life satisfaction scores for
males and females. Work on the related questions on pages 5 and 6 of this document.

5. Submit two PDF files on Quercus (in Assignment page: Project):
• Filled-in template: Data Analysis.pdf
• StatCrunch Outputs.pdf
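
If you would rather script the renaming in step 2 than edit the headers by hand, a minimal Python sketch could look like the following. The suffix format (an underscore followed by the team name), the helper name rename_columns, and the sample rows with their Country and Gender columns are assumptions made for illustration; check them against the actual files from Quercus.

```python
import csv
import io


def rename_columns(csv_text, columns, suffix):
    """Return csv_text with `_suffix` appended to each header listed in `columns`."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = [f"{name}_{suffix}" if name in columns else name for name in rows[0]]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)        # modified column headings
    writer.writerows(rows[1:])     # data rows are left untouched
    return out.getvalue()


# Hypothetical two-row file; the real layout may differ.
sample = "Country,Gender,Life Satisfaction\nA,Female,7.1\nA,Male,7.0\n"
renamed = rename_columns(sample, {"Life Satisfaction"}, "TeamName")
print(renamed.splitlines()[0])
```

Only the headers named in `columns` change, so the other headings and all data values stay exactly as downloaded.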
3

StatCrunch Activity #1 Instruction - Use “Data Set 1_Life Satisfaction_BLI2019.csv”:


“Comparing distributions of life satisfaction scores between males and females”

Access StatCrunch from:
• The MyLab & Mastering tab in Quercus, only if you have registered your purchased access code.
OR
• www.statcrunch.com, where you can enter the site with the six-month access that you purchased.

1. From the horizontal menu bar, click on “Open StatCrunch”

2. In the next horizontal menu bar, click on Data > Load > From file > On my computer

3. In the screen that opens, under “Load data from my computer”, click on “Choose File”

• From your computer, find the data file that you saved from Quercus (and modified/re-saved):
“Data Set 1_Life Satisfaction_BLI2019.csv” and select/open it (to be loaded).
• At the bottom of the current StatCrunch screen, click on “Load File”

4. Obtain summary statistics for each distribution of life satisfaction scores for males and for females:

• Click on Stat > Summary Stats > Columns
• Under Select Columns, select: Life Satisfaction
• Under Group by, select: Gender
• Click on “Compute”
• Copy and paste your output into a Word document that you are preparing as your StatCrunch outputs.

5. Obtain a side-by-side boxplot of life satisfaction scores by gender:

• Click on Graph > Boxplot
• Under Select Columns, select: Life Satisfaction
• Under Group by, select: Gender
• Click on “Compute”
• Copy and paste your side-by-side boxplots into a Word document that you are preparing as your
StatCrunch outputs.
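
To see what a grouped summary like the one in step 4 computes, here is a small Python sketch on made-up scores (not the OECD data). The helper name summarize_by_group is hypothetical, and the quartiles use the default convention of Python's statistics.quantiles, which may differ slightly from the quartiles that StatCrunch reports.

```python
import statistics
from collections import defaultdict

# Hypothetical (gender, score) pairs; replace with the real data set.
rows = [("Female", 6.0), ("Female", 7.0), ("Female", 8.0), ("Female", 7.0),
        ("Male", 5.5), ("Male", 6.5), ("Male", 7.5), ("Male", 6.5)]


def summarize_by_group(rows):
    """Group scores by label and compute basic summary statistics per group."""
    groups = defaultdict(list)
    for label, score in rows:
        groups[label].append(score)
    summary = {}
    for label, scores in groups.items():
        q1, median, q3 = statistics.quantiles(scores, n=4)
        summary[label] = {
            "n": len(scores),
            "mean": statistics.mean(scores),
            "std": statistics.stdev(scores),   # sample standard deviation
            "min": min(scores),
            "Q1": q1,
            "median": median,
            "Q3": q3,
            "max": max(scores),
        }
    return summary


summary = summarize_by_group(rows)
for label, stats in summary.items():
    print(label, stats)
```

The five-number summary in each group (min, Q1, median, Q3, max) is what each box in a side-by-side boxplot displays.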

StatCrunch Activity #2 Instruction - Use “Data Set 2_Life Satisfaction_BLI2019.csv”:


“Describing the distribution of differences between females’ and males’ life satisfaction scores”

1. From the horizontal menu bar, click on “Open StatCrunch”

2. In the next horizontal menu bar, click on Data > Load > From file > On my computer

3. In the screen that opens, under “Load data from my computer”, click on “Choose File”

• From your computer, find the data file that you saved from Quercus (and modified/re-saved):
“Data Set 2_Life Satisfaction_BLI2019.csv” and select/open it (to be loaded).
• At the bottom of the current StatCrunch screen, click on “Load File”
4

4. Obtain summary statistics for the distribution of differences in females’ and males’ life satisfaction scores:

• Click on Stat > Summary Stats > Columns
• Under Select Columns, select: Diff Life Satisfaction
• Click on “Compute”
• Copy and paste your output into a Word document that you are preparing as your StatCrunch outputs.

5. Obtain a boxplot of the distribution of differences between females’ and males’ life satisfaction scores:

• Click on Graph > Boxplot
• Under Select Columns, select: Diff Life Satisfaction
• Under Options, select the following two options:
“Use fences to identify outliers”
“Draw boxes horizontally”
• Click on “Compute”
• Copy & paste your boxplot into a Word document that you are preparing as your StatCrunch outputs.

StatCrunch Activity #3 Instruction - Use “Data Set 2_Life Satisfaction_BLI2019.csv”:


“Examining the relationship between males’ and females’ life satisfaction scores”

1. Obtain a scatterplot of the data.

• Click on Graph > Regression > Scatter Plot
• For the X-variable, select: Life Satisfaction_Female
• For the Y-variable, select: Life Satisfaction_Male
• Click on “Compute”.
• Copy & paste your plot into a Word document that you are preparing as your StatCrunch outputs.

2. Conduct a regression analysis: Predict males’ life satisfaction scores from females’ scores.

• Click on Stat > Regression > Simple Linear
• For the X-variable, select: Life Satisfaction_Female
• For the Y-variable, select: Life Satisfaction_Male
• Under “Graph”, select the following four graphs by holding the control key on your keyboard:
“Fitted line plot”; “Histogram of residuals”; “QQ plot of residuals”; “Residuals vs. X-values”
• Under “Save”, select “Residuals”
• Click on “Compute”.
• Copy & paste your outputs into a Word document that you are preparing as your StatCrunch outputs.

3. Obtain summary statistics for the distribution of residuals.

• Click on Stat > Summary Stats > Columns
• Under Select column(s), choose: “Residuals”
• Under Statistics, select the following statistics by holding the control key on your keyboard:
“n, Mean, Min, Max, Sum”
• Click on “Compute”.
• Copy & paste your output into a Word document that you are preparing as your StatCrunch outputs.
5

4. Obtain a boxplot for the distribution of residuals.

• Click on Graph > Boxplot
• Under Select column(s), choose: “Residuals”
• Under Options, select: “Draw boxes horizontally”
• Click on “Compute”.
• Copy & paste your plot into a Word document that you are preparing as your StatCrunch outputs.
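
As a rough illustration of what the Simple Linear regression and the saved residuals in Activity #3 represent, the sketch below fits a least-squares line to four made-up score pairs (not the OECD data) and summarizes the residuals. The helper name fit_line and the numbers are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for predicting y from x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical scores: x = females, y = males.
female = [5.0, 6.0, 7.0, 8.0]
male = [5.2, 5.8, 7.1, 7.9]

intercept, slope = fit_line(female, male)

# Residual = observed value minus the value predicted by the line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(female, male)]

print("n   =", len(residuals))
print("min =", min(residuals))
print("max =", max(residuals))
print("sum =", sum(residuals))  # zero up to floating-point rounding
```

For a least-squares line with an intercept, the residuals always sum to zero (apart from rounding), which is worth keeping in mind when reading the Mean and Sum in the residual summary.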

Related Questions

Part 1. Identify the Elements of Statistics and Method of Data Collection.

1. Who are the cases in this study?
2. Identify the population of interest in the context of this study.
3. Identify the sample in the context of this study.
4. Identify the population parameter(s) of interest in the context of this study.
5. What is/are the variable(s) of interest in this study? Identify their type and their scale of measurement.
6. Think about the purpose of this study. Why was this study conducted?
7. Where was this study conducted?
8. When was the study conducted?
9. How was the data for this study collected? Hint: Read the web-page on OECD community:
http://www.oecdbetterlifeindex.org/topics/community/ to find the answer to this question.

Part 2. Compare life satisfaction scores between males and females.


1. Suppose that the researchers are interested in investigating the relationship between life satisfaction scores
and the gender of the respondents in the 36 OECD countries. Identify the response variable and the
explanatory variable in the context of this study.
2. Use the side-by-side boxplots and the summary statistics to compare distributions of life satisfaction
scores of males and females in the OECD countries. That is, compare the shapes, centres, and spreads of
both distributions and note/identify any outliers.
3. Use the boxplot and summary statistics for the differences between females’ and males’ life satisfaction
scores (in each country) to describe what is apparent in this plot that is not apparent in the other boxplots
(the side-by-side boxplots of life satisfaction scores by gender). Describe the shape, centre, and spread of
this distribution. Indicate which countries are suspect outliers (plotted individually on the boxplot) and
what makes them unusual. That is, use the 1.5IQR rule to determine whether the outlying points are
suspect outliers. Also, find the number of standard deviations that the potential outlier(s) is/are away from
the overall mean of this distribution. Discuss why this graph (boxplot of differences) is more useful for
learning about differences between males and females in the OECD countries.
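
The 1.5IQR rule and the "number of standard deviations from the mean" mentioned in question 3 can be sketched in Python as follows. The country labels and differences are made up for illustration, the helper names fences and flag_outliers are hypothetical, and the quartile convention of statistics.quantiles may not match StatCrunch exactly, so values close to a fence could be classified differently.

```python
import statistics


def fences(values):
    """Lower and upper fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def flag_outliers(labelled):
    """Map each value outside the fences to its distance from the mean in SDs."""
    values = list(labelled.values())
    low, high = fences(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return {name: (v - mean) / sd
            for name, v in labelled.items() if v < low or v > high}


# Hypothetical female-minus-male differences for eight unnamed countries.
diffs = {"A": 0.0, "B": 0.1, "C": -0.1, "D": 0.2,
         "E": -0.2, "F": 0.1, "G": 0.0, "H": 1.5}

low, high = fences(list(diffs.values()))
flagged = flag_outliers(diffs)
print("fences:", low, high)
print("suspect outliers and their z-scores:", flagged)
```

Here only the value labelled H falls outside the fences; its z-score tells you how many standard deviations it lies above the mean of the differences.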
6

Part 3. Predict males’ life satisfaction scores from females’ life satisfaction scores.

1. Use the scatterplot of males’ life satisfaction scores versus females’ life satisfaction scores to describe
the relationship.
2. What is the estimated correlation coefficient? Interpret this value.
3. If we examined only those countries with life satisfaction scores above 7 for both genders, what
would happen to the correlation? Discuss why that would happen.
4. Fit a linear regression model relating males’ life satisfaction scores to females’ life satisfaction scores.
That is, fit a straight line for predicting males’ life satisfaction scores from females’ life satisfaction scores.
What is the equation of the regression line?
5. What does the regression line tell us in the context of this study?
6. What does the slope of the regression line mean in the context of this study?

7. Note that the slope of the line does not differ much from 1.00. What would a slope of 1.00 indicate about
the nature of the relationship? If we fitted a model with the slope fixed at 1.00, what prediction equation
would you expect to get? (Hint: Refer to the summary statistics for males and females. Find the
mean life satisfaction scores for males and for females to answer this question.)
8. Can we interpret the value of the y-intercept in the regression equation at all? Justify your answer.
9. What is the standard deviation of the residuals? Interpret this value in the context of this problem.
10. Use the plots of residuals to assess the overall adequacy of the linear regression model fit to this data. State
the assumption(s) about the residuals that each of the constructed plots checks and determine whether the
assumption(s) is/are met.
11. In which country or countries do the male respondents have “somewhat unusually” low life satisfaction
scores in relation to the female respondents, according to the regression model? Moreover, in which
country or countries do the female respondents have “somewhat unusually” low life satisfaction scores in
relation to the male respondents, according to the regression model? Give the residual(s) to support and
justify your argument.
12. Give and interpret the R² value in the context of this study.
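
For reference, the quantities asked about in questions 2, 9, and 12 are related as in the sketch below, again on made-up scores rather than the OECD data. The helper name regression_summary is hypothetical, and the standard deviation of residuals is computed with the common n - 2 divisor for simple linear regression, which is an assumption about the convention used; compare it with the value StatCrunch reports.

```python
import math


def regression_summary(x, y):
    """Correlation r, R-squared, and residual standard deviation for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    sse = syy - sxy ** 2 / sxx          # sum of squared residuals
    s_e = math.sqrt(sse / (n - 2))      # assumed n - 2 convention
    return {"r": r, "r_squared": r ** 2, "s_e": s_e}


# Hypothetical scores: x = females, y = males.
female = [5.0, 6.0, 7.0, 8.0]
male = [5.2, 5.8, 7.1, 7.9]

result = regression_summary(female, male)
print(result)
```

In simple linear regression R² equals the square of r, so the two values should agree in your StatCrunch output as well.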
7

Assessment of Data Analysis Project


Last Name of Student
1.
2.
3.

Part 1, Question:                                         Point(s)   Point(s) Received
1: Identify the cases in this study                       1
2: Identify population of interest in this study          1
3: Identify the sample in this study                      1
4: Identify population parameter(s) of interest           1
5: Identify variable(s) of interest in this study         2
6: Identify the purpose                                   1
7: Location (where) of this study                         1
8: Time (when) of this study                              1
9: Data collection (how)                                  1
Total                                                     10

Part 2, Question:                                         Point(s)   Point(s) Received
1: Identify the response and explanatory variables        2
2: Interpretation of side-by-side boxplots                4
3: Interpretation of boxplot of differences               4
Total                                                     10

Part 3, Question:                                         Point(s)   Point(s) Received
1: Interpretation of scatterplot                          1
2: Interpretation of correlation coefficient, r           1
3: Interpretation of restricting the range                1
4: Identifying the equation of the regression line        1
5: Interpretation of the regression equation              1
6: Interpretation of the estimated slope                  1
7: Realization of fixing slope at 1                       2
8: Interpretation of the estimated y-intercept            1
9: Interpretation of the standard deviation of residuals  1
10: Diagnostic check of residuals using plots             2
11: Detection of unusual residual values                  2
12: Interpretation of the value of R-squared              1
Total                                                     15

Submit StatCrunch outputs:
necessary modifications made to variable names            15

Total Points                                              50

Marked by TA:
Comment (if any):
