
UNSW Medicine

Phase 1, 2nd year, Semester 1


Research Skills Instructions
Authors:
Dr Rachel Thompson, Matt Begun, Suzanne Mobbs
Quality of Medical Practice Element

Acknowledgements:
Prof Deborah Black, Med program convenors, students and QMP tutors
IBM SPSS Statistics Version 24

Research Skills Instructions, March 2017 Page 1


Contents
Introduction to the Research Skills activities in Society and Health
Research Scenario
List of Dataset Variables
Activity 1: Introduction to the data, cleaning and creating new variables (30 mins)
Step 1. Opening the sample dataset
Step 2. The Output document
Step 3. Save your dataset and output now
Step 4. Understanding the data and variables in your dataset
Step 5. Labelling and altering variables
Step 6. Cleaning the Data
Step 7. Creating new variables
Appendix - Activity 1 QUESTION ANSWERS
Activity 2: Basic descriptive statistics (15 minutes)
Step 1. Running basic frequencies
Step 2: To select cases
Step 3. Split the file
Activity 3: Examining Continuous/Integer Data for Normality (10 mins)
Step 1. Creating Histograms
Step 2. Checking distributions
Appendix - Activity 3 QUESTION ANSWERS
Activity 4: Examining categorical data (15 mins)
Step 1. Frequency tables for categorical data
Activity 5: Carrying out a Chi-square test (20 minutes)
Step 1. Creating a cross-tabulation
More on chi square test
Activity 6: Carrying out a t-test (15 minutes)
Step 1. Which t-test to use?
Step 2. Carrying out the test
Activity 7: Correlation and Regression (15 minutes)
Step 1. Plotting the data
Step 2. Further analysis and interpretation
Appendix - Activity 7 ANSWERS
Instructions for the Research Skills Formative Analysis Submission



Introduction to the Research Skills activities in Society and Health
Aims
During the three Research Skills practical activities in the Phase 1 Society & Health course you will:

gain confidence and experience in using SPSS for data handling and manipulation.
learn about data collection and data variables.
learn how to examine data effectively using basic statistical techniques.
gain skills in basic statistical analysis and interpretation in SPSS.
experience research teamwork.

The dataset
This is an SPSS dataset that we have simulated to reflect the demographics and risk factors for disease found in a
national survey of Indonesia, the Indonesia Demographic and Health Survey 2012 (footnote 1). Because the data are
simulated, there are no issues with multiple students handling data across multiple devices and in multiple locations.
In addition, we have been able to narrow down the many variables and outcomes to make this a manageable and
profitable learning experience.

Group work
From the first Research Skills practical in Society & Health you will need to work in your SH project groups. These will
also be your project groups in BGDB.

There will be 3 groups per scenario group, with each taking one of the 3 focus topics, so that all 3 topics are
represented in each scenario group:

1. IMMUNISATION: childhood immunisation status
2. ACUTE RESPIRATORY INFECTION (ARI): childhood acute respiratory infection
3. DIARRHOEA: childhood diarrhoeal infection

Assessment
Each group will be expected to submit a formative assessment via your SH course Moodle.
This is due directly following your group's third and final Research Skills practical in week 6 of SH.
This submission is based on the work done during these three practical sessions.
Your submission will form the basis of your BGDB child health group project with formative feedback being
provided by QMP tutors by week 1 of the BGDB course.

Key References

UCLA Academic Technical Services. Introduction to SPSS: Analysing Data examples:


http://stats.idre.ucla.edu/other/dae/

SPSS Statistics Essential Training from Lynda.com (requires login)


https://www.lynda.com/SPSS-tutorials/SPSS-Statistics-Essential-Training/182376-2.html?org=unsw.edu

For other references, see QMP Moodle, Resources for Research area.

1. Statistics Indonesia (Badan Pusat Statistik, BPS), National Population and Family Planning Board (BKKBN),
Kementerian Kesehatan (Kemenkes, MOH), and ICF International. 2013. Indonesia Demographic and Health Survey 2012.
Jakarta, Indonesia: BPS, BKKBN, Kemenkes, and ICF International. Retrieved from:
http://dhsprogram.com/publications/publication-fr275-dhs-final-reports.cfm
Research Scenario
Your Research Group
Your project group is made up of public health researchers who work for a local government unit on a semi-rural
island of Indonesia. Its population is only 200,000. This is a very small local government unit compared to the largest
local government unit size of ~4 million people, and it is below the median size of these units across the nation
(median: ~280,000 people). The smallest unit has only ~12,000 people.

The Island
The island is quite poor but has reasonable school attendance and a good main hospital. However, there is some
very remote and rugged terrain on the island, with many small villages in the hills away from the sea.

What is your group's role in the local government organisation?

Your role is to assess local health in order to advise on the provision and planning of health care resources for the
local population.

You received this data from the group who published the Indonesia Demographic and Health Survey 2012 (footnote 2).

The dataset consists of all 1032 local households who took part in the survey, who had at least one child living at
home with them. (Note, limits on parent ages were imposed as per the original survey which was looking specifically
into reproductive and family topics).

What are you going to do with this data?


You've been asked by your local government health supervisor to analyse the dataset in order to find out more
about your theme (as you realise this is important for the health of the local children). You have a free rein, so you
could look broadly at what major factors are influencing the status of your theme in the household children, or you
could look more specifically at the impact of a particular risk/protective factor, or even test out a hypothesis.

You hope to write this up as a journal article and have set your sights on the respected, international Taylor &
Francis publication: Paediatrics and International Child Health (formerly the Annals of Tropical Paediatrics), see:
(http://www.tandfonline.com/loi/ypch20).

Important notes re the data


Some of the data has been transformed for you by the original survey team (e.g. water quality and sanitation
variables, etc). You should check out the original survey info and data to see what was originally collected by the
researchers and how it has been altered for your data.

Also, the survey team were able to collect extra information for you about the eldest child in each family (up to age
18 years). So you have data regarding the eldest child's 1 year immunisation status and also their status regarding
diarrhoeal disease and acute respiratory infection (ARI) (i.e. not just for 1 year olds/under 5 year olds respectively as
in the original survey).

In addition, the status of the rest of the children in the family for these conditions is included. For diarrhoea and ARI
status this is provided as YES/NO variables, with YES indicating that at least one other child in the family meets the
conditions for the disease. Another variable indicates the highest immunisation status among any other children in
the family.
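The logic behind these "other children" summary variables can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the survey team's actual procedure: the per-child records are made up, the function name summarise_other_children is hypothetical, and returning a missing value when there are no other children is an assumption.

```python
# Hypothetical per-child records for ONE household (made-up values, not from the dataset).
# Immunisation codes follow the variable list: 1 = none ... 4 = all basic immunisations.
other_children = [
    {"diarrhoea": 0, "ari": 0, "immunisation": 2},
    {"diarrhoea": 1, "ari": 0, "immunisation": 4},
]

def summarise_other_children(children):
    """Collapse per-child records into the three ChildOther.* household variables."""
    if not children:
        # Assumption: with no other children there is nothing to summarise, so all are missing.
        return {"ChildOther.Diarrhoea": None, "ChildOther.ARI": None,
                "ChildOther.Immunisation": None}
    return {
        # YES (1) if at least one other child meets the condition, otherwise NO (0).
        "ChildOther.Diarrhoea": int(any(c["diarrhoea"] == 1 for c in children)),
        "ChildOther.ARI": int(any(c["ari"] == 1 for c in children)),
        # Highest immunisation status among the other children.
        "ChildOther.Immunisation": max(c["immunisation"] for c in children),
    }

summary = summarise_other_children(other_children)
print(summary)
```

Note that a YES here tells you only that at least one other child was affected, not how many.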
>>> See the variable list on the next page.

2. Statistics Indonesia (Badan Pusat Statistik, BPS), National Population and Family Planning Board (BKKBN),
Kementerian Kesehatan (Kemenkes, MOH), and ICF International. 2013. Indonesia Demographic and Health Survey 2012.
Jakarta, Indonesia: BPS, BKKBN, Kemenkes, and ICF International. Retrieved from:
http://dhsprogram.com/publications/publication-fr275-dhs-final-reports.cfm
List of Dataset Variables
Note: This simulated dataset is based on data published in this report: Statistics Indonesia (Badan Pusat Statistik, BPS),
National Population and Family Planning Board (BKKBN), Kementerian Kesehatan (Kemenkes, MOH), and ICF
International. 2013. Indonesia Demographic and Health Survey 2012. Jakarta, Indonesia: BPS, BKKBN, Kemenkes, and ICF
International. Retrieved from: http://dhsprogram.com/publications/publication-fr275-dhs-final-reports.cfm

Variable | Explanation | Values
ID | Household identification no. (unique) | 9 digit number

Head of Household Variables

Variable | Explanation | Values
HH.Sex | Sex of head of household | 1 = Male; 2 = Female
HH.Age | Current age (years) | Range 18-55 years
HH.Marital.Status | Current marital status | 0 = Never Married; 1 = Married; 2 = Living together; 3 = Divorced; 4 = Widowed
HH.Education | Highest level of education of head of household | 1 = No education; 2 = Some primary; 3 = Complete primary; 4 = Some secondary; 5 = Complete secondary; 6 = More than secondary
HH.Occupation | Current occupation of head of household | 1 = N/A; 2 = Industrial worker; 3 = Agriculture; 4 = Sales and service; 5 = Clerical; 6 = Professional/Technical/Managerial
HH.Smoke.Cigs | Cigarette smoking status of head of household | 0 = Not smoking cigarettes; 1 = Smoking cigarettes
HH.Number.Cigs | Cigarettes smoked per day by head of household | 1 = 1-2 cigarettes/day; 2 = 3-5 cigarettes/day; 3 = 6-9 cigarettes/day; 4 = 10+ cigarettes/day
HH.Height | Height of head of household (cm) | Range
HH.Weight | Weight of head of household (kg) | Range

Spouse variables

Variable | Explanation | Values
Spouse.Sex | Sex of spouse | 1 = Male; 2 = Female
Spouse.Age | Age of spouse | Range >16 years
Spouse.Education | Highest level of education of spouse | 1 = No education; 2 = Some primary; 3 = Complete primary; 4 = Some secondary; 5 = Complete secondary; 6 = More than secondary
Spouse.Occupation | Current occupation of spouse | 1 = N/A; 2 = Industrial worker; 3 = Agriculture; 4 = Sales and service; 5 = Clerical; 6 = Professional/Technical/Managerial
Spouse.Smoke.Cigs | Cigarette smoking status of spouse | 0 = Not smoking cigarettes; 1 = Smoking cigarettes
Spouse.Number.Cigs | Cigarettes smoked per day by spouse | 1 = 1-2 cigarettes/day; 2 = 3-5 cigarettes/day; 3 = 6-9 cigarettes/day; 4 = 10+ cigarettes/day
Spouse.Height | Height of spouse (cm) | Range
Spouse.Weight | Weight of spouse (kg) | Range

Child Variables

Variable | Explanation | Values
Child1.Age | Child 1's age (years) | Range 0 - 18 years
Child1.Sex | Eldest child's sex | 1 = Male; 2 = Female
Child1.Immunisation | Eldest child's immunisation status at 12 months (for children > 1 year old) | 1 = No immunisation at all; 2 = Partial immunisation; 3 = All basic immunisations excl. Hepatitis; 4 = All basic immunisations
Child1.Diarrhoea | Eldest child had diarrhoea in the two weeks preceding the survey | 0 = No diarrhoea; 1 = Yes diarrhoea
Child1.ARI | Eldest child had been ill with a cough accompanied by short, rapid breathing and difficulty breathing as a result of a problem in the chest, in the two weeks preceding the survey | 0 = No ARI; 1 = Yes ARI
ChildOther.Immunisation | Other children in the family - highest immunisation status at 12 months old | 1 = No immunisation at all; 2 = Partial immunisation; 3 = All basic immunisations excl. Hepatitis; 4 = All basic immunisations
ChildOther.Diarrhoea | Other children in the family - any recent diarrhoea | 0 = No diarrhoea; 1 = Yes diarrhoea
ChildOther.ARI | Other children in the family - any recent ARI | 0 = No ARI; 1 = Yes ARI

Household variables

Variable | Explanation | Values
Residence | Living environment, dependent on the classification of the village/township where the household lives. | 0 = Urban; 1 = Rural
Water.Source | Source of drinking water and whether it is suitable for drinking or not (WHO and UNICEF standards applied). | 0 = Not improved; 1 = Improved
Water.Treatment | Household process for making water safe, if not improved source. | 0 = Not treated; 1 = Treated
Safe.Sanitation | Sanitation system for the household is judged to be improved, i.e. safe if a proper latrine and private (not shared with other households). | 0 = Not safe; 1 = Safe
Soap | Based on observation of availability of soap / detergent in the household for hand washing. | 0 = Not Available; 1 = Available
Cooking.Fuel | Type of cooking fuel used mostly in the household | 0 = LPG/Natural gas; 1 = Wood; 2 = Kerosene; 3 = No food cooked at home; 4 = Electricity; 5 = Charcoal; 6 = Other
Child.Number | Number of total children in the family | Range 1-7



Activity 1: Introduction to the data, cleaning and creating new variables (30 mins)
Step 1. Opening the sample dataset
To open SPSS

On the lab PCs: To open IBM SPSS, go to the START button, select Programs, then IBM SPSS Statistics => IBM SPSS
Statistics 24. SPSS should open, but note that the launch can be slow at times, so just be patient.

With your laptops, unless you have SPSS installed (unlikely), you will need to use myAccess to access a virtual version
of SPSS. You can do this anywhere so long as you have an internet connection. Instructions for setting up myAccess
on your devices are here: https://www.myaccess.unsw.edu.au/. The app that you will need to launch looks like this:

To open the dataset

Open the 2017 Phase 1_SimData.sav dataset. UNSW students please access this via the Research Skills area in
QMP Moodle module/ The Research Process.

This data is mostly cleaned in terms of the data variables that you will use in this class, but you will need to check,
clean and prepare the data a little as you go (the instructions will help you to do this).

Note:
You are going to learn the basics of data input, management and manipulation. When you are doing your ILP/Hons
project, you will be analysing real data and may have to create your own dataset structure and manage this during
your research. To do this well, you will need to check carefully that the data is clean by checking for missing and
invalid data. There are some great YouTube videos from good sources showing how to do this systematically and
more thoroughly than we can manage today.

Once SPSS has launched the dataset in a Data Editor viewer window, take a few minutes to just LOOK at the data in
both the Data and Variable views:

These views are accessible via 2 tabs situated along the bottom of the Data Editor window.
To switch quickly between the two views, you can double-click on the variable name at the top of the
column in Data view or the number to the left of the variable name in the Variable view.
Or use: Shortcuts - Control T (PC) or Command T (Mac)
The long way to change views is to go via the View menu at the top of the viewer.



Data View (may look different on a Mac):

Variable view (on a Mac):



Step 2. The Output document

You might not have noticed yet, but another viewer window opened up when you opened up the dataset. This is an
Output document.

An Output document opens each time that you open a dataset. Note: It gets complicated once you have more than one
dataset open. You may need to check which Output document you want your outputs to go to; essentially, SPSS will
record to whichever one was most recently live at the time.

The Output is very useful as it will record almost everything you do to the dataset, so it becomes a record of your
cleaning, editing, and analysis. It also displays all of your analyses/tests as well as the tables, charts and graphs
that you ask SPSS to produce from the dataset.

On the left-hand side of the Output is a Log record of all your activities on the open dataset(s). You can scroll up and
down to find a particular analysis or process you did and click on it to navigate there in the right-hand window area.

If you right-click on the charts and graphs in the right-hand window, you open an editor window that allows you to
edit these figures, change the axes, labels, colours and so on.

You can also copy the main outputs in this document (using standard shortcuts or using right click and choosing
Copy) and paste these into other documents, such as Microsoft Word. This doesn't seem to be working as well in
v24 of SPSS, so you may have to screenshot tables or use copy/paste special.

You must save this Output document or you will lose all this useful information! See below for best practice saving.

The Output

Yours will look something like this:



SPSS TOP TIPS
1. Save both the dataset and the outputs carefully
2. Don't overwrite an original dataset
3. Keep the original file(s) separate from the one you are working on
4. Name your data files and your outputs carefully and don't lose them!!

Step 3. Save your dataset and output now

Save your dataset and output files carefully with sensible names in an appropriate, accessible but secure place.
SPSS can be a bit fussy about filenames, so don't use any punctuation or special characters except the underscore (_),
and keep names short.
We suggest that you rename your output files with a date and time or version, but be consistent so you can use
these to track what you have done and find your analysis when you need to write it up.
If you make major changes to a dataset you should save it with a different name (just in case of any IT calamities
or mistakes!)
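If it helps to see the naming advice above as a rule, here is a minimal Python sketch that builds a short, dated file name containing only letters, digits and underscores. The function name safe_filename and the date format are our own choices for illustration, not an SPSS requirement.

```python
import re
from datetime import datetime

def safe_filename(stem, extension, when=None):
    """Build a file name from letters, digits and underscores only, plus a date and time."""
    when = when or datetime.now()
    # Replace every run of other characters (spaces, punctuation) with a single underscore.
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", stem).strip("_")
    stamp = when.strftime("%Y%m%d_%H%M")
    return f"{cleaned}_{stamp}.{extension}"

# Example with a fixed date and time so the result is reproducible.
name = safe_filename("Group 3: cleaned data!", "sav", datetime(2017, 3, 6, 14, 30))
print(name)  # Group_3_cleaned_data_20170306_1430.sav
```

Putting the date in year-month-day order means the files sort chronologically in a folder listing.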

N.B. If you get stuck, ask a tutor if you are in a class, or use the SPSS Help, which has a very useful online help
resource. Please post a question on the QMP Moodle discussion board if you can't work it out at home.

Step 4. Understanding the data and variables in your dataset

>> Take a quick look at the PowerPoint on Data and Variables, then answer the following questions:

1. In the Data and Variable Views:

QUESTION: How many cases are there? (ANSWER in ACTIVITY 1 APPENDIX)

>> In Data View scroll down to the bottom of the dataset and see how many cases there are

QUESTION: How many variables are there in total? What do these show? (ANSWER in ACTIVITY 1 APPENDIX)

>> In Variable View scroll down if necessary to see how many variables there are; the far left-hand column shows
you this.

The main variables that we are using today are below. Have a look at them:
Residence
HH.Age
HH.Sex
HH.Marital.Status
HH.Weight
HH.Height

>> Find (in QMP Moodle) and open the document: 2017 SimData Variables list.xlsx. This lists and gives explanations
for each variable and their categories / values. You will need to refer to this and make notes in this as we go through
the cleaning and analysis.



>> In the Variable view, take a look at the different types of Measure that are possible. Do you understand what
these mean?

>> If you need to revise what you know about data types and variables, please see the Data and Variables
PowerPoint show (in the Child Health Project resources area in your QMP Moodle module, under The Research
Process).

>> Click on the Measure cell for the variable Residence. You should see this:

There are 3 types of Measure as shown in this image: Scale, Ordinal and Nominal.

QUESTION: What other properties of the variables can you see in the columns here? What do you think these are
for? (Answer in ACTIVITY 1 APPENDIX)

Step 5. Labelling and altering variables

You can add or change labels for the variables:

>> Click on Data=>Define variable properties. A window opens that allows you to check all of the
properties of each variable separately or all in one go.

>> To change the labels in this window, follow the steps below. There is a quicker way to do this in the Variable
view (useful for small alterations) by just going into the relevant cells. We will show you how to do this later on.



EXAMPLE:

Let's use the Define Variable Properties window to check on the variable Residence:

1. Open the Data=>Define Variable Properties window.
2. In the left hand column window select Residence and take it across to the right hand column by using the
arrow in between. Or, if you prefer, you can just move it across by picking it up and dropping it in the empty
space on the right.
3. Click on 'Continue' and another window appears that will allow you to view and edit this variable, its properties
and its labels.
4. Click on the Label box and type in Residence.
5. Under Values, click on the cell and type in each value. Important: Make sure you get these labels the right
way round or you could misread all your results! In the sample data we are using here: 0 = Urban and 1 =
Rural, so type in these labels for the two groups in this variable.
6. While we are here, let's check the correct Measure label is in place for this variable. You should be able to
see this in the window.

QUESTION: Is Residence a nominal, ordinal or scale variable? SEE ANSWER in ACTIVITY 1. APPENDIX

7. Change this as necessary to the correct Measure.
8. Notice that if you change a variable, the tick box for change on the left will show a tick; this can
help you to keep track of your edits.
9. When you are finished and happy with your changes - you must hit the OK button or SPSS will NOT keep
your changes!
10. Once you have hit OK these changes are kept and the Define Variable Properties window disappears.
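To see why the warning in step 5 matters, here is a small Python sketch of what a value label does: it is simply a lookup from the stored code to the text shown in tables. The five household codes are made up for illustration; only the 0 = Urban, 1 = Rural coding comes from the variable list.

```python
# Value labels for Residence, as given in the variable list: 0 = Urban, 1 = Rural.
RESIDENCE_LABELS = {0: "Urban", 1: "Rural"}

# Hypothetical coded values for five households (made up for illustration).
residence_codes = [0, 1, 1, 0, 1]

labelled = [RESIDENCE_LABELS[code] for code in residence_codes]
print(labelled)

# The same data with the labels typed in the wrong way round.
swapped_labels = {0: "Rural", 1: "Urban"}
mislabelled = [swapped_labels[code] for code in residence_codes]

print(labelled.count("Rural"), "rural households (correct labels)")
print(mislabelled.count("Rural"), "rural households (labels swapped)")
```

The stored data are identical in both cases; only the labels differ, yet every table and chart built from the swapped labels would report the two groups the wrong way round.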



SPSS HOT TIPS on Define Variable Properties window:

1. The Data=>Define variable properties window can help you check whether the right measure
is provided for a variable, plus other properties are correct. You can also label coded data etc.
and view quantitative data.
2. You can look at more than one variable at a time. Hold control/ command or shift (as usual) to
select which variables and take them across from the left hand side and into the right hand
side. Now you can view them one by one!
3. However, if you do this you need to know that once you are finished and happy with your
changes, you must hit the OK button or SPSS will NOT keep your changes! If you hit OK these
changes are kept and can only be undone by re-editing.

>> Try out this process, for all of the variables that you are using today: Residence, HH.Age, HH.Sex,
HH.Marital.Status, HH.Weight, HH.Height. We have left some Labels and Values blank or in need of correcting so
check them carefully

Notes on HH.Marital.Status:

For HH.Marital.Status you will find that there are some invalid data values: 11, 22 and 5. These values
are not present in the code options for this survey variable. Hence they must be typos from the data input
that have not been spotted yet.
You can tidy these up in the Define Variable Properties window by checking the Missing box for these
values.
Another way to deal with this erroneous data is to find the values in the Data View using the Sort function and
make them missing one by one by right-clicking on the erroneous value and then selecting Clear from
the menu that comes up. This is not so easy if there are lots of erroneous values!
There are a few other inconsistencies in the data that you can seek out later if you have time and
are interested. Tip: look at the Spouse.Age data compared to HH.Age and discuss.
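The check described in these notes can be sketched in Python as follows. The column of values is made up for illustration (it simply includes the three typos mentioned above), and the function name flag_invalid is hypothetical; the valid codes 0 to 4 come from the variable list.

```python
VALID_MARITAL_CODES = {0, 1, 2, 3, 4}  # code options from the variable list

# Hypothetical column of HH.Marital.Status values (made up), including the typos 11, 22 and 5.
marital_status = [1, 0, 11, 3, 22, 4, 5, 2]

def flag_invalid(values, valid_codes):
    """Return (cleaned, bad_rows). Invalid codes become None (missing), never a guessed value."""
    cleaned, bad_rows = [], []
    for row, value in enumerate(values, start=1):
        if value in valid_codes:
            cleaned.append(value)
        else:
            cleaned.append(None)
            bad_rows.append((row, value))  # keep a record of what was changed, and where
    return cleaned, bad_rows

cleaned, bad_rows = flag_invalid(marital_status, VALID_MARITAL_CODES)
print(cleaned)
print(bad_rows)
```

Keeping the list of changed rows gives you something concrete to report when you describe your data cleaning.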

>> Once you've finished checking the data and properties, and added in labels as appropriate for all the variables we
are using today, remember to click OK as you close the window. If you don't, your changes will not be saved!

>> Go back to the Variable view if you are in Data view. Take another look at your variables - can you see that the
value labels you have added are visible and accessible now in the Values column? Can you see any other changes
you have made?

>> The quick and easy way to add or change labels is (in Variable view) to highlight the relevant cell for the variable
in the Values column and click on the little square button (with 3 dots on) that appears in the right-hand side of
the cell. Click on this to open a window where you can add, remove or edit your variable value labels.

>> Try out this method to label the Spouse.Smoke.Cigs variable values. You can type in a label for it by activating
the cell and just typing it directly in. Remember also to check that the measurement label for this variable is correct
first (you can click on that cell to change it directly if you think it is incorrect).

Moving variables order in the dataset. You can also move variables up or down the list to display in a different order
in the Data View.



>> Move the Water.Treatment variable up so it is below Water.Source variable. It makes good sense to have it here.

>> To do this, just click on the far left numbered button (5) next to Water.Treatment so that the whole row is
highlighted, then hold down the click and move the button up to just below the button 3.

Check what this looks like in the Data view now.

Step 6. Cleaning the Data

In the Data View, take a moment to look at the data. Can you see any immediate possible errors just by looking at it?

>> Using the sort mode, in "Data=>Sort Cases", arrange the HH.Age variable in ascending order. Look at the data in
this variable column. Are there any odd data entries? The range expected is 18-55 years.
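The same sort-and-check idea can be sketched in Python. The household IDs and ages below are made up for illustration; only the expected range of 18-55 years comes from the variable list. Placing missing ages first in the sorted list is an arbitrary choice made for this sketch.

```python
# Hypothetical (ID, HH.Age) pairs; the values are made up for illustration.
households = [(101, 34), (102, 17), (103, 55), (104, 340), (105, 18), (106, None)]

AGE_MIN, AGE_MAX = 18, 55  # expected range for HH.Age from the variable list

def sort_key(record):
    """Sort by age, ascending, with missing ages placed first (an arbitrary choice here)."""
    age = record[1]
    return (age is not None, age if age is not None else 0)

by_age = sorted(households, key=sort_key)

# Values that are present but outside 18-55 need discussing before anything is changed.
out_of_range = [(hid, age) for hid, age in by_age
                if age is not None and not (AGE_MIN <= age <= AGE_MAX)]

print(by_age)
print(out_of_range)
```

After sorting, the odd entries sit at the two ends of the column, which is exactly why sorting is a quick way to spot them by eye.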

>> Do the same with the other variables we are using today.

Are there any odd data entries?


What do you want to do with these?
Discuss in your group and then with a tutor.

SPSS Hot Tips on cleaning and record-keeping:

1. Make a note of any changes that you make in the dataset as you need to be consistent. Use
the List of Variables Excel spreadsheet that we have provided you with to record new
variables and make notes of any manipulations/adjustments that you make.
2. Remember to keep one version that is totally original and save any changes you make
thereafter as renamed/dated versions.
3. Keeping a team research journal is a good idea; log all major changes and analyses in it.
4. You should also be prepared to mention how you cleaned the data in the
Methods section of your group project report (this will be prepared as a journal
publication).
5. Careful decisions need to be made if you REMOVE or CHANGE any data point(s). You cannot
guess what this should have been. If you aren't sure what the value should be, but it's definitely
wrong, then it must be taken out of analyses by telling SPSS to see it as a missing value.



Step 7. Creating new variables

You obviously can't give labels to every value in a continuous variable. For example, age in the age variables in this
set is continuous data. Labels are not useful here, but we can create groups for continuous data. For instance, we could
create two age-groups for head of household (e.g. 18-34 = Under 35 yrs, and 35-55 = 35 yrs and over).

QUESTION: Have a think - what other variables could you create new groups for?

Also, for any categorical variable, you can amalgamate codes or create new variables from old ones.

To do this, use: Transform => Recode into same variable or Transform => Recode into different variable.

You can use this Transform/ Recode process to:

1. Change a code to a new code number (e.g. if 0 is male and 1 is female but you would rather they were 1 and
2 respectively).
2. Convert a scaled (continuous) variable into a grouped variable (e.g. as shown below in the DEMO for a
different dataset that also has an AGE variable). In the DEMO below, there are only 2 age-groups (OLD and
YOUNG) formed but you could make many and use this for further analysis.
3. Change a variable into a new variable that only includes some of the data. For instance you could make a
new variable that only contains an older age group.

>> NOW VIEW THE DEMO BELOW FOR INSTRUCTIONS ON HOW TO DO THIS (note that this demo is using similar
but different variable data to the data you are using, but the principle is the same).

>> Then, carry out this transformation for our HH.Age variable: create a variable called HH.Age.Groups with two
categories (18-34 = Under 35 yrs, and 35-55 = 35 yrs and over) and label these accordingly.

View Demo (2:19 mins):
https://moodle.telt.unsw.edu.au/pluginfile.php/1361716/mod_resource/content/2/SPSSDtaHandling/Demo1.htm

SPSS HOT TIP

To recode into the same variable, choose the Recode into Same variable from the
Transform menu.
HOWEVER take care as this will overwrite the original variable!
Therefore, it is good practice to ALWAYS Recode into Different variable as shown in
the demo.

>> Check out your Variable and Data Views to find your new HH.Age.Groups variable
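For comparison, the recode can be written out as a rule in Python. This is a sketch of the logic only, not what SPSS runs internally: the group code numbers 1 and 2 are our assumption (the text specifies only the labels), the function name recode_age_group is hypothetical, and the example ages are made up.

```python
def recode_age_group(age):
    """Map HH.Age onto two groups: 1 = Under 35 yrs, 2 = 35 yrs and over (codes assumed)."""
    if age is None:
        return None          # missing stays missing
    if 18 <= age <= 34:
        return 1
    if 35 <= age <= 55:
        return 2
    return None              # outside the expected 18-55 range: leave as missing

AGE_GROUP_LABELS = {1: "Under 35 yrs", 2: "35 yrs and over"}

# Hypothetical records (made-up ages) chosen to test the boundaries of each group.
records = [{"HH.Age": 18}, {"HH.Age": 34}, {"HH.Age": 35}, {"HH.Age": 55}, {"HH.Age": None}]

for rec in records:
    # Recode into a DIFFERENT variable: the original HH.Age value is kept untouched.
    rec["HH.Age.Groups"] = recode_age_group(rec["HH.Age"])

groups = [rec["HH.Age.Groups"] for rec in records]
print(groups)
print([AGE_GROUP_LABELS.get(g, "Missing") for g in groups])
```

Checking the boundary ages (34 and 35 here) is a quick way to confirm that a recode puts each case in the group you intended.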

<< YOU ARE NOW READY TO MOVE ONTO THE NEXT ACTIVITY >>

Appendix - Activity 1 QUESTION ANSWERS

Step 4. Understanding the data and variables in your dataset

QUESTION: How many cases are there?

ANSWER: There are 1032 Households in this dataset.

QUESTION: How many variables are there in total? What do they show?

ANSWER: There are 33 variables. They are varied: some seem to be categorical and some are scale data.

QUESTION: What other properties of the variables can you see in the columns here? What do you think these are
for?

ANSWER:
The columns in the variable view are: Name, Type, Width, Decimals, Label, Values, Missing, Columns, Align,
Measure, Role

The properties italicised above are more about the way the data is presented in the Data View.

The properties bolded above are the ones we will concern ourselves with today. They are important in how the
data can be analysed and displayed in the Output.

Step 5: Labelling and altering variables

QUESTION: Is Residence a nominal, ordinal or scale variable?

ANSWER:
Residence is NOMINAL in this survey, as the respondents were categorised into living either in an urban or
rural setting.

There are only 2 categories so this data is also dichotomous, but not represented here as a yes / no answer
(although it could be; think about it!).

Step 7. Creating new variables

QUESTION: Have a think - what other variables could you create new groups for?

ANSWER:
Could do this for all the age/ height / weight variables in this dataset.

Could also do it for categorical variables e.g. for HH.Marital.Status, could collapse down the 5 categories
into 3 e.g. Never married / Married or Living together / Divorced or Widowed.

Activity 2: Basic descriptive statistics (15 minutes)
Step 1. Running basic frequencies

Now you are ready to examine the data in more detail. Running frequency distributions is a good way to see your
data and also to clean it.

>> Choose Analyze from the top menu then Descriptive statistics => Frequencies

>> Choose the SCALE variables, HH.Age, HH.Height, HH.Weight and use the arrow to move them from the left-hand
column to the right-hand column.

>> Click on the Statistics option at the right-hand side of the screen.

>> In the pop-up box that appears, tick: standard deviation, range, minimum, maximum, mean, median, mode and
skewness

>> Click on Continue for this window, and then OK in the Frequencies box. Your Output Viewer will now pop-up,
showing the long frequency tables for these 3 variables. Above this it displays a Statistics table with the following
information:

>> Take a good look at the data in the frequency output tables in your Output document. Are there any odd data?
What were the age-limits for the cases for this study? Are there any outliers? If there were, what would you do with
these?

>> Look at the Statistics table. Which of the following: mean, median and mode is most useful here? Why?

>> What would you do to analyse these variables for male and female heads of household separately?
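
For reference, the statistics ticked in the Statistics box can also be worked out outside SPSS. The following is a minimal sketch using Python's standard library; the eight ages are made-up values, not taken from the dataset.

```python
import statistics

# Hypothetical ages for eight heads of household (made-up values).
hh_age = [22, 25, 25, 31, 38, 40, 47, 52]

summary = {
    "mean": statistics.mean(hh_age),
    "median": statistics.median(hh_age),
    "mode": statistics.mode(hh_age),
    "std_dev": statistics.stdev(hh_age),  # sample SD (divides by n - 1)
    "minimum": min(hh_age),
    "maximum": max(hh_age),
    "range": max(hh_age) - min(hh_age),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Comparing the mean, median and mode side by side like this is a quick way to spot a lopsided distribution or a suspicious value.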

SPSS HOT TIPS


The SPSS program gives good tips on how to interpret the results in the Output Windows.
Access the Statistics Coach from the Help menu, then a Tutorial window appears that will
take you through the interpretation of the results shown.

Step 2: To select cases

>> To limit an analysis to a specific group of a nominal variable or a particular value of a scale (continuous) variable,
use: Data => Select cases.

>> In the window that appears choose the condition that you wish to apply. For example, click on the If condition is
satisfied button and then click on the If button.

>> Another window appears called Select cases: if and you can then click on the variable(s) you are interested in
selecting cases from (e.g. HH.Sex). Take this variable across to the clear window.

>> Now you must type in the variable value that you want SPSS to select for. If we want to select only men, then
type =1 so that you have HH.Sex=1. This will select only cases where HH.Sex = 1 (Male) for the output of any analysis
that you now carry out. Therefore you MUST remember to remove this selection when you no longer want it, or your
results will only be for the men in the study!

>> If you run the basic frequencies again now you will find a big difference in the output! Only male heads of
household are analysed.
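
The effect of Select Cases can be pictured as a filter applied before the analysis. Here is a minimal sketch in Python, assuming four made-up households and the coding 1 = Male, 2 = Female:

```python
# Hypothetical mini-dataset; the values are made up for illustration.
households = [
    {"HH.Sex": 1, "HH.Age": 41},
    {"HH.Sex": 2, "HH.Age": 29},
    {"HH.Sex": 1, "HH.Age": 35},
    {"HH.Sex": 2, "HH.Age": 50},
]

# Equivalent of the condition HH.Sex = 1: keep only the male heads of household.
selected = [h for h in households if h["HH.Sex"] == 1]

mean_age_all = sum(h["HH.Age"] for h in households) / len(households)
mean_age_selected = sum(h["HH.Age"] for h in selected) / len(selected)

print("All cases:", len(households), "mean age", mean_age_all)
print("Selected cases:", len(selected), "mean age", mean_age_selected)
```

The full list is left unchanged, so dropping the filter brings every case back, just as choosing all cases does in SPSS.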

TO REMOVE THIS SELECTION:


>> Go back to the Data => Select cases window and select the All cases button, then click on OK.
Alternatively you can use the Reset button and click OK. Check what you have done by looking at the Data View.

>> Do this now so that you have all the data ready for use. Check that you have removed the select cases by looking
in your Output file and the Data View.

Step 3. Split the file

The other way to run your analyses by particular groups is to split the file using: Data => Split File

>> In the Split File window, choose how you want your analysis to take place: by Compare groups (which will
compare them directly in the same tables, etc.) or by Organize output by groups (which will display them
separately in the Output file). Maybe try both of these ways and see what the difference is.

>> Click on one of the buttons now and then take the HH.Sex variable across to the Groups Based on: window. Leave
the rest as it is (SPSS will have to sort your file by the grouping variable you have chosen) and click on OK.

>> If you run the basic frequencies again now you will find a big difference in the output! Try the whole process
again for the other way of organising the output.

>> Then REMOVE the Split File by either taking the variable back to the left-hand side or by clicking on Reset and OK.
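
Where Select Cases keeps one group and ignores the rest, Split File repeats the same analysis once per group. A minimal Python sketch of that idea, again with made-up households and the assumed coding 1 = Male, 2 = Female:

```python
from collections import defaultdict

# Hypothetical mini-dataset; the values are made up for illustration.
households = [
    {"HH.Sex": 1, "HH.Age": 41},
    {"HH.Sex": 2, "HH.Age": 29},
    {"HH.Sex": 1, "HH.Age": 35},
    {"HH.Sex": 2, "HH.Age": 50},
]

# Collect the ages separately for each value of the grouping variable.
ages_by_sex = defaultdict(list)
for h in households:
    ages_by_sex[h["HH.Sex"]].append(h["HH.Age"])

# Run the same summary once per group.
mean_age_by_sex = {
    sex: sum(ages) / len(ages) for sex, ages in sorted(ages_by_sex.items())
}

for sex, mean_age in mean_age_by_sex.items():
    print(f"HH.Sex = {sex}: n = {len(ages_by_sex[sex])}, mean age = {mean_age}")
```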

<< YOU ARE NOW READY TO GO ONTO ACTIVITY 3 AND DO SOME MORE ANALYSIS USING HISTOGRAMS! >>

Activity 3: Examining Continuous/Integer Data for Normality (10 mins)
Step 1. Creating Histograms
Now you know how to split the file, we can go back to the beginning again and find out the averages for males and
females separately and at the same time see if they are Normally distributed:

>> Let's split the file again by HH.Sex. Data => Split file => Organise output by groups. Place HH.Sex in the Groups
based on: window and click OK.

>> Choose Analyze from the top menu then Descriptive statistics and Frequencies.

>> Choose the original HH.Age variable and use the arrow to move this from the left-hand column to the right-hand
column.

>> Press the Statistics option and choose: standard deviation, range, minimum, maximum, mean, median, mode
and skewness.

>> Then press Continue and choose Charts from the menu at the right hand side.

>> Select the Histograms tick box and tick Show normal curve on histogram then Continue, then OK on the
original window.

>> In your Output viewer, you will find the frequencies and histograms for HH.Age presented separately for men and
women who are head of household.

The histogram for age of Male head of household:

QUESTION: What do the 2 charts (one male data, one female data) show you? Is there anything notable here? Why
are we looking at this variable separately?

Step 2. Checking distributions


What we can tell from these histograms, and the statistics (skewness and the central tendency stats of mean,
median and mode), is whether the continuous data in the variable HH.Age is distributed normally for male and
female heads of household. We can also see the range and have a guess at the mean/ median/ mode. (Note: these
statistics are presented in the top table called Statistics for each sex.)

>> Check the skewness values for HH.Age for male and female. Are these values between -1 and +1? What do the
histograms and normal curves look like?
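
To make the skewness check concrete, here is a small Python sketch. It uses the adjusted sample skewness formula, which is assumed (not confirmed here) to be the one SPSS reports, and both data lists are made up.

```python
import statistics


def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


def roughly_symmetric(values, limit=1.0):
    """Rule of thumb from the text: skewness between -1 and +1."""
    return -limit < sample_skewness(values) < limit


symmetric = [1, 2, 3, 4, 5]          # made-up, perfectly symmetric
right_skewed = [1, 1, 2, 2, 3, 10]   # made-up, one large value in the tail

for name, data in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    print(name, round(sample_skewness(data), 3), roughly_symmetric(data))
```

A value inside the -1 to +1 band does not prove normality; it only says the distribution is not badly lopsided, so the histogram still needs to be looked at.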

QUESTION: Do you think this data is normally distributed, close to it or far from it?

SPSS HOT TIP
Can you see that you could have a second split as well, e.g. to split the file by
HH.Age.Groups as well.
That way you could look at a third variable, e.g. height or marital status for the older and
younger groups by subgroups of men and women.

>> If you have time, why don't you try this out for HH.Height and HH.Weight, split by HH.Sex (Male/Female) and
HH.Age.Groups? TIP: it is easier to view and compare the groups if you use the Compare groups option in the Split File
window.

QUESTION: What do you find?

<< CHECK THE ANSWERS TO THIS ACTIVITY IN THE APPENDIX BEFORE YOU MOVE ONTO LOOKING AT
CATEGORICAL DATA IN ACTIVITY 4 >>

HOWEVER, BEFORE YOU MOVE ON: Remove the Split File conditions (for HH.Sex and HH.Age.Groups) by going to
Data => Split File and clicking on the reset button. Or you can click on the variables in the small window and send
them back using the arrow on the left to the main list. Either of these methods will remove the split.

>> Check your data is without any conditions and move on to Activity 4.

Appendix - Activity 3 QUESTION ANSWERS

Step 1. Creating Histograms

QUESTION: What do the 2 charts (one male data, one female data) show you? Is there anything notable here? Why
are we looking at this variable separately?

ANSWER: The data looks similar in terms of the shape of the distribution (a roughly normal, bell-shaped
curve) but there is more male data than female data, and the female data has a lower mean (36.2
years compared to 38.5 years for the men).

Step 2. Checking distributions

QUESTION: Do you think this data is normally distributed, close to it or far from it?

ANSWER: The skewness for both the men and women head of household age data is very low (<0.15) and
positive. The data distribution isn't a normal curve for either of these but it does mostly follow the curve and
looks somewhat symmetrical and bell-shaped. You would be hard-pressed to say this was normally distributed
though.

QUESTION: What do you find?

ANSWER:

For height

The histograms are much closer to normal distribution curves than the age distributions were.
The mean height of men and women is very different (women are shorter).
There is a tiny difference in the mean between the younger and older groups of both sexes, with the younger
groups being slightly taller. How might you test to see if this difference is statistically significant?
(Need a hint? Ask a tutor!)

For marital status

Women heads of household were more likely to be divorced or widowed.
Men and women were more likely to be living together in the older age group than in the younger age group.
Younger men who were heads of household were not divorced and were mostly married.

Activity 4: Examining categorical data (15 mins)
Step 1. Frequency tables for categorical data
>> Check you have removed all Split File and Select Cases.

>> Choose Analyze from the top menu then Descriptive statistics and Frequencies.

>> Remove any variables remaining from previously in the right-hand window. Also remove any Statistics choices in
that window by un-ticking them (we are not interested in these as our data is NOT continuous!).

>> Choose the variable HH.Sex and use the arrow to move it from the left-hand column to the right-hand window.

>> Choose Charts from the menu and then from Chart Type, select the button for Bar charts and for Chart Values,
choose either Frequencies or Percentages. Click on Continue and then OK in the original Frequencies window.

>> Your Output will appear again with more tables and charts, this time for HH.Sex. See screenshot on next page:

SPSS HOT TIPS
1. If you get a weird output, then you forgot to take off some condition and need to go
back to the beginning and read the instructions more carefully!
2. You can make these charts prettier by using the Chart Editor. Just double-click on the
chart you wish to edit. This opens the Chart Editor viewer. By clicking on an element
you wish to edit, you will open an edit window. Have a go at this: change the colour
fill of the columns, try altering the axis labels. See what you can do!
3. Tables and graphics can be copied across into other software such as Word by right-
clicking on the area to be copied then choosing from the popup menu (i.e. copy, copy
special, etc.). If this doesn't work, then take a screenshot.

>> Repeat this process for all the categorical variables in the dataset (be brave!) by taking all of them across to the
right-hand of the Frequencies window. SPSS will do them all at once. Take a look at your Output.

>> What does this show? Do you notice anything interesting? Discuss your findings in your group, then check with a
tutor that you are on the right track.

If you are uncertain about anything make sure you check it out with a tutor.

<< YOU ARE NOW READY TO MOVE ONTO THE INFERENTIAL TESTS IN ACTIVITIES 5-7 >>

Activity 5: Carrying out a Chi-square test (20 minutes)
Step 1. Creating a cross-tabulation
Let us assume that we wish to test whether there is an association between HH.Sex and our new young/ old age
groupings HH.Age.Groups.

>> Open up a crosstab window with the commands: Analyze => Descriptive Statistics => Crosstabs.

>> Select the variable for the Rows (always the variable you are interested in seeing if there is an effect on, i.e.
HH.Age.Groups) and move it across. Move HH.Sex into the Columns area (this variable is the grouping variable, i.e.
the variable you think might be causing an effect on the other one(s)).

>> Click on the Statistics option and then choose Chi-Square. Click Continue.

>> Choose the Cells... option in the original window and in the Cell Display window, select Observed and Expected
Counts and Row, Column and Total Percentages, then click OK.

Having all these results in the Output tables will help you to interpret the chi-square results:

SPSS will now show the numbers expected (Expected count) if there is no difference between the sexes for
each HH.Age.Groups category. In this way we can compare these values to the Observed (the actual values)
in the data.

In addition, you will be able to see the % across and down in the chi-square cross-tabulation that SPSS
creates for this analysis. This helps us to understand what the difference is, if there is one.

The output tables are shown on the next page. What do yours look like?

OK, so I am sure that you are asking: What does this all mean?

This cross-tabulation displays the number of cases in each category defined by two or more grouping variables. So
here we have the number of men and women that are in each age-group, young or old.

We also have been given numbers in the cells that show what would be expected if there was no difference
between the two groups (this is the basis of the chi-square test).

Essentially, a chi-square test is used to test the hypothesis that the row and column variables in a cross-tabulation
are independent. A low p-value (conventionally, a significance value below 0.05, or 5%) indicates that
there may be some relationship between the two variables, i.e. they may be dependent.

However, while the chi-square test may indicate that there is a relationship between our two variables
here, it does not indicate the strength or direction of the relationship. We have to look at the values in the table to
work that out.
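
The arithmetic behind the Expected counts and the Pearson chi-square value can be sketched in Python as below. The four counts are illustrative: they were chosen to be roughly in line with the percentages quoted later in this activity and are an assumption, not copied from the SPSS output. The p-value shortcut used here only works for df = 1.

```python
import math

# Illustrative 2x2 counts (an assumption, not the real SPSS table):
# rows = age group (Young, Old), columns = sex (Male, Female).
observed = [[237, 137],
            [480, 178]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell if age group and sex were independent.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 1 only, chi-square is a squared z value, so the p-value can be
# taken from the normal distribution via the complementary error function.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.4f}")
```

The statistic grows as the observed counts move away from the expected ones, but, as noted above, it says nothing about which cells drive the difference.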

SPSS TOP TIPS

1. Usually we quote the Pearson's chi-square test result, unless there are very small expected numbers
in the cells.
2. The chi-square test is a large-sample test, so when you have smaller sample sizes, an exact
distribution is used instead of the chi-square distribution, with a method called Fisher's
exact test instead of Pearson's.
3. So, you should take the p-value result for Fisher's exact test when more than 20% of the
expected values in the crosstab table cells are < 5.
4. See below the Chi-Square Tests table for an indicator of this (footnote a in the final table).
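
The 20% rule in tip 3 can be checked mechanically. The sketch below is a hypothetical helper (the function names and both tables are made up) that computes the expected counts and reports whether more than 20% of them fall below 5.

```python
def expected_counts(table):
    """Expected cell counts under independence for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def prefer_fisher(table, threshold=5, max_share=0.20):
    """True if more than max_share of the expected counts are below threshold."""
    cells = [e for row in expected_counts(table) for e in row]
    share_small = sum(e < threshold for e in cells) / len(cells)
    return share_small > max_share


small_table = [[3, 7], [2, 8]]       # made-up counts, small sample
large_table = [[40, 60], [55, 45]]   # made-up counts, larger sample

print("small table:", prefer_fisher(small_table))
print("large table:", prefer_fisher(large_table))
```

Note that the rule looks at the expected counts, not the observed ones, which is why SPSS reports it in a footnote to the Chi-Square Tests table.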

Results and Interpretation

So, for the results above, the chi-square value is significant at the 5% level of significance, as indicated by
p (Asymp. Sig. (2-sided)) = 0.001 in the Chi-Square Tests table.

>> Note: SPSS will report a value of p < 0.0005 as 0.000. When reporting a very statistically significant result like
this, you should present it as p < 0.001. You must not report it as 0.000 because, in theory, a p value of zero
would mean that you are certain, and we can never be that certain when using a
process of sampling!
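
The reporting convention can be written as a small helper. This is a hypothetical function (format_p is not part of SPSS) that never prints 0.000 and falls back to "p < 0.001" for very small values.

```python
def format_p(p):
    """Format a p-value for reporting, never as 0.000."""
    if p < 0.001:
        return "p < 0.001"
    return f"p = {p:.3f}"


for value in (0.0004, 0.0013, 0.176):
    print(format_p(value))
```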

So, this result shows that we have demonstrated with this test a statistically significant association between the age
group (young/ old) and sex of the head of household in this dataset.

>> Considering this result, how do we explain these findings? What is the association? We can view the data in two
ways: as the data is presented in a table, we can read across the rows or down the columns.

>> To see whether the make-up of the age groups differs, we can look across the rows.

To find out if there is a difference between men and women in the young group or the old group, we should examine
the percentages within the age group variable (% within Young/ Old groups) for male and female. This shows us that only 27.1%
of the older group of heads of household are women and most are men (72.9%). Among younger heads of
household, a larger share are women (36.6%), but men (63.4%) still outnumber women.

>> To interpret this another way, first look at the % within Head of household sex for each age group. This
gives you the % of men, and the % of women, who fall in each age group.

What you can see is that a higher % of women than men fall in the young group: 43.5% of all women are in this
young group, whereas only 33.1% of all men are in the young group. The opposite (not surprisingly; see note 3) is
seen in the older group, where 66.9% of men are in the older group, whereas only 56.5% of women are older.

3 Remember: you are looking at the data in a 2x2 table, so these proportions are all linked and can be used to explain
carefully what is present in the data!
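
The two ways of reading the table (across the rows and down the columns) can be sketched in Python. The counts below are illustrative and chosen so that the percentages agree with those quoted above; the counts themselves are an assumption, not taken from the SPSS output.

```python
# Illustrative counts (an assumption): age group -> sex -> number of households.
observed = {
    "Young": {"Male": 237, "Female": 137},
    "Old": {"Male": 480, "Female": 178},
}


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# % within age group: read across each row of the cross-tabulation.
within_age = {
    age: {sex: pct(count, sum(row.values())) for sex, count in row.items()}
    for age, row in observed.items()
}

# % within sex: read down each column of the cross-tabulation.
sex_totals = {
    sex: sum(observed[age][sex] for age in observed)
    for sex in ("Male", "Female")
}
within_sex = {
    sex: {age: pct(observed[age][sex], total) for age in observed}
    for sex, total in sex_totals.items()
}

print("% within age group:", within_age)
print("% within sex:", within_sex)
```

Both sets of percentages come from the same four counts, which is why the note above describes the proportions as linked.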
QUESTION: So - what does this actually suggest for our sample data?

ANSWER: What does this result actually mean?

Well, it shows that for the participants in our sample, you have found that there was a statistically significant
association between age group (young/old) and sex of the head of household, with a suggestion (from examining the
figures in the cross-tabulation) that women who are heads of households are more likely than men to be in the
younger group.

We can speculate why it may be that women are more likely to be young heads of households, or we can flip it
around and speculate why men might be likely to be older heads of households.

Take another look at your outputs for marital status frequencies and bar charts by split file of HH.Sex and
HH.Age.Groups. This might give you some other ideas for possible reasons for this difference.

More on chi square test

The degrees of freedom (df) for a 2x2 chi-square test are df = (rows - 1) x (columns - 1) = (2 - 1) x (2 - 1) = 1
(this is given in the Chi-square results table above).

Chi-square can get rather confusing. For more information:

The QMP Online Tutorial 8 on Chi-square tests gives this explanation in full detail.

Also, there is a detailed explanation on how to work out Odds ratio and confidence intervals for these results, in a
PowerPoint show in the group project area in the QMP Moodle.

<<IF YOU UNDERSTAND ALL OF THIS, YOU ARE NOW READY TO MOVE ONTO DOING A T-TEST IN ACTIVITY 6>>

Activity 6: Carrying out a t-test (15 minutes)
Step 1. Which t-test to use?

The Student's independent t-test can be used to test whether there is a statistically significant difference in the
population means for two independent groups.

>> By independent t-test we mean an unpaired t-test with a grouping variable (here HH.Sex) whose
groups are separate and independent; we can examine these groups to see if their mean heights are different.

By independent and unpaired we mean that we have not recruited one sample and measured each individual twice,
for example measuring height before and after an intervention, or once when young and again when old. If we had, this
would be paired data and we would need to use a paired t-test for the analysis. In looking at
men's and women's height we are examining totally different and independent data!

>> In health and medicine, statistical significance is conventionally assumed when a 'p-value' is less than 0.05 (5%).
A 'p-value' (in the case of a t-test) is the probability of observing a difference between the sample means at least as
large as the one seen if, in fact, there is no difference in the population means for the two groups, i.e. the
probability that a difference this large occurred by chance alone.

WHAT ARE WE TESTING?

>> Let us use a t-test to see if there is a statistically significant difference in the population means of the variable
HH.Height for the Male and Female Heads of household (as defined by HH.Sex). This is a sensible research question,
and we have good reason to believe that height is considerably affected by sex.

>> We should first check to see if height is normally distributed by checking the histogram with a normal curve and
the skewness and central tendency statistics. If you haven't done this yet, do it for HH.Height for each
HH.Sex value (Male and Female) separately.

Our null hypothesis (H0) is that there is no difference in the mean height of male and female heads of household in
the population from which our dataset came.

Our alternative hypothesis (H1) is that there is a difference in the mean height of male and female heads of
household in the population from which our dataset came.

ONE-SAMPLE OR TWO-SAMPLE TEST?

>> We have 2 separate, unpaired groups of males and females so will be using the independent t-test, which is in
effect a two-sample t-test.

Note: There is also a one-sample t-test that can compare a sample mean to a standard (e.g. a known population
value) or a hypothesised value.

TWO-TAILED OR ONE-TAILED TEST?

>> We might guess that males are, in fact, taller. This makes a lot of sense, as there is evidence
to show that in most populations, the males are taller than females on the whole (do you know why?).

However, we might be surprised to find that females are taller! We basically don't really know for certain and can't
presume. Hence we use a two-tailed test (SPSS sets this as default in the test windows we use to choose the t-test
below; see if you can find where it is!).

>> On occasions in science and medicine, when we are certain that the difference can only act in one direction, we could
use a one-tailed test. This is most commonly used in biological experiments where only one direction is physically/
physiologically possible.

So, we only use a one-tailed test in controlled experimental circumstances where we are certain of the direction of
the difference under examination. An example of a one-tailed, one-sample t-test would be where we are testing a
group sample mean to see if it is higher than a standard value (e.g. testing exam results for a Phase 1 cohort, say a
mean of 68.2%, in relation to the pass mark of 50%).

Step 2. Carrying out the test

>> It is pretty simple to carry this out in SPSS but you do need to take care to have cleaned your data and to choose
the correct variables for your question.

>> Our question is the alternative hypothesis above. However, we know that height is affected by age as well - so we
could split the file by our young-old age groups as well. Maybe try this afterwards and see if it makes a difference.

>> You will find the Independent-Samples T Test in the Analyze => Compare Means option.

>> In the t-test, you define the test variable as HH.Height and the grouping variable as HH.Sex. It is necessary to
state the values of the grouping variable by defining the two possible values to be used in the t-test. In this case it is
1 = Male, 2 = Female.

>> Leave the rest of the options as the default (take a look and see what these are, but don't touch anything).

>> The results for this analysis are on the next page (remember your final data results may look slightly different if
you have been playing with the dataset).

You will note in the top table in the SPSS output that the men have a higher mean height than women in this dataset.
Can you see how much taller they are on average? Now we need to see what the test results are:

QUESTION 1:
What do Levene's Test for Equality of Variances and 'Equal variances' mean, and how do I choose which row of
the results to quote as the result?
ANSWER re Levene's Test:

Levene's test is run automatically by SPSS to check whether the variances (spread of the data distribution) of
our two groups (Male and Female) are unequal, because unequal variances can affect the t-test result.

This is just like other statistical tests of difference that you have come across so far. The F test value gives a p value
(Sig) for our data. If the significance value for Levene's test is high (conventionally greater than 0.05) we use the
results that assume equal variances for both groups (i.e. there is no difference in the variances between the groups).

If you remember, variance is a measure of the spread of the variable and if p > 0.05, then that would mean that the
data variance for height in both male and female in this dataset is similar enough and we can use the top answers in
the table for the t-test result (i.e. the crude t-test results).

Why bother with Levene's test?

Differences in variance between the groups of the grouping variable can influence our test results. If Levene's test
shows a significant difference between the groups (F gives Sig < 0.05), then it is sensible to use the
corrected 'Equal variances not assumed' results, and these should be quoted instead. Note that the df also changes
(decreases) in these results, as corrections have been applied to the analysis.

In the analysis of our data, Levene's test is NOT significant at the 5% level (F = 1.837, p = 0.176) i.e. p>0.05, so, we
read the results from the top (crude results) row, as we assume that the variances are equal (i.e. there is no
statistical difference in the variances of the groups of Male and Female Heads of households, for their height).
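
To show why the df differs between the two rows of the output, here is a minimal Python sketch that computes the t statistic both ways. The heights are made up, Levene's test itself is not implemented, and it is an assumption that the 'Equal variances not assumed' row corresponds to the method commonly known as Welch's t-test.

```python
import math
import statistics

# Hypothetical heights in cm (made-up values, not from the dataset).
men = [172, 168, 175, 181, 169, 174]
women = [158, 162, 155, 160, 166, 157, 159]

n1, n2 = len(men), len(women)
m1, m2 = statistics.mean(men), statistics.mean(women)
v1, v2 = statistics.variance(men), statistics.variance(women)  # sample variances

# Equal variances assumed: pool the two variances.
pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_pooled = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
df_pooled = n1 + n2 - 2

# Equal variances not assumed (Welch): keep the variances separate and
# use the Welch-Satterthwaite approximation for the degrees of freedom.
se1, se2 = v1 / n1, v2 / n2
t_welch = (m1 - m2) / math.sqrt(se1 + se2)
df_welch = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))

print(f"Equal variances assumed:     t = {t_pooled:.3f}, df = {df_pooled}")
print(f"Equal variances not assumed: t = {t_welch:.3f}, df = {df_welch:.1f}")
```

The corrected df is never larger than n1 + n2 - 2 and is usually not a whole number, which matches the decrease in df described above.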

THE BIG QUESTION:


Did you notice that the sample size for our t-test above is large (total sample size >>> 60)? Has SPSS done a t-test or,
in effect, another test?

ANSWER:
The t-test is particularly useful for small samples (total sample size of both groups < 60, and especially < 30). SPSS has
still carried out a t-test here, but with a sample this large the t distribution is practically identical to the normal
distribution, so the result is essentially that of a z-test based on the normal distribution z values.

QUESTION 2:
So what is the t-test result for our question?

Take a look at the tables for yourself and work out the result and what this actually means for our male and female
groups before you take a look at the answer given below.

QUESTION 2 ANSWER:

The top table gives us the means of the heights for the 2 sexes: male = 170.65 cm, and female = 158.10 cm.

The Independent Samples Test results in the second table show that t = 14.854 (df = 1030*) with a statistical
significance of p<0.001**. This is a statistically significant result.

NOTE, though, that the t statistic would be a lower value if we had needed to use the correction for unequal
variances. Sometimes this will cause our p value to increase enough to become not statistically significant at the 5%
level.

However, for our results here, we can conclude that the difference between the mean heights of Male and Female
Heads of households that we see (of 12.6 cm) is very likely to be a real difference.

This p value is <0.05 (remember that 5% is our conventional level for accepting or rejecting the null hypothesis) so
we can reject our null hypothesis and accept our alternative hypothesis.

Therefore, in conclusion: there is a statistically significant difference in the population mean height of Male and
Female Heads of household with the men being on average 12.6 cm*** taller than the women (p<0.001).

FOOTNOTE EXPLANATIONS:

* Remember df = degrees of freedom. For an unpaired t-test, df = (n1 + n2) - 2. Note that if we had used the corrected t
value for unequal variances, the df would be lower, reflecting the correction applied (for this test, it would
have been df = 643.8, if we had needed the correction).

** We can't quote p as 0.000 for this result (not a feasible probability!). This is in fact a value of <0.0005 that SPSS
has rounded to 3 decimal places as .000. So, the convention is to express this very low number as
p < 0.001.

*** It is sensible to round figures (especially effect sizes, e.g. mean difference for t-tests) up or down to a
reasonable number of decimal places. Here the mean difference between the men's and women's heights (12.555 cm)
is rounded to 1 decimal place (12.6 cm), firstly as the participants would have given their height in whole cm, and
secondly as these calculated means can be taken sensibly to one decimal place more accurate than this (i.e. to the
nearest 0.1 cm).

<<IF YOU ARE OK WITH THIS TEST, MOVE ONTO THE LAST ACTIVITY: 7. CORRELATION & REGRESSION>>

Activity 7: Correlation and Regression (15 minutes)
Step 1. Plotting the data

Correlation and regression are useful methods for analysing relationships between two continuous variables.

Let us examine two continuous variables that we suspect might have a relationship: height and weight.

Firstly, does height predict weight, or the other way around? This is important, as we want to use regression to
predict one variable with the other.

In fact, if you think about it, height does predict weight. We will use our dataset variables of HH.Height and
HH.Weight. However, before we go any further we need to consider that these variables contain data for both men
and women heads of households. An interesting feature of height and weight is that they are considerably
affected by the sex (and age) of a person.

>> So, before we start, we should first Split the File by HH.Sex => Organise output by groups.

>> We should examine the data first by graphing to see whether there appears to be a linear relationship between
x and y by carrying out a scatterplot.

By convention, we must remember to put HH.Height as the independent variable (x) and HH.Weight as the
dependent variable (y). Do you know why this is? Discuss as a group and check with a tutor.

>> Check you have the right limits on the dataset (i.e. for HH.Sex Split File).

>> Choose Graphs from the top menu in SPSS, choose Legacy Dialogs then Scatter/Dot...

>> Choose the Simple scatterplot as shown below. Then place HH.Weight on the y-axis and HH.Height on the x-
axis and click OK.

The scatterplots will appear in your Output viewer: one for Male (shown below) and one for Female.

The scatterplot for women looks less dense than the men's one. It could just be due to the smaller sample size for
women, or it could be that there is more variation in the relationship between height and weight for women.



SPSS TOP TIPS

1. On looking at this graph you can see lots of OUTLIERS: data that are unusual with
respect to the other data as they are higher or lower in value. How many of these do you
think are mistakes in measurement or errors in data entry?
2. See how you can use this graph to identify possible problems with your data by showing
up these outliers. You should plot all the continuous data variables against each other
to take a look at this before you start the proper analysis. Then you can remove or edit
the mistakes.
3. N.B. remember to make a note if you change anything. In writing up, researchers have
to show how they cleaned up the data as this could obviously have a major effect on
results.

Looking at the graph above, there appears to be some positive correlation between height and weight for the male
group.

Can you estimate where a line of best fit might run? Use your finger to show where it might be.



In fact, you can actually add one in with SPSS:

>> In the Output viewer, double-click on the chart and this will open up a Chart Editor window.

>> Along the top are some chart-like icons: find the one that will Add a fit line at total and click on it. You should
get a line like this one, with a box showing the equation for the line (don't worry about confidence intervals at the
moment):

QUESTIONS:
Does this line have a positive slope or negative slope?

Is it a good fit?

How far are the data-points from it?



ANSWERS:

Does this line have a positive slope or negative slope?

It is hard to tell just by looking at this scatter plot. However, we were right about the POSITIVE correlation:
the slope of this line is positive.

Is it a good fit?

To find this out, we need to test the correlation and carry out a linear regression analysis.

>> Choose Analyze => Regression => Linear from the top menu in SPSS.

>> Choose HH.Weight as the dependent variable and HH.Height as the independent variable. Leave the rest of the
properties at the default settings. You can have a look in Statistics etc. but don't change anything.



Step 2. Further analysis and interpretation

The important output tables for Head of household = Male are displayed below.

The correlation coefficient (see Table: Model Summary)

r = the correlation coefficient, an expression of the correlation between the observed and predicted values of
the dependent variable.

The values of r for the model produced by the regression procedure (here a simple straight line) range from 0 to
1. A larger value of r would indicate a stronger relationship between the two variables that you are looking at
here.

r squared = the proportion of variation in the dependent variable explained by the regression model.

r squared values range from 0 to 1. A small value for r squared suggests that the model (in this case the model is
linear) does not fit well.
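
To make the definitions of r and r squared concrete, here is a small Python sketch that computes Pearson's correlation coefficient by hand. The function name pearson_r and the five height/weight pairs are hypothetical and invented for illustration; they are not the course dataset. Note one assumption worth stating: Pearson's r can in general range from -1 to 1, whereas the R shown in the SPSS Model Summary for a simple straight-line model is its absolute value, which is why it ranges from 0 to 1.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical heights (cm) and weights (kg), for illustration only
heights = [160.0, 165.0, 170.0, 175.0, 180.0]
weights = [58.0, 66.0, 61.0, 75.0, 72.0]

r = pearson_r(heights, weights)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

Squaring r gives the proportion of variation in the dependent variable that the straight line explains.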

QUESTION: What do these results mean?



ANSWER:
From our results (Table: Model Summary), it can be observed that r = 0.239 and r² = 0.057.

This means that 6% of the variation in weight is explained by height for male heads of household in our dataset.
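
As a quick check of the arithmetic, the short Python sketch below squares the reported r and converts the result to a percentage. Only the value r = 0.239 comes from the Model Summary quoted above; the variable names are just for illustration.

```python
# r as reported in the Model Summary for male heads of household
r = 0.239

r_squared = r ** 2
print(round(r_squared, 3))         # 0.057
print(f"{r_squared * 100:.0f}%")   # 6%
```

So the 6% figure is simply r squared, rounded and expressed as a percentage.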

But there is other information here also. What does the information in the other tables mean?

The straight-line equation

The coefficients table also lets us describe the relationship as an equation. The usual equation for a straight
line is y = a + bx, and we can fill in its constants from the coefficients table above.

a is the equation constant (the intercept: the value of y where the line crosses the y-axis, i.e. when x = 0).
b is the coefficient for the independent variable: the slope of the line, which describes the relationship between
the independent variable (HH.Height, x) and the dependent variable (HH.Weight, y).

Coefficient b gives us an estimate of how much weight (y) changes for each one-unit change in height (x).

QUESTION 1: Work out what these results mean for the output that you have obtained for HH.HEIGHT and
HH.WEIGHT for the dataset.

QUESTION 2: So, do you think that height and weight are related for the heads of the household (for Male and
Female separately)?

>> See the APPENDIX for Activity 7 below for the answers and a full explanation.

Some final SPSS HOT TIPS

1. You can add titles, subtitles etc. using the scatterplot Chart Editor window.
2. Interactive charts are possible and very exciting but not so useful for your purposes here.
3. Remember to stick to black and white graphs and tables. Colour is very pretty but journal
articles (and medical faculty submission instructions) stipulate black and white in general as
this is how they are printed.
4. Last, but not least: correlation does not necessarily mean causation.



Appendix - Activity 7 ANSWERS

QUESTION 1: Work out what these results mean for the output that you have obtained for HH.HEIGHT and
HH.WEIGHT for the dataset.

ANSWER:
Remember that the equation for a straight line is: y = a + bx.
From the coefficients table above we read the values of the Unstandardized B column to find the coefficients for
our line equation.

a = 27.039 (the (Constant) row): this is the equation constant, where the line would cross the y-axis (when
x = 0).
b = 0.213 (the Height (head of household) row): this is the gradient of the line, or how much y (weight)
changes per unit of change in x (height).

As the usual equation for a line is: y = a + bx, this gives us the line equation of y = 27.039 + 0.213x.

In other words, as height (x) increases by 1cm, weight (y) increases by 0.213kg.
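
To see the line equation in use, here is a small Python sketch that plugs a height into y = 27.039 + 0.213x. The coefficients are the ones read from the Coefficients table above; the function name predicted_weight and the example height of 170 cm are hypothetical choices for illustration.

```python
# Coefficients read from the Unstandardized B column (male heads of household)
A = 27.039  # (Constant): intercept, in kg
B = 0.213   # slope: kg of weight per cm of height


def predicted_weight(height_cm):
    """Weight (kg) predicted by the fitted line y = a + bx for a given height (cm)."""
    return A + B * height_cm


print(round(predicted_weight(170), 3))                           # 63.249
print(round(predicted_weight(171) - predicted_weight(170), 3))   # 0.213
```

The second line shows the meaning of b directly: one extra centimetre of height adds 0.213 kg to the predicted weight. Bear in mind that this is the average trend only; with so little of the variation explained, individual weights can lie far from the line.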

Going back to the correlation coefficients, in general if r² is larger then there is more likely to be a significant
relationship between the two variables. The significance is actually given in the Sig. column in the ANOVA table.
Here it is statistically significant at p<0.001 (SPSS displays this as .000, but a p value is never exactly zero, so
we report such a minute probability as p<0.001, if you remember).

In words, we can say that if the significance value of the F statistic (the regression analysis test statistic) is small
(smaller than a probability of 0.05, the p value convention) then the independent variable is said to do a good job
of explaining the variation in the dependent variable.



QUESTION 2:
So do you think that height and weight are related for the heads of the household (for Male and Female)?

ANSWER:

Well, it seems that height and weight are positively linked in this dataset for the men who are heads of
household, and that this relationship is highly statistically significant (p<0.001).

However, only 6% of the variation in weight is explained by height (this was the r² result). So there must be other
factors affecting weight that we have not looked at (and may not even be able to examine with this dataset).

Remember also that although we might have shown some statistical significance, this does not PROVE that these 2
variables have a causal relationship. We can say that there appears to be a relationship but we have not proved that
one changing actually causes the other. There may be an indirect link that we do not know about here.

>> Take a look at the output tables for the Female group. What does this show: different findings, or similar? Check
your answers with a tutor.

WELL DONE!

<< IF YOU HAVE COMPLETED ALL 7 ACTIVITIES, THEN YOU HAVE REACHED THE END OF THE BASIC
INSTRUCTIONS FOR LEARNING SPSS>>



Instructions for the Research Skills Formative Analysis Submission

Aim
Your group task in SH is to perform a formative data analysis that you will submit to SH Moodle during/directly
following your practical in Week 6.

Task description:
Submit your group's final formative analysis as an MS Word document, as shown below.
This formative submission should contain the basic descriptive statistics for the subjects in the dataset.
It should also contain a few key basic analyses of the demographic variables that you are interested in analyzing
for your BGDB group project.
You must demonstrate the following basic inferential analyses: at least one t-test, one chi-square test and one
correlation with simple linear regression.
It is recommended that your group uses the Research Skills Formative Assessment Proforma provided in QMP
Moodle.

Formative Assessment:
This submission will be marked formatively and returned to you electronically in week 1 of BGDB, with feedback
added directly to the Word document.
An unsatisfactory grade will be given if this assessment is not submitted on time.
This task will be useful preparation for your BGDB group project on Child Health.
Submission details will be provided in the Week 6 practical session.

Submission Format:
Your submission should be uploaded as a Microsoft Word document and include the following:

Group Details:
NAMES & STUDENT No. (State your names and student z numbers)
SG GROUP No. (State your SG group, e.g. B8, A6 etc.)
CHILDHOOD RESEARCH TOPIC (State the topic that you are analyzing: Immunisation, Acute Respiratory
Infection, Diarrhoea)

Provide an analysis as follows:

1. Demographic Analysis:
Include some basic descriptive statistics re the households and individuals in the dataset. Run basic
frequency tables and cross-tabs to find out numbers and % for sex and age-groups for all survey
participants. Include other descriptive statistics for other variables (categorical, ordinal and
continuous) that are interesting, e.g. marital status of head of household, head of household and
spouse: weight, height and BMI.
Comment on the averages, the skewness and normality of the continuous data.
You should produce a summary table to display this descriptive data as Table 1 and include some
bar charts and histograms (as appropriate for the data).



2. Basic Inferential Tests:
We cover this during the Research Skills practical sessions 2 & 3.
See the Research Skills Instructions.
Provide one each of the following tests:
1. Chi-square analysis (include odds ratio)
2. T-test (paired or independent)
3. Correlation (with scatter plot) and simple regression analysis.

For each of these 3 tests provide the following information:

1. The research question or hypotheses (null and alternative). What are you looking for/testing for?
2. Which variables are being used and for which data? (i.e. are you using a split file and looking at
gender separately? etc.)
3. Which test are you using, and why is it appropriate for this hypothesis and the data?
4. Provide the relevant test results as OUTPUT tables (and graphs if appropriate). You can copy and
paste or screenshot these into the word document from your SPSS Output file.
5. Provide a summary of the key findings (extract the information from the output tables). Include the
figures for the following:
main outcomes
the test statistic
the degrees of freedom value for the test
the p value and confidence intervals
odds ratio if relevant/ available (2x2 chi square test). Ask a tutor how to do this.
6. Write a brief interpretation of the results in light of your research question or null/alternative
hypothesis. Do you reject the null hypothesis or fail to reject it? Why?
7. Finally, write a full formal sentence that encapsulates all of the information necessary to convey
each of your test findings, as you might find it written in a journal article.

Further advice re carrying out and interpreting the tests:

Chi-square analysis
Make sure that your variables are placed correctly in rows and columns: it is best to have the variable you are
interested in comparing in the columns.
State the full results and try to interpret the odds ratio (not the risk ratio: this is a survey, so the data are
cross-sectional and we use odds). To do this, tick the Risk box in the Statistics window when choosing the
chi-square test.
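
To show what the odds ratio in the SPSS Risk output represents, here is a minimal Python sketch for a 2x2 table. The function name odds_ratio, the cell labels a, b, c, d and the counts are all hypothetical and invented for illustration. Which group ends up as the numerator in SPSS depends on how your rows and columns are arranged, so check this against your own output.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table laid out as:

                    outcome yes   outcome no
        group 1          a            b
        group 2          c            d

    Odds in group 1 = a / b, odds in group 2 = c / d, so OR = (a * d) / (b * c).
    """
    if 0 in (a, b, c, d):
        raise ValueError("odds ratio is undefined when any cell count is zero")
    return (a * d) / (b * c)


# Hypothetical counts, for illustration only
print(round(odds_ratio(30, 70, 15, 85), 2))
```

An odds ratio of 1 means the odds of the outcome are the same in both groups; values above 1 mean higher odds in group 1, and values below 1 mean lower odds.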

T-test
Is this a paired or independent t-test? Why?
Don't forget to quote the actual outcome, which here is a difference between the mean values!
For Levene's test, state what the results are and what this means. Then provide the actual t-test results.
In journal reports, Levene's test is usually not mentioned as this is just an accepted part of the
analysis, so you will not be reporting it in your final report in BGDB. However, here we ask you to include
it, as it will help us to see that you understand what this is showing!
You can provide a p value for the t-test but also a confidence interval. Quote this carefully.



Correlation and regression analysis
Take care with which is the dependent variable, e.g. weight (y-axis), and which is independent, e.g.
height (x-axis).
Provide the scatter plot for the variables you have chosen. Include the best-fit line.
It is important to read the output tables carefully. You should report on:
Line of best fit:
r (the correlation coefficient) and r².
Note the direction of the correlation: is it positive (a positive correlation) or
negative (a negative correlation)?
How large is r², and what is the significance (Sig.) in the ANOVA table? This gives the significance of the
relationship between the 2 variables under examination: if the significance value of the F
statistic is small (smaller than 0.05, the convention) then the independent variable does a good
job of explaining the variation in the dependent variable.
Describing the line using an equation:
The relationship between the variables using the constants in the Coefficients output table to
write a formula for the regression line:
y = a + bx, where a = (Constant) and b = the independent variable in the Unstandardized B
column of the Coefficients table.
b gives an estimate for how much the dependent variable (y, e.g. weight) changes for each unit
of the independent variable (x, e.g. height). Explain what this means for dependent variable y in
terms of a 1 unit increase in the independent variable x.
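
For those curious about where the constants a and b come from, here is a small Python sketch of an ordinary least-squares fit for a straight line. It is only an illustration of the standard formulas, not a description of SPSS internals; the function name fit_line and the data values are hypothetical and invented for this example.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y = a + bx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx             # slope: change in y per unit change in x
    a = mean_y - b * mean_x   # intercept: value of y when x = 0
    return a, b


# Hypothetical heights (cm) and weights (kg), for illustration only
heights = [160.0, 165.0, 170.0, 175.0, 180.0]
weights = [58.0, 66.0, 61.0, 75.0, 72.0]

a, b = fit_line(heights, weights)
print(f"y = {a:.3f} + {b:.3f}x")
```

In your report, a and b should be read from the Coefficients table rather than recalculated; the sketch is only meant to show what the two numbers represent.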

NB: remember that correlation does not necessarily mean causation!

REMEMBER GOOD TEAMWORK MAKES GOOD RESEARCH:

It is a good idea if the whole group takes part in the analysis as you will learn more about the data this
way AND be preparing for your ILP!
The biggest problem detected in poor projects (P-) each year is lack of understanding of the data.
Dividing the work up so that individual members miss out on the practical work creates misunderstandings re
the data and your research questions.
The best groups work together through ALL the practical sessions.
