Statistics Lecture1 2.10.18

Cohort study-practical
Research Methods and Introduction to Statistics
By
Dr Saiful Islam, Medical Statistician, IoN
&
Dr Caroline Selai, Senior Lecturer, IoN
Date: 02.10.2018
Module code: CLNE0007
Module name: Research Methods and Introduction to Statistics
Introduction to statistics will cover:
• Why statistics?
• How to conduct research / research process?
• Role of statistics in research.
• What do my research data look like?
• Data presentation/display.
• Data analysis using appropriate statistical tests/methods
• How to interpret / present statistical output?
Lectures: 8 one-hour lectures
Workshops: 2 (repeated)
Revision lectures: 2
Assessment: 1 hour unseen written exam (proposed).

Critical appraisal will cover:
• What is Evidence Based Medicine (EBM)?
• Why EBM?
• Hierarchy of evidence.
• How to extract the evidence you need?
• What is critical appraisal?
• Methods of critical appraisal?
• Consider a variety of published research papers to make it clear how you could appraise them critically.
Lectures: 8 one-hour lectures

Exam date: 6th February 2019 at 11.30am
Total credits = 15. Half in introduction to statistics & the other half for research
Overall module aim
• This module aims to equip you to do your own research independently by
• Understanding the research process
• Critically appraising any research paper
• Understanding current research methods by critically appraising some recent research
• Knowing clearly the different statistical methods needed in common research
• Presenting/displaying your own research data
• Learning the statistical tests/methods needed for neuroscience research
• Learning at least one statistical software package well (we will use STATA), with the aim of analysing and interpreting your own data.
Lesson plan
• Why learn statistics
• Research process
• Role of statistics in data analysis
• Summary measure of the data
• Identifying outliers in your data
• Types of data
• Data management
Introduction to data analysis
Learning outcome
• At the end of today’s lecture and workshop, you should be familiar with
• Importance of learning statistics
• How statistics is related to Neurology/Neuroscience
• Different ways of displaying and summarising data
• Identifying outliers in my data
• How I can manage my own research data
• Using STATA to carry out exploratory analysis and presentation of a dataset (histogram, box plot, cumulative frequency)
Why learn statistics?
The reason you are here is because you have an inquiring mind!
• Does using a mobile phone increase risk of brain cancer?
• Is drinking the occasional glass of wine during pregnancy harmful to the baby?
• Why do women live longer than men?
• What are the potential health risks of climate change, and who will be most
affected?
• Is there a gene for Alzheimer’s?
• Will banning cigarette sales from vending machines reduce smoking rates in
children?
• Should all children be routinely offered the swine flu vaccination?
• To answer interesting questions, you need two things: data and an explanation of those data
Other Reasons
In the MSc course:

Analysing MRI data
Analysing dementia / Alzheimer disease data
Preparing poster
Reading scientific paper
Conducting MSc project
Post-qualification:
Interview for PhD/Job
Doing PhD
Publishing paper
Leading judge who hanged himself after dementia diagnosis left wife a note
saying she had 'a life to live', inquest hears :Telegraph Reporters
7 June 2017 • 11:39am
Sir Nicholas, who has died aged 71, was England’s senior divorce court judge. He had a rare neurological
disease called fronto-temporal lobe dementia, which had only recently been diagnosed.
Fronto temporal dementia is one of the least common forms
of dementia and is sometimes called Pick's disease or
frontal lobe dementia, according to the Alzheimer's
Society.
It affects the parts of the brain that control behaviour and
emotions, plus the understanding of words. Fronto
temporal dementia is caused when nerve cells in the
frontal and/or temporal lobes of the brain die and the
pathways that connect them change.
More in-depth research in this area could allow earlier
detection of this rare neurological disease, which
might save the lives of people like this.
Undiagnosed: mother-of-four Marina Fagan had a family history of aneurism. Her brain
disease went unrecognised for 13 days :
Evening Standard : Wednesday 15 June 2016 08:53
Marina Fagan, a 51-year-old mother of four, was discharged following a two-day stay at
Whipps Cross hospital, in Leytonstone, after investigations ruled out a brain haemorrhage.
She returned to A&E the same day as her headache persisted but was advised to get her
GP to refer her to an outpatient clinic. Her condition was finally diagnosed 11 days after she
was first admitted to hospital. She died six days later, on October 6, 2015.
So we need more research & more neurologists
This means that neuroscientists should be involved in more research to
understand the underlying/persisting disease process.
The research process
Statistical thinking is involved in all these phases, along with substantive scientific knowledge.
Role of statistics in data analysis
• Data are the raw material of knowledge
• Scientists rely on data to provide empirical evidence to support and refine their theories
• Governments, businesses, communities, hospitals, GPs and individuals need data to help inform decision-making and risk assessment
• Learning statistics will provide you with basic skills to read and
understand data
• Broadly speaking, statistics provides us with techniques for
– Summarising and presenting the information contained in a data set
– Handling and quantifying variation and uncertainty in the data, to help us
infer what they tell us about the underlying theory of interest
Statistics – the art of telling stories with numbers

Summary measure of any numerical data:
mean, median, mode and inter-quartile range (IQR)
Mean, Median, Mode , range and IQR
Example: Patient ages (ordered)
24 32 37 39 41 41 43 44
Quartiles: 25th percentile (Q1) and 75th percentile (Q3)
Mean = sum of all values ÷ number of values = ?
Median (middle value) = ?
Inter-quartile range = 75th percentile – 25th percentile = ?
Range = largest value – smallest value = ?
Mode: the value that occurs most often, which is ……..
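The summary measures above can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module, and it assumes the eight ages that reproduce the STATA summary reported for this example (Obs 8, mean 37.625); the variable names are illustrative.

```python
import statistics

# Assumed data: the eight ordered patient ages behind the STATA summary.
ages = [24, 32, 37, 39, 41, 41, 43, 44]

mean_age = statistics.mean(ages)      # sum of all values / number of values
median_age = statistics.median(ages)  # average of the two middle values when n is even
mode_age = statistics.mode(ages)      # the value that occurs most often
age_range = max(ages) - min(ages)     # largest value minus smallest value

print(mean_age, median_age, mode_age, age_range)
```

The range is taken as largest minus smallest so that it is never negative.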
Variability within data – Variance and standard deviations
[Figure: example data values 11, 9, 10, 12, 8, 10]
Standard deviation (Std. Dev.) = √(variance)
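As a rough illustration of how the variance and standard deviation are built up, the sketch below computes them step by step for the eight patient ages assumed from the STATA example (Obs 8). It uses the sample variance, which divides by n - 1; that choice is an assumption made here because it matches the values STATA reports for these data.

```python
import math
import statistics

# Assumed data: the eight patient ages from the STATA example.
ages = [24, 32, 37, 39, 41, 41, 43, 44]
n = len(ages)
mean_age = sum(ages) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(x - mean_age) ** 2 for x in ages]
variance = sum(squared_deviations) / (n - 1)

# Standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(round(variance, 5), round(std_dev, 6))
```

The same numbers can be obtained directly from `statistics.variance(ages)` and `statistics.stdev(ages)`.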

Use statistical software STATA
We are in the age of technology, so use the statistical software
STATA. Type the data into STATA and give the variable the name ‘Age’.
Type the following command in the STATA command line:
summarize Age
Output is
Variable    Obs    Mean      Std. Dev.    Min    Max
Age         8      37.625    6.674846     24     44

But you should know what the Mean, Std. Dev. and all the other quantities are.
Use statistical software STATA
Type the following command in the STATA command line to get
more information (quartiles, median etc.):
summarize Age, detail
Output is
                  Age of patients
      Percentiles    Smallest
 1%        24            24
 5%        24            32
10%        24            37       Obs                  8
25%      34.5            39       Sum of Wgt.          8
50%        40                     Mean            37.625
                     Largest      Std. Dev.     6.674846
75%        42            41
90%        44            41       Variance      44.55357
95%        44            43       Skewness     -1.136833
99%        44            44       Kurtosis      3.142978
Mean < median
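The percentiles in the detailed output can be reproduced by hand. The sketch below is an assumption about the rule rather than a statement of how STATA is implemented: it takes position P = n × p / 100, averages the P-th and (P+1)-th ordered values when P is a whole number, and otherwise takes the next value up. The function name `percentile` is hypothetical, and the rule is only intended for 0 < p < 100.

```python
import math

def percentile(sorted_values, p):
    """p-th percentile (whole number p, 0 < p < 100) of already sorted values."""
    n = len(sorted_values)
    if (n * p) % 100 == 0:
        k = n * p // 100                     # whole-number position (1-based)
        return (sorted_values[k - 1] + sorted_values[k]) / 2
    k = math.ceil(n * p / 100)               # otherwise round the position up
    return sorted_values[k - 1]

# Assumed data: the eight patient ages from the STATA example.
ages = [24, 32, 37, 39, 41, 41, 43, 44]

q1 = percentile(ages, 25)
median_age = percentile(ages, 50)
q3 = percentile(ages, 75)
iqr = q3 - q1                                # 75th percentile minus 25th percentile

print(q1, median_age, q3, iqr)
```

Other software may use a different interpolation rule, so quartiles for small samples can differ slightly between packages.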
Measure from graph
• Negatively skewed: there is no symmetry in the data and it looks negatively skewed, so the mean and standard deviation are not appropriate measures; use the median and inter-quartile range.
• Positively skewed: there is no symmetry in the data and it looks positively skewed, so the mean and standard deviation are not appropriate measures; use the median and inter-quartile range.
• Normal distribution: the tails extend equally on both sides, so the mean and standard deviation are appropriate measures.
“A picture is worth a thousand words”
• Graphical presentation of data enables us to get a feel for:
– typical (central) values and range of values
– shape and spread of the distribution of values
– interesting patterns and relationships in the data
– ……..
• Graphical displays also help reveal problems with data quality, e.g.:
– outlying / erroneous observations
– digit preference
– ……..
Displaying Data
Several possible methods:

• Tables
– Frequency Tables
– Cross tabulations (contingency tables)
– …...
• Graphs
– Bar Charts
– Histograms
– Line Graphs
– ……
Displaying Data
• Before embarking on formal statistical analysis of a dataset, it is essential to carry out some simple exploratory analyses to get a feel for the data
Example: Normal and day case hospital admissions in England
with a neurological condition.
Data stored in Moodle named HospAdmNeu.dta

Histogram of ordinary hospital admissions with a neurological condition
[Figure: Histogram of the 2012/13 ordinary hospital admissions with a neurological condition among England CCGs. x-axis: ordinary hospital admissions (0 to 15,000); y-axis: frequency (0 to 40).]
Histograms: Number of Classes
• Too few classes and it could be difficult to see any interesting patterns.
• Too many classes and you will end up with only one observation per class.
• Aim is to ensure that the number of classes does not mask interesting patterns
– Rule of thumb: optimal number of classes is approximately log (base 2) of the number of observations

Number of obs    Approx. number of classes
50               5-6
100              6-7
1000             10
10000            13

– Number of classes also depends on choosing ‘nice’ cutpoints
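The rule of thumb can be checked numerically. The following sketch simply evaluates log base 2 for the sample sizes in the table; the function name `approx_classes` is hypothetical, and the result is a starting point only, since 'nice' cutpoints may shift the final choice.

```python
import math

def approx_classes(n_obs):
    """Rule-of-thumb number of histogram classes: log base 2 of the sample size."""
    return math.log2(n_obs)

for n_obs in (50, 100, 1000, 10000):
    # Print the unrounded value so the 5-6 and 6-7 ranges in the table are visible.
    print(n_obs, round(approx_classes(n_obs), 2))
```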

Box plot of ordinary hospital admissions with a neurological condition
[Figure: Boxplot for ordinary hospital admissions in England CCGs in 2012/13; axis marked at 5,000, 10,000 and 15,000.]
The box indicates the median and the two quartiles (1st quartile = 2269, median = 2895 and
3rd quartile = 4013). The vertical lines above and below the box indicate the range of values,
with outliers shown as separate points.
Cumulative Frequency Graph of the ordinary hospital admissions 12/13
[Figure: cumulative frequency graph. x-axis: ordinary hospital admissions (0 to 15,000); y-axis marked from 200000 to 800000.]
Identifying outliers in your data
• Outliers are identified by assessing whether or not they fall within a set
of numerical boundaries called "inner fences" and "outer fences".
• A point that falls outside the data set's inner fences is classified as a
minor outlier, while one that falls outside the outer fences is classified
as a major outlier.
• Multiply the inter-quartile range (Q3-Q1) by 1.5, then add this number to
Q3 and subtract it from Q1 to find the upper and lower boundaries of the inner fences.
• Multiply the inter-quartile range (Q3-Q1) by 3 (instead of 1.5), then add
this number to Q3 and subtract it from Q1 to find the upper and lower
boundaries of the outer fences.
Identifying outliers in your data-example hospital admissions
• Use hospital admissions data HospAdmNeu.dta

• Use summ Ordinary1213, det // to find 1st quartile & 3rd quartile
• IQR = Q3-Q1 = 4013-2269 = 1744,
• 1744 × 1.5 = 2616, 1744 × 3 = 5232
• Boundaries for inner fence : (Q3+2616 , Q1-2616) = (6629, -347)
• Boundaries for outer fence : (Q3+5232 , Q1-5232) = (9245, -2963)
• As hospital admissions can never be negative, we now check how many
data points are outside the inner fence & how many are outside the outer
fence using STATA :
• count if Ordinary1213 > 6626
15
• count if Ordinary1213 > 9245
4
As the data are positively skewed, report the median and inter-quartile range.
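The fence calculation can be written as a small helper. This is a sketch only: the function names `fences` and `classify` are hypothetical, and the quartiles are the values quoted for the admissions data (Q1 = 2269, Q3 = 4013). Recomputing from those quartiles gives an inner fence of (-347, 6629) and an outer fence of (-2963, 9245).

```python
def fences(q1, q3, multiplier):
    """Return (lower, upper) fence: the quartiles pushed out by multiplier x IQR."""
    step = multiplier * (q3 - q1)
    return q1 - step, q3 + step

def classify(value, q1, q3):
    """Label a value as 'major outlier', 'minor outlier' or 'not an outlier'."""
    inner_low, inner_high = fences(q1, q3, 1.5)
    outer_low, outer_high = fences(q1, q3, 3)
    if value < outer_low or value > outer_high:
        return "major outlier"
    if value < inner_low or value > inner_high:
        return "minor outlier"
    return "not an outlier"

# Quartiles quoted for ordinary hospital admissions 2012/13.
Q1, Q3 = 2269, 4013
inner = fences(Q1, Q3, 1.5)
outer = fences(Q1, Q3, 3)
print(inner, outer)
print(classify(7000, Q1, Q3), classify(10000, Q1, Q3))
```

The example values 7000 and 10000 are made up for illustration and are not taken from the dataset.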
Types of data
Quantitative
Continuous                      Discrete
Blood pressure                  Number of children (parity)
Age                             Number of cigarettes per day
Concentration of a pollutant    Counts of deaths in small areas

Categorical
Ordinal (ordered categories)               Nominal (unordered categories)
Grade of breast cancer                     Sex (male/female)
Disease severity (mild/moderate/severe)    Exposed/unexposed
Social class (I, II, III, IV, V)           Ethnicity (white/asian/black/other)
Comments
• Categorical covariate data are often called factors
• Categorical data that take on only two distinct values are said to be dichotomous or binary
• Categorical data are often coded using numerical values (e.g. 0 = NO, 1 = YES)
– statistical packages usually treat numeric data as quantitative unless you explicitly declare it to be categorical
• Limiting factor for any continuous observation is the accuracy of the measurement instrument
Quantitative versus Categorical
• Sometimes we do not need all the detail provided by continuous data, in which case we can transform it into categorical (ordinal) data.
• For example, in a study of the effect of maternal smoking on birthweight, we can recode birthweight as:
≥2.5 kg    0 (normal bwt)
<2.5 kg    1 (low bwt)
• In a study of the effect of air pollution on asthma prevalence, we can recode ambient NO2 concentration as:
<30 µg m⁻³      LOW
30-60 µg m⁻³    MEDIUM
>60 µg m⁻³      HIGH
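The two recodings above might be sketched as follows. The function names are hypothetical; the cut-offs are the ones given on the slide, with the NO2 concentration passed in whatever units the slide's thresholds use.

```python
def recode_birthweight(weight_kg):
    """0 = normal birthweight (2.5 kg or more), 1 = low birthweight (under 2.5 kg)."""
    return 1 if weight_kg < 2.5 else 0

def recode_no2(concentration):
    """Group an ambient NO2 concentration into LOW (<30), MEDIUM (30-60) or HIGH (>60)."""
    if concentration < 30:
        return "LOW"
    if concentration <= 60:
        return "MEDIUM"
    return "HIGH"

# Illustrative values only, not data from the lecture.
print([recode_birthweight(w) for w in (3.4, 2.5, 2.1)])
print([recode_no2(c) for c in (12, 30, 60, 75)])
```

Recoding discards detail, so it is usually worth keeping the original continuous variable alongside the new categorical one.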
Transformations
• It is sometimes helpful to transform data to a different scale, to aid interpretation and/or statistical analysis
• Reasons for transforming data include:
– improved approximation to normality
– reducing skewness
– linearising the relationship between 2 variables
– making multiplicative relationships additive
• Common transformations include:
– Natural logarithm ($y = \log_e(x) \iff x = e^{y}$ or $\exp(y)$, where $e = 2.718\ldots$)
– Power transformations ($y = \sqrt{x}$, $y = x^2$, $y = x^3$, etc.)
Log transformation
• Log transform stretches scale at lower end and compresses it at upper end
• Can only take logs of positive values
[Figure: plot of y = log(x) for x from 0 to 10 (y from -2 to 2), with log(1)=0 and log(e)=1 marked.]
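A short numerical sketch of the stretching and compressing effect: values that grow tenfold each time are very unevenly spaced on the original scale but equally spaced after taking natural logs. The helper `safe_log` is hypothetical and only illustrates the restriction to positive values.

```python
import math

# Each value is ten times the previous one.
values = [1, 10, 100, 1000]
logs = [math.log(v) for v in values]              # natural logarithm (base e)
steps = [b - a for a, b in zip(logs, logs[1:])]   # gaps between consecutive logs

print([round(v, 3) for v in logs])
print([round(s, 3) for s in steps])

def safe_log(x):
    """Natural log that rejects zero and negative input with a clear message."""
    if x <= 0:
        raise ValueError("can only take logs of positive values")
    return math.log(x)
```

The gap from 1 to 10 and the gap from 100 to 1000 both become log(10) on the log scale, which is why a long right tail is pulled in.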
Histograms of CD4 counts in a sample of 537 AIDS patients
[Figure: two histograms of number of patients. Left panel: CD4 count (per cubic mm), x-axis 0 to 1000. Right panel: log CD4 count (per cubic mm), x-axis 0 to 7.]
Class Exercise
Classify the following data as categorical (binary/nominal/ordinal) or numerical (discrete/continuous)

Variable            Description                                Data Type
Age at diagnosis    Age of patients at diagnosis of cancer
Education           0=Primary, 1=Secondary, 3=Tertiary
Ethnicity           1=Black, 2=White, 3=Asian
Smoking status      0=Non-smoker, 1=Smoker

Derived variable
Percentages, ratios, rates & scores: can be treated as numerical in most analyses
Data display in a spreadsheet / Data management
Suppose you are running a study at UCLH aiming to lower the low-density
lipoprotein (LDL) cholesterol levels of patients with cardiovascular
disease. Your study is a double-blind, placebo-controlled RCT.
Patients were randomly assigned to receive evolocumab (either 140 mg
every 2 weeks or 420 mg monthly) or matching placebo as
subcutaneous injections. Of the first 20 patients:
Group: 11 patients received evolocumab and 9 patients received placebo.
Gender: 12 female and 8 male.
Statin use: High intensity – 12 patients
Medium intensity – 6 patients
Low intensity – 2 patients
Using patient IDs 1 to 20 and appropriate codes, display the above information in
a spreadsheet. Ignore the relationships between variables for now.
Data display in a spreadsheet - coding
Group: 1 if patient received evolocumab
0 if patient received placebo
Gender: 1 if patient is female
0 if patient is male
Statin use: 2 for High intensity
1 for Medium intensity
0 for Low intensity
A possible data entry looks like the one below:

Data display in a spreadsheet – looks like -
Patient ID Group Gender Statin-use

1 1 0 0
2 1 1 2
3 1 1 2
4 1 1 1
5 1 0 2
6 1 1 2
7 1 1 2
8 1 0 2
9 1 1 1
10 1 0 2
11 1 1 2
12 0 0 0
13 0 1 1
14 0 0 2
15 0 1 2
16 0 1 2
17 0 0 1
18 0 1 1
19 0 0 1
20 0 1 2
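One way to check that the coded spreadsheet agrees with the study description is to tally each column. The sketch below re-enters the 20 rows from the table as tuples of (patient ID, group, gender, statin use) and counts the codes; the variable names are illustrative.

```python
from collections import Counter

# (patient_id, group, gender, statin_use) copied from the spreadsheet above.
rows = [
    (1, 1, 0, 0), (2, 1, 1, 2), (3, 1, 1, 2), (4, 1, 1, 1), (5, 1, 0, 2),
    (6, 1, 1, 2), (7, 1, 1, 2), (8, 1, 0, 2), (9, 1, 1, 1), (10, 1, 0, 2),
    (11, 1, 1, 2), (12, 0, 0, 0), (13, 0, 1, 1), (14, 0, 0, 2), (15, 0, 1, 2),
    (16, 0, 1, 2), (17, 0, 0, 1), (18, 0, 1, 1), (19, 0, 0, 1), (20, 0, 1, 2),
]

group_counts = Counter(group for _, group, _, _ in rows)      # 1 = evolocumab, 0 = placebo
gender_counts = Counter(gender for _, _, gender, _ in rows)   # 1 = female, 0 = male
statin_counts = Counter(statin for _, _, _, statin in rows)   # 2 = high, 1 = medium, 0 = low

print(dict(group_counts), dict(gender_counts), dict(statin_counts))
```

The tallies should match the brief: 11 evolocumab and 9 placebo, 12 female and 8 male, and 12, 6 and 2 patients on high, medium and low intensity statins.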
Data display in a spreadsheet - coding
Suppose the patients are aged between 50 and 70, with a mean age of 60 years.
Can you now add an extra column for the age of the patients?
In your own study you might have different variables, but you need to present them in a similar
way!
Data display in a spreadsheet – type in extra column Age
Patient ID Group Gender Statin-use Age

1 1 0 0 56
2 1 1 2 52
3 1 1 2 59
4 1 1 1 60
5 1 0 2 63
6 1 1 2 70
7 1 1 2 63
8 1 0 2 58
9 1 1 1 55
10 1 0 2 59
11 1 1 2 68
12 0 0 0 59
13 0 1 1 67
14 0 0 2 69
15 0 1 2 52
16 0 1 2 53
17 0 0 1 61
18 0 1 1 63
19 0 0 1 62
20 0 1 2 51
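A quick check that the Age column satisfies the stated constraints (all ages between 50 and 70, mean age 60) could look like this. The ages are copied from the table in patient ID order; the variable names are illustrative.

```python
# Ages for patient IDs 1 to 20, in order, copied from the spreadsheet above.
ages = [56, 52, 59, 60, 63, 70, 63, 58, 55, 59,
        68, 59, 67, 69, 52, 53, 61, 63, 62, 51]

mean_age = sum(ages) / len(ages)
all_in_range = all(50 <= age <= 70 for age in ages)

print(len(ages), mean_age, min(ages), max(ages), all_in_range)
```

Simple range checks like this are a cheap way to catch typing errors before any analysis.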
Data display in a spreadsheet
Check twice that your coding is correct and make sure you
have not entered any wrong information or mistyped any number
Check that relevant research data match your findings
Check the proportion of people in other research who use statins and have lowered LDL. Is it consistent with yours?
Identify and develop methods for handling missing values
If you are satisfied, the data are ready to cook (for analysis).

Recap
• Need to distinguish between different types of data (continuous, discrete, categorical)
• Most appropriate way of presenting data depends on data type
• Frequency tables are appropriate for all types of data
– For quantitative data, need to think carefully about appropriate choice of
classes/intervals to group data before display
– Keep information in tables to the minimum necessary to convey the
message (story) you want to present (significant figures, number of
variables/categories)
• Bar charts are appropriate for displaying categorical data

• Histograms and box plots are appropriate for quantitative data
Reference :
1. Introduction to medical statistics by Martin Bland : Chapter – 4
2. Medical Statistics by B. Kirkwood & J. Sterne : Chapter-4
3. Practical Statistics for medical research by Douglas Altman : Chapter 6
