Random Variables

Topic 04 ST1232 Statistics for Life Sciences 1 / 92


1 Random Variables
2 Discrete Random Variables
Parameter of a Discrete Probability Distribution: Mean
Parameter of a Discrete Probability Distribution: Variance
3 Continuous Random Variables
Parameters of a Continuous Probability Distribution: Mean and Variance
Parameter of a Continuous Probability Distribution: Quantiles
4 Random Variables And Data Types
5 The Binomial Distribution
Combinations
Assumptions
Example
6 The Poisson Distribution
7 The Normal Distribution
Z-Scores
Examples
8 Index of Definitions and Examples
9 Reading

Random Variables

Definition 1 (Random Variables)


A random variable is a numerical measurement of the outcome of an experiment.
Quite often, the randomness arises from the use of random sampling or a
randomised experiment to gather the data.

The values that a random variable takes correspond to outcomes of the experiment.

If we sample NUS students and measure their heights, we do not know beforehand
precisely what values we will get. The random variable is height.

Examples of Random Variables

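Example 1 (Rolling Two Dice)

Suppose we roll two dice and take the sum. The random variable X maps each outcome in the sample space S to a number:

S        X
(1,1) → 2
(1,2) → 3
(2,1) → 3
(1,3) → 4
(2,2) → 4
···
(6,6) → 12

Example 2 (Flipping A Coin)

Suppose we flip a coin. The random variable Z maps each outcome in S to a number:

S    Z
H → 0
T → 1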

Random Variables and Events

Each value that a random variable takes corresponds to an event on the sample
space S.

For instance, in Example 1, X = 2 corresponds to the event {(1, 1)} and
X = 3 corresponds to the event {(1, 2), (2, 1)}.

It follows that we can now say things such as P(X = 2) and P(X = 3)
without ambiguity.

Each possible outcome in S has a specific probability of occurring.

The probability distribution of a random variable specifies its possible
values and their probabilities.

Sampling From the Population

The probability distribution applies for selecting a subject at random from a population.
Recall that numerical summaries of the population are called parameters.
Numerical summaries of probability distributions are also called parameters since they pertain to the population.
We shall learn about the following summaries of probability distributions:
- mean or proportion,
- variance and standard deviation (sd), and
- percentiles.

How Will We Use Probability Distributions?

In a population, suppose we know the prevalence of a disease to be 0.00001.
Then, if we were running a clinic, we may not want to stock too much of the
medication for this disease, especially if it is expensive.

Consider two populations - people who smoke and those who do not. If we
know that the probability of contracting a particular form of cancer is 0.8 for
smokers and 0.4 for non-smokers, then we can run a campaign to educate the
public in order to reduce the occurrence of this cancer.

The walking time (for me) from LT32 to Kent Ridge MRT is a constant 9.5
minutes. Taking the bus also has an average time of 9.5 minutes, but the
probability that it is longer than 15 minutes is 0.2. What should I do?

Discrete Random Variables

Definition 2 (Discrete Random Variables)


A discrete random variable X takes on a set of separate values, such as {0, 1, 2, 3, . . .}. Its
probability distribution assigns a probability p_x to each possible value x of X.

For each possible x, the probability p_x is between 0 and 1.

The sum of the probabilities for all the possible values equals 1.

We typically use uppercase letters to denote the random variable, and lower
case to denote the values that it has taken on.

Probability Distribution when Tossing Two Coins

Example 3 (Two Coin Tosses)


Suppose we toss a fair coin twice. Let X represent the total number of Heads that
turn up in the two tosses. Then p_x is given by:

Outcome(s)        x    p_x
(T, T)            0    0.25
(H, T), (T, H)    1    0.50
(H, H)            2    0.25

Probability Distribution when Rolling Two Dice

Example 4 (Rolling Two Dice)


Suppose we roll two dice. Let Y represent the sum of the values on these two dice. Then
the probability distribution of Y is given by:

Outcome(s)                 y     p_y
(1,1)                      2     1/36
(1,2),(2,1)                3     2/36
(1,3),(2,2),(3,1)          4     3/36
(1,4),(2,3),(3,2),(4,1)    5     4/36
...                        ...   ...
(5,6),(6,5)                11    2/36
(6,6)                      12    1/36
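
The rows elided from the table can be recovered by enumeration. The following is a minimal Python sketch, assuming two fair six-sided dice; the variable names are illustrative and not part of the course material.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))

# Count how many outcomes give each sum y, then divide by the total.
counts = Counter(a + b for a, b in outcomes)
p_y = {y: Fraction(c, len(outcomes)) for y, c in sorted(counts.items())}

for y, p in p_y.items():
    print(y, p)
```

Note that `Fraction` prints probabilities in lowest terms, so 2/36 is shown as 1/18.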

Using a Bar Plot To Visualise a Probability Distribution
For Discrete Random Variables

A bar plot uses a rectangle for each possible value that X can take on.
The width of each rectangle is identical, but the height is proportional to p_x.

When An Infinite Number of Values Are Possible
Example 5 (Infinite Number of Outcomes)
Suppose that we are interested in the number of dengue cases that arise in
Punggol in August 2014.
We are told this random variable Z, which can take on values 0, 1, 2, 3, etc.,
follows this probability distribution:

\[ p_z = \frac{e^{-20}\, 20^z}{z!} \]

Visualising with a bar plot is still possible, but not for the full range of z:

Utilising the Probability Distribution
In Example 3, what is the probability that we observe at least one Heads?
- P(X ≥ 1) = P(X = 1) + P(X = 2) = 0.50 + 0.25 = 0.75.
In Example 4, what is the probability that we observe a 2 or an 11 for the sum?
- P({Y = 2} ∪ {Y = 11}) = P(Y = 2) + P(Y = 11) = 1/36 + 2/36 = 3/36 = 1/12.
In Example 5, what is the probability that we observe exactly two dengue
cases, given that there was at least one?

\[
\begin{aligned}
P(Z = 2 \mid Z \ge 1) &= \frac{P(\{Z = 2\} \cap \{Z \ge 1\})}{P(Z \ge 1)} \\
&= \frac{P(Z = 2)}{P(Z \ge 1)} \\
&= \frac{P(Z = 2)}{1 - P(Z < 1)} \\
&= \frac{P(Z = 2)}{1 - P(Z = 0)} = 4.12 \times 10^{-7}
\end{aligned}
\]
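
The three calculations above can be checked numerically. This is a small Python sketch using the distributions from Examples 3, 4 and 5; the function name `p_z` and the variable names are illustrative only.

```python
import math
from fractions import Fraction

# Example 3: X = number of Heads in two tosses of a fair coin.
p_x = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
p_at_least_one_head = p_x[1] + p_x[2]

# Example 4: the events {Y = 2} and {Y = 11} are disjoint, so probabilities add.
p_2_or_11 = Fraction(1, 36) + Fraction(2, 36)

# Example 5: p_z = e^(-20) * 20^z / z!
def p_z(z, mu=20.0):
    return math.exp(-mu) * mu**z / math.factorial(z)

# {Z = 2} is contained in {Z >= 1}, so the intersection is just {Z = 2}.
p_two_given_at_least_one = p_z(2) / (1 - p_z(0))

print(p_at_least_one_head, p_2_or_11, p_two_given_at_least_one)
```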

Mean of a Discrete Random Variable

Definition 3 (Mean of Discrete Random Variable)


The mean of a discrete random variable X is denoted by the Greek letter µ, and
is defined to be

\[ \mu = \sum_x x\, p_x \]

Think of it as the sum of Probabilities multiplied by Possibilities.

µ is the average of the values in the sample space, weighted by their
probabilities. Values that are more likely (i.e. have higher probability) have
more weight.

Computing The Mean

Example 6 (Computing the Mean)


Consider the following random variable T , which takes on only two possible
values (0 or 1):
t    p_t
0    0.27
1    0.73
The mean of T is
µ = 0 × 0.27 + 1 × 0.73 = 0.73
Notice that the mean of T is not one of the values in the sample space!

Computing The Mean

Let us return to the two-coin toss in Example 3, where X represents the number
of Heads that turn up.
In this case, the mean is given by

µ = 0 × 0.25 + 1 × 0.50 + 2 × 0.25 = 1

Mean Number of Goals Scored

Example 7 (Cristiano Ronaldo)


Suppose that the number of goals that Cristiano Ronaldo scores in a game is a
random variable R that follows this probability distribution:

r    p_r
0    0.43
1    0.30
2    0.10
3    0.10
4    0.07

The mean number of goals that he scores in a game is

µ = 0(0.43) + 1(0.30) + 2(0.10) + 3(0.10) + 4(0.07) = 1.08
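
The weighted sum in Definition 3 is easy to express in code. The following Python sketch represents each distribution as a dictionary from value to probability; this representation and the function name `mean` are choices made for illustration.

```python
def mean(dist):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())

t_dist = {0: 0.27, 1: 0.73}                               # Example 6
coin = {0: 0.25, 1: 0.50, 2: 0.25}                        # Example 3
goals = {0: 0.43, 1: 0.30, 2: 0.10, 3: 0.10, 4: 0.07}     # Example 7

for name, dist in [("T", t_dist), ("X", coin), ("R", goals)]:
    print(name, round(mean(dist), 4))
```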

About The Mean

The mean of the probability distribution of a random variable X is also
referred to as the expected value of X, and written as E(X).

It is not necessarily what we expect to see for a single observation.

If we obtain a large number of observations from a population that follows
this probability distribution, the sample mean of those observations would be
close to the mean of the probability distribution.

Properties of The Mean
Recall that in Topic 01 (EDA) we considered the sample mean of linear
transformations of the observed data. The identical property holds when
considering the mean of a linear transformation of a random variable X:
- Let X be a random variable with E(X) = µ. Let Y = bX + a be a linear
transformation of X, where b and a are known.
- Then E(Y) = bE(X) + a = bµ + a.
If a1, a2, . . . , an are known values, and X1, X2, . . . , Xn are n random variables
with means known to be µ1, µ2, . . . , µn, then the mean of the linear
combination of the n variables can be obtained as follows:

\[ E(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1 \mu_1 + a_2 \mu_2 + \cdots + a_n \mu_n \]

In particular, if n random variables X1, X2, . . . , Xn are identically distributed
with mean µ, then their average (denoted by X̄) has the same mean as each
individual random variable:

\[ E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \mu \]
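
The linear-transformation property can be verified exactly on a small case. The sketch below reuses the distribution of R from Example 7 with hypothetical constants b = 2 and a = 3 (chosen only for illustration), and compares the mean of Y = bR + a computed directly with the value bµ + a.

```python
from fractions import Fraction as F

# Distribution of R from Example 7, as exact fractions.
goals = {0: F(43, 100), 1: F(30, 100), 2: F(10, 100), 3: F(10, 100), 4: F(7, 100)}

def mean(dist):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())

b, a = 2, 3  # hypothetical constants, not from the slides

# Y = bR + a takes the value b*r + a with the same probability as R = r.
transformed = {b * r + a: p for r, p in goals.items()}

direct = mean(transformed)            # E(Y) from the distribution of Y
via_property = b * mean(goals) + a    # b * E(R) + a

print(direct, via_property)
```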

Risk-Taking or Risk-Averse?

Example 8 (Sure Win Strategy Or Not?)


You have $1000 to invest. Consider the following two investment options
presented to you:
A sure win of $500.
A 0.50 chance of a gain of $1000 and a 0.50 chance of gaining nothing.

Example 9 (Sure Lose Strategy Or Not?)


You have $1000 to invest. Consider the following two investment options
presented to you:
A sure loss of $500.
A 0.50 chance of a loss of $1000 and a 0.50 chance of losing nothing.

Expected Gain / Loss

In Example 8, the expected gain is $500 in both cases. So what is the difference?

In Example 9, the expected loss is $500 in both cases. So what is the difference?

The random variables in the second strategy in both cases involve more
variability, or risk.

Most people choose the sure gain strategy in the 1st scenario and the risky
strategy in the 2nd scenario.

Variance of Discrete Random Variable

Definition 4 (Variance of Discrete Random Variable)


The variance of a discrete random variable X is denoted by the Greek letter σ²,
and is defined to be

\[ \sigma^2 = \sum_x (x - \mu)^2\, p_x \]

The standard deviation of a discrete random variable is σ, the square root of the variance.

σ measures the variability of a random variable about its mean.

When comparing two random variables, the one with the larger standard
deviation has more variability.

An equivalent notation for the variance of a random variable X is Var(X).

Computing σ²

The variance σ² for the random variable R in Example 7 is

\[
\begin{aligned}
\sigma^2 &= 0.43(0 - 1.08)^2 + 0.30(1 - 1.08)^2 + 0.10(2 - 1.08)^2 \\
&\quad + 0.10(3 - 1.08)^2 + 0.07(4 - 1.08)^2 = 1.5536.
\end{aligned}
\]

The standard deviation is given by σ = √1.5536 ≈ 1.246.

We will try to arrive at more intuition about σ when we discuss the Normal
distribution later, but for now, think of σ as a typical deviation of a
random variable from its mean.
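
As a rough check of this arithmetic, the following Python sketch applies Definition 4 to the distribution of R from Example 7, and then to the two investment options in Example 8. The helper names `mean` and `variance` are illustrative, not part of the course material.

```python
import math

def mean(dist):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Probability-weighted average squared deviation from the mean."""
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

goals = {0: 0.43, 1: 0.30, 2: 0.10, 3: 0.10, 4: 0.07}   # Example 7
sure_gain = {500: 1.0}                                  # Example 8, first option
risky_gain = {1000: 0.5, 0: 0.5}                        # Example 8, second option

for name, dist in [("goals", goals), ("sure", sure_gain), ("risky", risky_gain)]:
    var = variance(dist)
    print(name, round(mean(dist), 4), round(var, 4), round(math.sqrt(var), 4))
```

The two options in Example 8 share the same mean but differ in standard deviation (0 versus 500), which is one way to quantify the extra risk of the second option.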

Properties of The Variance
Recall that in Topic 01 (EDA) we considered the sample variance of linear
transformations of the observed data. The identical property holds when
considering the variance of a linear transformation of a random variable X:
- Let X be a random variable with E(X) = µ and variance σ². Let Y = bX + a
be a linear transformation of X, where b and a are known.
- Then Var(Y) = b² Var(X) = b²σ².

If a1, a2, . . . , an are known values, and X1, X2, . . . , Xn are independent random
variables with respective variances σ1², σ2², . . . , σn², then the variance of the
linear combination of the n variables can be obtained as follows:

\[ \mathrm{Var}(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1^2 \sigma_1^2 + a_2^2 \sigma_2^2 + \cdots + a_n^2 \sigma_n^2 \]

In particular, if we take the variance of the mean of n independent and
identically distributed random variables, each with variance σ²,

\[ \mathrm{Var}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^{n} \sigma^2 = \frac{\sigma^2}{n} \]
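
The result for the mean of independent, identically distributed variables can be checked exactly in a small case. This sketch assumes two independent rolls of a fair six-sided die (n = 2), which is an illustrative choice and not an example from the slides, and enumerates all 36 equally likely pairs.

```python
from fractions import Fraction as F
from itertools import product

faces = range(1, 7)
p = F(1, 6)

# Mean and variance of a single fair die.
mu = sum(x * p for x in faces)
sigma2 = sum((x - mu) ** 2 * p for x in faces)

# Distribution of the average of n = 2 independent rolls.
n = 2
pairs = list(product(faces, repeat=n))
w = F(1, len(pairs))  # each ordered pair is equally likely

mean_of_xbar = sum(F(sum(pair), n) * w for pair in pairs)
var_of_xbar = sum((F(sum(pair), n) - mean_of_xbar) ** 2 * w for pair in pairs)

print(mu, sigma2, mean_of_xbar, var_of_xbar)
```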

Continuous Random Variables

Definition 5 (Continuous Random Variables)


A continuous random variable X has possible values that form an interval.
Its probability distribution is specified by a curve that helps determine
probabilities of intervals. This curve is referred to as a probability density
function, or pdf.
Each interval will have probability between 0 and 1. This is the area under
the curve, above that interval.
The total area under the curve is equal to 1.

Visualising a Probability Distribution: From Bar Plots to A Curve
The general idea is this:
Remember how we could use a bar plot to represent the probability
distribution for a discrete random variable?
Now, we have so many possible values that the individual bars cannot be
separated from their “neighbours” and so we do not see their distinct borders.

Systolic Blood Pressure
Example 10 (Systolic Blood Pressure of Males)
Consider randomly selecting a Singaporean male, aged 35 to 44 years old,
and measuring his Systolic Blood Pressure.
Let X be the random variable representing the outcome.
Suppose that the pdf of X is given by the curve below.

Computing Probabilities of Intervals
If we were interested in the probability that the measurement of the
individual selected falls between 60 and 73 mm Hg, we would have to find the
following area under the curve:

This area is 0.232. Hence the required probability is

P(60 ≤ X ≤ 73) = 0.232

Computing Probabilities of Intervals
If we were interested in the probability that the measurement of the
individual selected is greater than 90 mm Hg, we would have to find the
following area under the curve:

This area is 0.202. Hence the required probability is

P(X ≥ 90) = 0.202
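
The slides do not give the formula for the pdf in Example 10. As a sketch only, suppose the curve is a Normal density with mean 80 and standard deviation 12; this is an assumption, chosen because it is bell-shaped and reproduces the areas quoted above. Under that assumption, Python's standard library can compute the two areas:

```python
from statistics import NormalDist

# Assumption (not stated in the slides): X ~ Normal(mean = 80, sd = 12).
X = NormalDist(mu=80, sigma=12)

# Area under the curve between 60 and 73 mm Hg.
p_60_to_73 = X.cdf(73) - X.cdf(60)

# Area under the curve to the right of 90 mm Hg.
p_above_90 = 1 - X.cdf(90)

print(round(p_60_to_73, 3), round(p_above_90, 3))
```

For a continuous random variable, a single point has probability zero, so P(X ≥ 90) and P(X > 90) are the same area.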

Bus Arrival Times
Example 11 (Bus Arrival Times)
Suppose that the time between bus arrivals is a random variable Y with the
following pdf. What is the probability that you have to wait more than 3
minutes?
Suppose 5 minutes pass without any bus. What is the probability that you have to
wait a further 3 minutes or more?

Computing Probabilities of Intervals
The first question is asking for P(Y ≥ 3), which corresponds to the area
below.

This area is found to be 0.687. Hence

P(Y ≥ 3) = 0.687.

Computing Probabilities of Intervals
For the second question, we have to work a little first to identify the intervals
whose probabilities we need.

\[ P(Y \ge 8 \mid Y \ge 5) = \frac{P(\{Y \ge 8\} \cap \{Y \ge 5\})}{P(Y \ge 5)} = \frac{P(Y \ge 8)}{P(Y \ge 5)} \]

The respective probabilities are 0.3679 and 0.5353. Hence the desired
probability is just
P(Y ≥ 8|Y ≥ 5) = 0.3679/0.5353 = 0.687
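
The pdf for Example 11 is not written out in the slides. The quoted areas are consistent with an exponential waiting time with mean 8 minutes, so the sketch below adopts that as an assumption; the function name `survival` is illustrative.

```python
import math

MEAN_WAIT = 8.0  # minutes; assumed exponential model, not stated in the slides

def survival(t, mean=MEAN_WAIT):
    """P(Y >= t) for an exponential waiting time with the given mean."""
    return math.exp(-t / mean)

p_wait_3 = survival(3)                     # first question
p_further_3 = survival(8) / survival(5)    # second question: P(Y >= 8 | Y >= 5)

print(round(p_wait_3, 3), round(survival(8), 4), round(survival(5), 4),
      round(p_further_3, 3))
```

Under this assumed model the two answers agree: having already waited 5 minutes does not change the probability of waiting a further 3 minutes.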
Computing the Area Under a Curve

All the areas on the previous slides were computed using tables or software.

We will avoid using integration to obtain the area under a curve for this class.

In Section 7 on the Normal distribution, we shall introduce the use of tables.

Mean of a Continuous Random Variable
Definition 6 (Mean of Continuous Random Variable)
The mean of a continuous random variable X, which has pdf f(x), is denoted by
the Greek letter µ, and is defined to be

\[ \mu = \int x\, f(x)\, dx \]

As in the discrete case, it is also referred to as the expectation of the
random variable, E(X).
The interpretation of the mean is the same as for the discrete case (see slide
21, "About The Mean").
The properties of the mean are the same as for the discrete case (see slide
22, "Properties of The Mean").
In Example 10, if the pdf formula were given and we were to carry out the
integration, we would find that E(X) = 80.
In Example 11, if the pdf formula were given and we were to carry out the
integration, we would find that E(Y) = 8.
Variance of a Continuous Random Variable

Definition 7 (Variance of Continuous Random Variable)


The variance of a continuous random variable X, which has pdf f(x), is denoted
by the Greek letter σ², and is defined to be

\[ \sigma^2 = \int (x - \mu)^2 f(x)\, dx \]

As in the discrete case, it is also referred to as Var(X).

The properties of the variance are the same as for the discrete case (see slide
28, "Properties of The Variance").
In Example 10, if we were to carry out the integration, we would find that
Var(X) = 144.
In Example 11, if we were to carry out the integration, we would find that
Var(Y) = 64.
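
Although integration is not required in this class, the two definitions can be illustrated numerically. The sketch below assumes that the pdf in Example 11 is exponential with mean 8, f(y) = (1/8) e^(-y/8) for y ≥ 0; this formula is an assumption, not given in the slides. It approximates the integrals with a simple midpoint rule, and the upper limit of 400 is an arbitrary cut-off beyond which the remaining area is negligible.

```python
import math

def pdf(y, mean=8.0):
    """Assumed exponential density with the given mean (not from the slides)."""
    return math.exp(-y / mean) / mean

def integrate(g, lo, hi, steps=200_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (i + 0.5) * h) for i in range(steps))

UPPER = 400.0  # arbitrary cut-off for the infinite upper limit

mu = integrate(lambda y: y * pdf(y), 0.0, UPPER)
sigma2 = integrate(lambda y: (y - mu) ** 2 * pdf(y), 0.0, UPPER)

print(round(mu, 3), round(sigma2, 3))
```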

Quantiles

Let p be a value between 0 and 1.

Definition 8 (Quantile or Percentile)


For a continuous random variable X, the p-th quantile, q_p, is a value such that

P(X ≤ q_p) = p

Quantiles are also known as percentiles: q_p is the (100p)-th percentile.

If we were to try to explain it in terms of the pdf curve, then q_p is the point
on the x-axis such that the area under the curve and to the left of q_p is
equal to p.

Quantiles of A Bell-Shaped Distribution
Consider the pdf in Example 10.

The blue area in the top diagram is equal to 0.0478, and the blue area ends
at 60 mm Hg. Hence

q_0.0478 = 60

The blue area in the middle diagram is equal to 0.280, and the blue area ends
at 73 mm Hg. Hence

q_0.280 = 73

The blue area in the bottom diagram is equal to 0.798, and the blue area ends
at 90 mm Hg. Hence

q_0.798 = 90

Quantiles In The Bus Arrivals Example
Consider the pdf in Example 11.

On top, the blue area is equal to 0.313, and it ends at 3 mins. Hence

q_0.313 = 3

In the middle, the blue area is equal to 0.465, and it ends at 5 mins. Hence

q_0.465 = 5

At the bottom, the blue area is equal to 0.632, and it ends at 8 mins. Hence

q_0.632 = 8

Using Quantiles
In this course, we must be comfortable using probability tables (not integration)
to do the following:
Given a value x, what is the area under the curve to the left of it?
Given a probability p, what is the point x such that the area to the left of x
(under the curve) is p?
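
Software can answer both questions as well as tables can. The following Python sketch reuses two modelling assumptions that are not stated in the slides: a Normal(80, 12) curve for Example 10 and an exponential curve with mean 8 for Example 11. The helper `exp_quantile` is a hypothetical name.

```python
import math
from statistics import NormalDist

# Assumed model for Example 10: X ~ Normal(mean = 80, sd = 12).
X = NormalDist(mu=80, sigma=12)

area_left_of_90 = X.cdf(90)     # value -> area to its left
q_0798 = X.inv_cdf(0.798)       # area -> value (the 0.798 quantile)

# Assumed model for Example 11: exponential with mean 8, so
# P(Y <= q) = 1 - exp(-q / 8). Solving for q gives the quantile.
def exp_quantile(p, mean=8.0):
    return -mean * math.log(1 - p)

print(round(area_left_of_90, 3), round(q_0798, 2))
print([round(exp_quantile(p), 2) for p in (0.313, 0.465, 0.632)])
```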

Random Variables and Data Types

Combinations

Example 12 (Drawing Cards from a Deck)


In how many ways can we draw 2 cards from a well-shuffled deck of 52?
How many different combinations of 2 Aces can be drawn from a deck of 52?

This involves counting combinations of cards selected from a group of 52. The
order in which the cards were selected does not matter.

Combinations
In Example 12, regarding the first question:
- There are 52 ways of choosing the first card.
- Having done so, there are 51 ways of choosing the second card.
- Hence, using the multiplicative rule for counting (slide 24 of topic
03-probability), there are 52 × 51 = 2652 ordered ways of choosing the two cards.
- However, this double-counts each set of 2 cards selected. Hence the total
number of combinations of 2 cards from 52 is 2652/2 = 1326.
- There are 1326 ways in which 2 cards can be drawn from a deck of 52.
Regarding the number of ways in which 2 Aces can be chosen:
- There are 4 ways of choosing the first card, and then 3 ways of choosing the
second.
- Hence there are 4 × 3 = 12 ways of choosing 2 Aces, if we were to
differentiate the order in which these were drawn.
- Again this double-counts. Thus there are only 12/2 = 6 combinations of 2
Aces from the 4 in the deck.

Combinations

Definition 9 (Combinations)
The number of combinations of n things, taken k at a time, is

\[ C^n_k = \frac{n!}{k!\,(n-k)!} = \frac{n(n-1)(n-2) \times \cdots \times (n-k+1)}{k!} \]

It represents the number of ways of selecting k items out of n, when the
order of selection does not matter.
It is useful to remember the following corner cases:
- \(C^n_1 = n\)
- \(C^n_n = 1\)
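
Python's standard library provides this count directly as `math.comb(n, k)` (available from Python 3.8). The short sketch below checks the card counts from Example 12 and the two corner cases; the card labels in `aces` are illustrative.

```python
import math
from itertools import combinations

# Ordered draws of 2 cards, then divide by 2 to ignore the order.
ordered_pairs = 52 * 51
two_cards = math.comb(52, 2)

# List the unordered pairs of Aces explicitly.
aces = ["A-spades", "A-hearts", "A-diamonds", "A-clubs"]
two_aces = list(combinations(aces, 2))

print(ordered_pairs // 2, two_cards, len(two_aces))
print(math.comb(7, 1), math.comb(7, 7))  # corner cases with n = 7
```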

Counting Combinations

Example 13 (Selecting Patients for a Trial)


Suppose we have a new developmental drug for schizophrenia.
There are 6 eligible patients at NUH, but we only have permission to
administer the drug to 3 patients. Hence we have to randomly select three of
them. How many such selections are there?
Suppose that the 6 patients consist of 4 males and 2 females. How many
selections are possible if both females must be included in the sample?

Combinations of Patients

The number of possible outcomes is a combination of k = 3 patients from
n = 6; hence the total number of combinations is

\[ C^6_3 = \frac{6!}{3!\,3!} = 20 \]

If both females are already in the sample, then we are left to choose only one
male from the group of 4.
Thus we only have to pick a combination of k = 1 male from the n = 4 male
patients. This value is

\[ C^4_1 = 4 \]
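
With only 6 patients, the selections are few enough to list. The following sketch enumerates them with `itertools.combinations`; the patient labels M1 to M4 and F1, F2 are hypothetical.

```python
from itertools import combinations

# Hypothetical labels for the 4 male and 2 female patients.
patients = ["M1", "M2", "M3", "M4", "F1", "F2"]

# Every unordered selection of 3 patients out of 6.
all_selections = list(combinations(patients, 3))

# Keep only the selections that include both females.
with_both_females = [s for s in all_selections if "F1" in s and "F2" in s]

print(len(all_selections), len(with_both_females))
for selection in with_both_females:
    print(selection)
```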

Probability Distribution for Counts with Binary Data

In many applications, each observation is binary: it has only two possible
outcomes.

For instance, a person may:
- accept or decline a credit card offer from a bank.
- have, or not have, health insurance.
- vote for or against PAP.

Under certain conditions, the total number of cases with the outcome of
interest follows a Binomial distribution.

The Binomial Distribution

Definition 10 (Binomial Distribution)


Suppose we have
n trials, each of which has two possible outcomes. The outcome of interest is
called a success and the other outcome is called a failure.
Each trial has the same probability of success p.
The n trials are independent.
The binomial random variable X is the total number of successes in the n trials.

X can take on values 0, 1, 2, . . . , n.

We shall denote this distribution as Bin(n, p).

A Bin(1, p) distribution is also referred to as a Bernoulli trial or a Bernoulli
distribution with success probability p.

Examples of Binomial Distribution

Suppose we plant 3 seeds, and each of them will germinate independently
with probability 0.6. Then Z, a random variable for the total number of seeds
that do germinate, follows a Bin(3, 0.6) distribution.

Suppose we sample White Blood Cells (WBC) from an individual, and test whether
each one is a lymphocyte. For this individual, each sampled cell is a lymphocyte
with probability 0.2. If we were to obtain 10 WBC and set W to be the total number of
lymphocytes obtained, then W ∼ Bin(10, 0.2).

The probability of a woman developing breast cancer over a lifetime is 1/9.
Suppose we were to sample 50 women from Singapore independently and
follow them over their lifetime. If we set Y to be the total number of women
who develop breast cancer, then Y ∼ Bin(50, 1/9).

Binomial Formula

Suppose that X follows a Bin(n, p) distribution.

Then the probability of x successes in these n trials is

\[ P(X = x) = C^n_x\, p^x (1 - p)^{n - x} \]

for x = 0, 1, 2, . . . , n.

Note the presence of the combinations term.

The mean of X is E (X ) = np.

The variance of X is Var (X ) = np(1 − p).
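
The formula and the two summaries can be checked against each other. This sketch implements the Binomial formula with `math.comb` and, for the seed example Bin(3, 0.6), computes the mean and variance directly from the probabilities; the function name `binom_pmf` is illustrative.

```python
import math

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p), using the Binomial formula."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 3, 0.6  # the seed germination example
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

# Mean and variance from the definitions for a discrete random variable.
mean = sum(x * px for x, px in enumerate(pmf))
var = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

print([round(px, 3) for px in pmf])
print(round(mean, 3), round(var, 3))
```

The computed mean and variance should agree with np and np(1 − p) up to floating-point rounding.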

Using the Binomial Probability Distribution

Example 14 (DNA Sequence Alignment)


The DNA of an organism consists of very long sequences of nucleotides.
There are four different nucleotides, which are represented by the letters A,
T, G and C.
Within a species, DNA sequences change over the course of many
generations. Hence it is possible that two sequences, separately obtained, are in
truth derived from the same ancestor.
Consider the following two short sequences of length 10:
G G A G A C T G T A (reference)
|   |     |     | |
G A A C G C C C T A (query)
We wish to gauge if the two sequences show significant similarity.
What is the probability of the above outcome (i.e., 5 matches), if the query
sequence was unrelated to (i.e., random w.r.t.) the reference sequence?

Probability of Five Matches

Picture this as 10 trials, one at each nucleotide position.


The outcome of interest is a match, which would happen with probability
0.25 if the two organisms were unrelated.
In essence, we need to find the probability of 5 matches out of 10.
Hence the probability that a random sequence of nucleotides gives rise to a
match of 5 positions is
 5  5
1 3
C510 = 0.0584
4 4

What would you conclude about the two sequences? Related or unrelated?
- In practice, we use the probability of 5 or more matches instead of just the
probability of 5 matches to make our conclusion.
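
The calculation above can be checked numerically. The following is a minimal sketch, assuming a Bin(10, 0.25) model for the number of matches; the helper name `binomial_pmf` is illustrative and not part of the course material. It computes both the probability of exactly 5 matches and the probability of 5 or more matches.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Return P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# 10 nucleotide positions; a match has probability 1/4 if the sequences are unrelated.
n, p = 10, 0.25

p_exactly_5 = binomial_pmf(5, n, p)
p_at_least_5 = sum(binomial_pmf(k, n, p) for k in range(5, n + 1))

print(f"P(X = 5)  = {p_exactly_5:.4f}")
print(f"P(X >= 5) = {p_at_least_5:.4f}")
```

The second quantity is the one the slide recommends for drawing a conclusion, since it accounts for outcomes at least as extreme as the one observed.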

Topic 04 ST1232 Statistics for Life Sciences 64 / 92




The Poisson Distribution

This is the second most frequently used discrete distribution after the Binomial
distribution. It is usually associated with rare events.

Definition 11 (Poisson Distribution)


A random variable X follows a Poisson distribution with parameter µ if

\[ P(X = k) = \frac{e^{-\mu} \mu^k}{k!}, \quad k = 0, 1, 2, \ldots \]

where e is approximately 2.71828, λ is the expected number of events per unit time,
and µ = λt is the expected number of events over a time period t.

Topic 04 ST1232 Statistics for Life Sciences 66 / 92


Example of Poisson distribution

Example 15 (Infectious Disease)


Consider a typhoid-fever example. Suppose the number of deaths from typhoid
fever over a 1-year period is Poisson distributed with parameter µ = 4.6.
What is the probability distribution of the number of deaths over a 6-month
period? Over a 3-month period?

Let X be the number of deaths in 6 months. Find the probability distribution
of X.
Let Y be the number of deaths in 3 months. Find the probability distribution
of Y.

Topic 04 ST1232 Statistics for Life Sciences 67 / 92


Example of Poisson distribution (cont)
For X, because µ = 4.6 and t = 1 year, it follows that λ = 4.6 per year. For a 6-month
period (t = 0.5 years), we have µ = 4.6 × 0.5 = 2.3. Therefore,

\begin{align*}
P(X = 0) &= e^{-2.3} = 0.100 \\
P(X = 1) &= \frac{2.3}{1!} e^{-2.3} = 0.231 \\
P(X = 2) &= \frac{2.3^2}{2!} e^{-2.3} = 0.265 \\
P(X = 3) &= \frac{2.3^3}{3!} e^{-2.3} = 0.203 \\
P(X = 4) &= \frac{2.3^4}{4!} e^{-2.3} = 0.117 \\
P(X = 5) &= \frac{2.3^5}{5!} e^{-2.3} = 0.054 \\
P(X \ge 6) &= 1 - (0.100 + 0.231 + 0.265 + 0.203 + 0.117 + 0.054) = 0.03
\end{align*}

Topic 04 ST1232 Statistics for Life Sciences 68 / 92


Example of Poisson distribution (cont)

For Y, because µ = 4.6 and t = 1 year, it follows that λ = 4.6 per year. For a 3-month
period (t = 0.25 years), we have µ = 4.6 × 0.25 = 1.15. Therefore,

\begin{align*}
P(Y = 0) &= e^{-1.15} = 0.317 \\
P(Y = 1) &= \frac{1.15}{1!} e^{-1.15} = 0.364 \\
P(Y = 2) &= \frac{1.15^2}{2!} e^{-1.15} = 0.209 \\
P(Y = 3) &= \frac{1.15^3}{3!} e^{-1.15} = 0.08 \\
P(Y \ge 4) &= 1 - (0.317 + 0.364 + 0.209 + 0.08) = 0.03
\end{align*}
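
The two tables for X and Y can be reproduced with a few lines of code. This is a minimal sketch, assuming the rate λ = 4.6 deaths per year from the example; the function names `poisson_pmf` and `poisson_table` are illustrative only.

```python
from math import exp, factorial


def poisson_pmf(k: int, mu: float) -> float:
    """Return P(X = k) for X ~ Poisson(mu)."""
    return exp(-mu) * mu**k / factorial(k)


def poisson_table(mu: float, k_max: int) -> tuple[list[float], float]:
    """Return ([P(X=0), ..., P(X=k_max)], P(X > k_max))."""
    probs = [poisson_pmf(k, mu) for k in range(k_max + 1)]
    return probs, 1 - sum(probs)


rate_per_year = 4.6  # lambda, expected deaths per year

# (label, length of period in years, largest k listed individually)
for label, years, k_max in (("X, 6 months", 0.5, 5), ("Y, 3 months", 0.25, 3)):
    mu = rate_per_year * years  # mu = lambda * t
    probs, tail = poisson_table(mu, k_max)
    print(f"{label}: mu = {mu:.2f}")
    for k, prob in enumerate(probs):
        print(f"  P(= {k})  = {prob:.3f}")
    print(f"  P(>= {k_max + 1}) = {tail:.3f}")
```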

Topic 04 ST1232 Statistics for Life Sciences 69 / 92


Mean and Variance of the Poisson distribution

For a Poisson distribution with parameter µ, the mean and the variance are both equal
to µ. This property can help to identify whether a distribution is Poisson.

Example 16 (Occupational Health)


A public health issue arose concerning the possible carcinogenic potential of food
ingredients containing ethylene dibromide (EDB). In some instances foods were
removed from public consumption if they were shown to have excessive quantities
of EDB. A previous study had looked at mortality of 161 white male employees of
two plants in Texas and Michigan who were exposed to EDB over the time period
1940-1975. Seven deaths from cancer were observed among these employees. For
this time period, 5.8 cancer deaths were expected as calculated from overall
mortality rates for U.S. white men. Was the observed number of cancer deaths
excessive in this group?

Topic 04 ST1232 Statistics for Life Sciences 70 / 92


Example 16

The expected number of cancer deaths from U.S. white male mortality rates
is µ = 5.8.
Let X be the number of deaths from cancer among the employees in the
study, then X is a Poisson random variable with µ = 5.8.
We need to find P(X ≥ 7).
P(X ≥ 7) = 1 − P(X ≤ 6), where $P(X = k) = e^{-5.8} (5.8)^k / k!$
P(X ≥ 7) = 1 − 0.638 = 0.362
Clearly, the observed number of cancer deaths in the given study is not
excessive in this group.
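
The tail probability in this example is a sum of seven Poisson terms, which is tedious by hand. A short sketch of the computation, assuming X ∼ Poisson(5.8) as stated above (the variable names are illustrative):

```python
from math import exp, factorial

mu = 5.8  # expected number of cancer deaths

# P(X <= 6) is the sum of the Poisson probabilities for k = 0, 1, ..., 6.
p_at_most_6 = sum(exp(-mu) * mu**k / factorial(k) for k in range(7))
p_at_least_7 = 1 - p_at_most_6

print(f"P(X <= 6) = {p_at_most_6:.3f}")
print(f"P(X >= 7) = {p_at_least_7:.3f}")
```

A probability this large says that seven or more deaths would be quite ordinary if the employees shared the general mortality rate.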

Topic 04 ST1232 Statistics for Life Sciences 71 / 92


Poisson Approximation to the Binomial Distribution

The Binomial distribution with large n and small p can be accurately approximated by a
Poisson distribution with parameter µ = np.
The mean of the Binomial distribution is np and its variance is np(1 − p). For small p,
(1 − p) is approximately equal to 1, and thus np(1 − p) ≈ np;
that is, the mean and variance are almost equal, as they are for a Poisson distribution.
The Binomial distribution involves the expressions $C_k^n$ and $(1 - p)^{n-k}$, which are
cumbersome for large n.
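
To see how close the approximation can be, the sketch below compares the two probability functions side by side. The values n = 1000 and p = 0.002 are hypothetical, chosen only to illustrate "large n, small p"; they do not come from the slides.

```python
from math import comb, exp, factorial

n, p = 1000, 0.002  # hypothetical values: large n, small p
mu = n * p          # Poisson parameter, mu = np


def binomial_pmf(k: int) -> float:
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k: int) -> float:
    """P(X = k) for X ~ Poisson(mu)."""
    return exp(-mu) * mu**k / factorial(k)


max_diff = 0.0
for k in range(11):
    b, q = binomial_pmf(k), poisson_pmf(k)
    max_diff = max(max_diff, abs(b - q))
    print(f"k = {k:2d}  binomial = {b:.5f}  poisson = {q:.5f}")

print(f"largest absolute difference for k = 0..10: {max_diff:.5f}")
print(f"mean np = {n * p:.3f}, variance np(1 - p) = {n * p * (1 - p):.3f}")
```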

Topic 04 ST1232 Statistics for Life Sciences 72 / 92




The Normal Distribution

Definition 12 (Normal Distribution)


The Normal distribution is symmetric, bell-shaped and characterised by its
mean µ and its variance σ².

It is also known as the Gaussian distribution.

If X is a random variable that follows a Normal distribution with mean µ and
variance σ², we write X ∼ N(µ, σ²).

Topic 04 ST1232 Statistics for Life Sciences 74 / 92


Properties of the Normal Distribution
The highest point of the Normal distribution curve is at x = µ.
The Normal distribution is symmetric about µ. This implies two things:
- If x > 0, the area to the left of µ − x is the same as the area to the right of
µ + x.
- $q_{1-p} = 2\mu - q_p$

Topic 04 ST1232 Statistics for Life Sciences 75 / 92


Defining the N(µ, σ²) Distribution

Normal distributions with a larger σ² naturally have a larger spread of values.

Topic 04 ST1232 Statistics for Life Sciences 76 / 92


An Empirical Guide to Normal Probabilities

The plot on the left gives a useful rule of thumb for observations from a
N(µ, σ²) distribution.
For instance, approximately 68% of the observations from a N(0, 1)
distribution would fall between −1 and 1.
Similarly, the area under a N(3, 4) distribution between −1 and 7 (that is,
within two standard deviations of the mean, since σ = 2) would be
approximately 0.95.
We can use this rule of thumb to gauge the width of an interval that contains
95% of the data, or to compare two graphs in terms of their variability.
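
The two figures quoted in this rule of thumb can be checked with Python's standard library. This is a minimal sketch using `statistics.NormalDist`, which takes the standard deviation σ (not the variance σ²) as its second argument.

```python
from statistics import NormalDist

# N(0, 1): probability of falling between -1 and 1 (one standard deviation).
standard = NormalDist(mu=0, sigma=1)
within_one_sd = standard.cdf(1) - standard.cdf(-1)

# N(3, 4): variance 4 means sigma = 2, so (-1, 7) is mu +/- 2 sigma.
shifted = NormalDist(mu=3, sigma=2)
within_two_sd = shifted.cdf(7) - shifted.cdf(-1)

print(f"N(0, 1) between -1 and 1: {within_one_sd:.4f}")
print(f"N(3, 4) between -1 and 7: {within_two_sd:.4f}")
```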

Topic 04 ST1232 Statistics for Life Sciences 77 / 92


Linear Operations on Normal Random Variables
If $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$, and X and Y are independent
random variables, then
- $X + Y \sim N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$
- $X - Y \sim N(\mu_X - \mu_Y, \sigma_X^2 + \sigma_Y^2)$
- The addition could be of more than just two terms. In particular, if
$X_1, X_2, \ldots, X_n$ are independent $N(\mu, \sigma^2)$ random variables, then

\[ \bar{X} = \frac{X_1 + \cdots + X_n}{n} \sim N(\mu, \sigma^2 / n) \]

For any real numbers a and b, if $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ are independent, then

\[ aX + bY \sim N(a\mu_X + b\mu_Y, a^2 \sigma_X^2 + b^2 \sigma_Y^2) \]

In particular, if X ∼ N(µ, σ²) and we scale X by a = 1/σ and shift it by the constant b = −µ/σ, then

\[ Z = \frac{X - \mu}{\sigma} \sim N(0, 1) \]

Whenever we compute Z = (X − µ)/σ, we refer to Z as the Z-score of X.
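
These rules can be illustrated by simulation. The sketch below is not a proof; it draws independent samples from two hypothetical distributions, X ∼ N(1, 4) and Y ∼ N(2, 9), and checks that X − Y has mean close to 1 − 2 = −1 and variance close to 4 + 9 = 13, and that the Z-scores of X have mean close to 0 and variance close to 1. The parameter values, seed and sample size are arbitrary choices.

```python
import random
from statistics import fmean, pvariance

random.seed(1232)  # fixed seed so the run is reproducible
n_draws = 50_000

# random.gauss takes the standard deviation, not the variance.
xs = [random.gauss(1.0, 2.0) for _ in range(n_draws)]  # X ~ N(1, 4)
ys = [random.gauss(2.0, 3.0) for _ in range(n_draws)]  # Y ~ N(2, 9)

diffs = [x - y for x, y in zip(xs, ys)]      # samples of X - Y
z_scores = [(x - 1.0) / 2.0 for x in xs]     # Z = (X - mu) / sigma

print(f"X - Y: mean = {fmean(diffs):.3f}, variance = {pvariance(diffs):.3f}")
print(f"Z:     mean = {fmean(z_scores):.3f}, variance = {pvariance(z_scores):.3f}")
```

Note that the variances add even for the difference X − Y; only the means are subtracted.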
Topic 04 ST1232 Statistics for Life Sciences 78 / 92


Normal(0,1) Distribution Tables

Use the Normal Table to compute the values of the following, where Z ∼ N(0, 1).
P(Z > 1)
P(Z ≤ 2.3)
P(Z > 1.18)
P(Z > 1.18|Z > 0)
P(Z > 1.96 ∪ Z < −1.96)
$q_{0.32}$
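
After working through the table by hand, the answers can be checked in software. The following sketch uses `statistics.NormalDist` from the Python standard library in place of the printed table; the variable names are illustrative. The conditional probability uses P(A | B) = P(A and B) / P(B), and here the event Z > 1.18 already implies Z > 0.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal, N(0, 1)

p_gt_1 = 1 - Z.cdf(1)                          # P(Z > 1)
p_le_2_3 = Z.cdf(2.3)                          # P(Z <= 2.3)
p_gt_1_18 = 1 - Z.cdf(1.18)                    # P(Z > 1.18)
p_cond = p_gt_1_18 / (1 - Z.cdf(0))            # P(Z > 1.18 | Z > 0)
p_two_tail = (1 - Z.cdf(1.96)) + Z.cdf(-1.96)  # P(Z > 1.96 or Z < -1.96)
q_032 = Z.inv_cdf(0.32)                        # 0.32 quantile

for name, value in [
    ("P(Z > 1)", p_gt_1),
    ("P(Z <= 2.3)", p_le_2_3),
    ("P(Z > 1.18)", p_gt_1_18),
    ("P(Z > 1.18 | Z > 0)", p_cond),
    ("P(Z > 1.96 or Z < -1.96)", p_two_tail),
    ("q_0.32", q_032),
]:
    print(f"{name:26s} = {value:.4f}")
```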

Topic 04 ST1232 Statistics for Life Sciences 80 / 92


Using Z -scores

To summarise:
If we are given a value x of X ∼ N(µ, σ²) and wish to find a probability, convert it to
the Z-score z = (x − µ)/σ and use the table to obtain the probability.

If we are given a probability p and asked to find $q_p$ for an X ∼ N(µ, σ²), find
the corresponding quantile $q^*$ for a Z ∼ N(0, 1) and then compute
$q_p = q^* \sigma + \mu$.

Topic 04 ST1232 Statistics for Life Sciences 81 / 92


Using Z -scores to Find Outliers
If the Z-score of an observation from X ∼ N(µ, σ²) is greater than 3 or less
than −3, this suggests that the observation is very different from the rest of the data.
For bell-shaped histograms, this is another way of identifying outliers.
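
As an illustration of this rule, the sketch below flags observations whose Z-score lies outside (−3, 3). The mean, standard deviation and observations are hypothetical values invented for the example, and µ and σ are assumed to be known rather than estimated from the data.

```python
mu, sigma = 120.0, 10.0                         # hypothetical known mean and SD
observations = [118.0, 131.0, 95.0, 152.0, 124.0]  # hypothetical data

flagged = []
for x in observations:
    z = (x - mu) / sigma  # Z-score of this observation
    if abs(z) > 3:
        flagged.append(x)
    print(f"x = {x:6.1f}  z = {z:+.2f}")

print("possible outliers:", flagged)
```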

Topic 04 ST1232 Statistics for Life Sciences 82 / 92




SAT Scores

Example 17 (SAT Scores)


The SAT is an entrance exam used in the US. Suppose that scores (X) from this
test follow a N(µ = 1500, σ² = 90000) distribution. Find the following:
P(X ≤ 1800).
P(X ≥ 1630).

Topic 04 ST1232 Statistics for Life Sciences 84 / 92


SAT Scores (1)

Convert to Z -score:

P(X ≤ 1800) = P(Z ≤ (1800 − 1500)/300) = P(Z ≤ 1)

From the table, this value is 0.841.

Topic 04 ST1232 Statistics for Life Sciences 85 / 92


SAT Scores (2)

Convert to Z -score:

P(X ≥ 1630) = P(Z ≥ (1630 − 1500)/300) = P(Z ≥ 0.43)

From the table, this value is 0.3336.
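
Both SAT probabilities can be reproduced in software. In this sketch, `statistics.NormalDist` stands in for the printed table, with σ = √90000 = 300. Because the table method rounds the Z-score to 0.43, the second answer differs slightly in the third decimal place from the value obtained without rounding; both are shown.

```python
from math import sqrt
from statistics import NormalDist

sat = NormalDist(mu=1500, sigma=sqrt(90000))  # sigma = 300
Z = NormalDist()                              # standard Normal, N(0, 1)

p_le_1800 = sat.cdf(1800)              # P(X <= 1800)
p_ge_1630 = 1 - sat.cdf(1630)          # P(X >= 1630), no rounding of z
p_ge_1630_table = 1 - Z.cdf(0.43)      # same, with z rounded to 0.43 as in the table

print(f"P(X <= 1800)             = {p_le_1800:.4f}")
print(f"P(X >= 1630)             = {p_ge_1630:.4f}")
print(f"P(Z >= 0.43) (table z)   = {p_ge_1630_table:.4f}")
```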

Topic 04 ST1232 Statistics for Life Sciences 86 / 92


Screening for Hypertension

Example 18 (Hypertension Screening)


Suppose that you run a clinic. Patients who come in have their blood pressure
measured. The random variable X of this measurement for hypertensive
patients follows a Normal distribution with µ = 95 and σ = 12.
You wish to develop a screening test for hypertension, which has sensitivity
0.90. What is the cut-off pressure you should use?
It takes 5 minutes to serve a normal patient (with low or normal blood
pressure), but it takes 15 minutes if a patient has been detected to have high
blood pressure. What is the mean time that a nurse will spend taking the
blood pressure of a hypertensive patient, when the above screening test is
implemented?

Topic 04 ST1232 Statistics for Life Sciences 87 / 92


Screening for Hypertension

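A sensitivity of 0.90 means that 90% of hypertensive patients must have a measurement
above the cut-off c, that is, P(X > c) = 0.90. Hence c is the 0.10 quantile $q_{0.10}$
of X ∼ N(95, 144). This is equal to 79.6 mm Hg.
If we let Y be the time that a nurse spends on a hypertensive patient, then

P(Y = 5) = 0.1 and P(Y = 15) = 0.9

Hence E(Y) = 5(0.1) + 15(0.9) = 14 minutes.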
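
The cut-off and the mean service time can be computed as follows. This is a minimal sketch, assuming X ∼ N(95, 144) for hypertensive patients as in the example, and using `statistics.NormalDist.inv_cdf` to obtain the quantile instead of the table.

```python
from statistics import NormalDist

hypertensive = NormalDist(mu=95, sigma=12)  # variance 144 means sigma = 12

sensitivity = 0.90
# P(X > cutoff) = 0.90 is the same as P(X <= cutoff) = 0.10.
cutoff = hypertensive.inv_cdf(1 - sensitivity)

# A hypertensive patient is detected (15 minutes) with probability 0.90
# and missed (5 minutes) with probability 0.10.
p_detected = 1 - hypertensive.cdf(cutoff)
mean_time = 5 * (1 - p_detected) + 15 * p_detected

print(f"cut-off   = {cutoff:.1f} mm Hg")
print(f"mean time = {mean_time:.1f} minutes")
```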

Topic 04 ST1232 Statistics for Life Sciences 88 / 92


Normal Approximation to the Binomial Distribution

When n is large, the Bin(n, p) distribution is difficult to work with, and an approximation
is easier to use than the exact binomial distribution.
If n is moderately large and p is near 0 or near 1, then the Bin(n, p) distribution will be
very positively or negatively skewed, respectively.
If n is moderately large and p is not too extreme, then the Bin(n, p) distribution tends to
be symmetric and is well approximated by a normal distribution
N(np, np(1 − p)).
The normal distribution with mean np and variance np(1 − p) can be used to
approximate a binomial distribution with parameters n and p when
np(1 − p) ≥ 5.
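
The quality of the approximation can be examined numerically. The sketch below uses hypothetical values n = 100 and p = 0.5, for which np(1 − p) = 25 ≥ 5, and compares the exact binomial probability P(40 ≤ X ≤ 60) with the area under the N(np, np(1 − p)) curve between 40 and 60. No further adjustment is made for the fact that X is discrete, so a small discrepancy is expected.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.5  # hypothetical values chosen for illustration
variance = n * p * (1 - p)
print(f"np(1 - p) = {variance:.1f} (rule of thumb requires at least 5)")

# Exact binomial probability P(40 <= X <= 60).
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(40, 61))

# Normal approximation with mean np and variance np(1 - p).
approx_dist = NormalDist(mu=n * p, sigma=sqrt(variance))
approx = approx_dist.cdf(60) - approx_dist.cdf(40)

print(f"exact binomial probability : {exact:.4f}")
print(f"normal approximation       : {approx:.4f}")
```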

Topic 04 ST1232 Statistics for Life Sciences 89 / 92


Topic 04 ST1232 Statistics for Life Sciences 90 / 92
Index

Def 01: Random Variable, 4                 Eg 04: Two Dice, 12
Def 02: Discrete Random Variable, 10       Eg 05: Infinite Number of Outcomes, 14
Def 03: Mean, Discrete, 17                 Eg 06: Computing Mean, 18
Def 04: Discrete, 26                       Eg 07: Ronaldo, 20
Def 05: Continuous Random Variable, 30     Eg 08: Sure Win or Not?, 24
Def 06: Mean, Continuous, 40               Eg 09: Sure Loss or Not?, 24
Def 07: Quantile, 43                       Eg 10: Systolic Blood Pressure, 32
Def 07: Variance, Continuous, 41           Eg 11: Bus Arrival Times, 35
Def 08: Combinations, 54                   Eg 12: Drawing Cards, 52
Def 09: Binomial Distribution, 59          Eg 13: Patient Selection, 55
Def 10: Poisson Distribution, 66           Eg 14: DNA Sequence Alignment, 63
Def 11: Normal Distribution, 74            Eg 15: Infectious Disease, 67
Eg 01: Two Dice, 5                         Eg 16: Occupational Health, 70
Eg 02: Coin Flip, 5                        Eg 17: SAT Scores, 84
Eg 03: Coin Toss, 11                       Eg 18: Hypertension Screening, 87

Topic 04 ST1232 Statistics for Life Sciences 91 / 92


Further Reading

Statistics: The Art and Science of Learning from Data, 3rd edition
Alan Agresti and Christine Franklin.
Read: Chapter 6

Topic 04 ST1232 Statistics for Life Sciences 92 / 92
