Introduction to Probabilities with Scilab
Michael Baudin
June 2011
Abstract
In this article, we present an introduction to probabilities with Scilab.
Numerical experiments are based on Scilab. The first section presents discrete
random variables and conditional probabilities. In the second section, we
present combinatorics problems, tree diagrams and Bernoulli trials. In the
third section, we present the simulation of random processes with Scilab. Coin
simulations are presented, as well as the Galton board.
Contents

1 Discrete random variables 4
  1.1 Sets 4
  1.2 Distribution function and probability 5
  1.3 Properties of discrete probabilities 7
  1.4 Uniform distribution 10
  1.5 Conditional probability 11
  1.6 Life table 14
  1.7 Bayes formula 16
  1.8 Independent events 19
  1.9 Notes and references 20
  1.10 Exercises 21

2 Combinatorics 21
  2.1 Tree diagrams 22
  2.2 Permutations 22
  2.3 The gamma function 25
  2.4 Overview of functions in Scilab 28
  2.5 The gamma function in Scilab 28
  2.6 The factorial and log-factorial functions 30
  2.7 Computing factorial and log-factorial with Scilab 31
  2.8 Stirling's formula 33
  2.9 Computing permutations and log-permutations with Scilab 36
  2.10 The birthday problem 38
  2.11 A modified birthday problem 40
  2.12 Combinations 43

5 Answers to exercises 71
  5.1 Answers for section 1 71
  5.2 Answers for section 2 75

Bibliography 89

Index 90
1 Discrete random variables

In this section, we present discrete random variables. The first section presents
general definitions for sets, including unions and intersections. Then we present
the definition of the discrete distribution function and the probability of an event.
In the third section, we give properties of probabilities, such as, for example, the
probability of the union of two disjoint events. The fourth section is devoted to the
very common discrete uniform distribution function. Then we present the definition
of conditional probability. This leads to Bayes' formula, which allows us to compute
the posterior conditional probability, given a set of hypothesis probabilities. This
section finishes with the definition of independent events.
1.1 Sets

The intersection A ∩ B of two sets A and B is the set of points common to A and B:

A ∩ B = {x ∈ A and x ∈ B} . (3)

The union A ∪ B of two sets A and B is the set of points which belong to at least
one of the sets A or B:

A ∪ B = {x ∈ A or x ∈ B} . (4)

The operations that we defined are presented in figure 1. These figures are often
referred to as Venn diagrams.

Two sets A and B are disjoint, or mutually exclusive, if their intersection is
empty, i.e. A ∩ B = ∅.

In the following, we will use the fact that we can always decompose the union of
two sets as the union of three disjoint subsets. Indeed, assume that A, B ⊂ Ω. We
have

A ∪ B = (A − B) ∪ (A ∩ B) ∪ (B − A), (5)

where A − B denotes the difference of the two sets, i.e. the set of points of A which
do not belong to B, and B − A is defined similarly. The complement A^c of a set
A ⊂ Ω is the set of points of Ω which do not belong to A.

[Figure: Venn diagrams presenting the sets A and B, the complement A^c and the difference A − B.]
Example 1.1 (Die with 6 faces) Consider a die with 6 faces. The sample space for
this experiment is

Ω = {1, 2, 3, 4, 5, 6} . (8)

The set of even numbers is A = {2, 4, 6} and the set of odd numbers is B = {1, 3, 5}.
Their intersection is empty, i.e. A ∩ B = ∅, which proves that A and B are disjoint.
Since their union is the whole sample space, i.e. A ∪ B = Ω, these two sets are
mutually complementary, i.e. A^c = B and B^c = A.
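The sets of this example can be checked in Scilab, where a finite set may be represented by a row vector; the variable names below are chosen for illustration only.

```scilab
// The sample space and the events of example 1.1.
Omega = 1:6;
A = [2 4 6];   // even numbers
B = [1 3 5];   // odd numbers
// The intersection is empty, so that A and B are disjoint.
intersect(A, B)   // returns the empty matrix
// The union is the whole sample space.
union(A, B)       // returns [1 2 3 4 5 6]
```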
1.2 Distribution function and probability

A random event is an event which has a chance of happening, and the probability
is a numerical measure of that chance. What exactly is random is quite difficult
to define. In this section, we define the probability associated with a distribution
function, a concept that can be defined very precisely.

Assume that Ω is a set, called the sample space. In this document, we will
consider the case where the sample space is finite, i.e. the number of elements in
Ω is finite. Assume that we are performing random trials, so that each trial is
associated to one outcome x ∈ Ω. Each subset A of the sample space is called an
event. We say that the event A ∩ B occurs if both the events A and B occur. We
say that the event A ∪ B occurs if the event A or the event B occurs.
Example 1.2 (Die with 6 faces) Consider a die with 6 faces which is rolled once. The
sample space is:

Ω = {1, 2, 3, 4, 5, 6} . (9)

Definition 1.1. (Distribution function) Assume that Ω is a finite sample space. A
distribution function on Ω is a function f which satisfies

f (x) ≥ 0, (10)

for all x ∈ Ω, and

Σ_{x∈Ω} f (x) = 1. (11)

Example 1.3 (Die with 6 faces) Assume that a die with 6 faces is rolled once. The
sample space for this experiment is

Ω = {1, 2, 3, 4, 5, 6} . (12)

Assume that the die is fair. This means that the probability of each of the six
outcomes is the same, i.e. the distribution function is f (x) = 1/6 for x ∈ Ω, which
satisfies the conditions of the definition 1.1.
Definition 1.2. (Probability) Assume that f is a distribution function on the sample
space Ω. For any event A ⊂ Ω, the probability P of A is

P (A) = Σ_{x∈A} f (x). (13)

Example 1.4 (Die with 6 faces) Assume that a die with 6 faces is rolled once so that
the sample space for this experiment is Ω = {1, 2, 3, 4, 5, 6}. Assume that the
distribution function is f (x) = 1/6 for x ∈ Ω. The event

A = {2, 4, 6} (14)

corresponds to the statement that the result of the roll is an even number. From
the definition 1.2, the probability of the event A is

P (A) = f (2) + f (4) + f (6) (15)
      = 1/6 + 1/6 + 1/6 (16)
      = 1/2. (17)

1.3 Properties of discrete probabilities
In this section, we present the properties that the probability P (A) satisfies. We also
derive some results for the probabilities of other events, such as unions of disjoint
events.

The following theorem gives some elementary properties satisfied by a probability
P.

Proposition 1.3. (Probability) Assume that Ω is a sample space and that f is a
distribution function on Ω. The probability of the event Ω is one, i.e.

P (Ω) = 1. (18)

The probability of the empty set is zero, i.e.

P (∅) = 0. (19)

For any event A ⊂ Ω, we have

0 ≤ P (A) ≤ 1. (21)

Proposition 1.4. (Disjoint events) Let A, B ⊂ Ω be two disjoint events. Then

P (A ∪ B) = P (A) + P (B). (23)

Proof. By the definition 1.2 and the decomposition 5, we have

P (A ∪ B) = Σ_{x∈A∪B} f (x) (24)
          = Σ_{x∈A−B} f (x) + Σ_{x∈A∩B} f (x) + Σ_{x∈B−A} f (x) (25)
          = Σ_{x∈A} f (x) + Σ_{x∈B} f (x) (26)
          = P (A) + P (B), (27)

since A ∩ B = ∅ implies A − B = A and B − A = B.

Proposition 1.5. (Pairwise disjoint events) Let A1, A2, . . . , An ⊂ Ω be pairwise
disjoint events. Then

P (A1 ∪ A2 ∪ . . . ∪ An) = P (A1) + P (A2) + . . . + P (An). (28)

Proof. For example, we can use the proposition 1.4 to state the proof by induction
on the number of events.
Example 1.5 (Die with 6 faces) Assume that a die with 6 faces is rolled once so that
the sample space for this experiment is Ω = {1, 2, 3, 4, 5, 6}. Assume that the
distribution function is f (x) = 1/6 for x ∈ Ω. The event A = {1, 2, 3} corresponds to
the numbers lower than or equal to 3. The probability of this event is P (A) = 1/2. The
event B = {5, 6} corresponds to the numbers greater than or equal to 5. The probability
of this event is P (B) = 1/3. The two events are disjoint, so that the proposition 1.5 can
be applied, which implies that P (A ∪ B) = 5/6.
Proposition 1.6. (Complement) For any event A ⊂ Ω, we have

P (A^c) = 1 − P (A). (29)

Proof. We have Ω = A ∪ A^c, where the sets A and A^c are disjoint. Therefore, from
proposition 1.4, we have

P (Ω) = P (A) + P (A^c), (30)

which, combined with the equality P (Ω) = 1, leads to

P (A^c) = 1 − P (A). (31)

The figure 3 presents the situation where two sets A and B have a non empty
intersection. When we add the probabilities of the two events A and B, the
intersection is added twice. This is why it must be removed by subtraction.
Proposition 1.7. (Union of two events) For any events A, B ⊂ Ω, we have

P (A ∪ B) = P (A) + P (B) − P (A ∩ B). (32)

Proof. Assume that A and B are two subsets of Ω. The proof is based on the analysis
of the Venn diagram presented in figure 3. The idea of the proof is to compute the
probability of the union from the decomposition 5 into disjoint sets, which, by the
proposition 1.5, implies

P (A ∪ B) = P (A − B) + P (A ∩ B) + P (B − A). (33)

The next part of the proof is based on the computation of P (A − B) and P (B − A).
We can decompose the set A as the union of the disjoint sets

A = (A − B) ∪ (A ∩ B), (34)

which implies

P (A − B) = P (A) − P (A ∩ B). (35)

Similarly, we have

P (B − A) = P (B) − P (A ∩ B). (36)

Plugging the two previous equalities into 33 leads to

P (A ∪ B) = P (A) + P (B) − P (A ∩ B), (37)

which concludes the proof.
1.4 Uniform distribution

In this section, we describe the particular situation where the distribution function
is uniform.

Definition 1.8. (Uniform distribution) Assume that Ω is a finite, nonempty, sample
space. The uniform distribution function is

f (x) = 1/#(Ω), (38)

for all x ∈ Ω, where #(Ω) denotes the number of elements of the set Ω.

Proposition 1.9. (Uniform probability) Assume that Ω is a finite, nonempty, sample
space, equipped with the uniform distribution function. For any event A ⊂ Ω, we have

P (A) = #(A)/#(Ω). (40)

Proof. The definition 1.2 implies

P (A) = Σ_{x∈A} f (x) (41)
      = Σ_{x∈A} 1/#(Ω) (42)
      = #(A)/#(Ω),

which concludes the proof.
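The proposition above can be sketched in Scilab by counting elements with the size function; the die of the previous examples is reused here for illustration.

```scilab
// Under the uniform distribution, P(A) = #(A)/#(Omega).
Omega = 1:6;
A = [2 4 6];
P = size(A, "*") / size(Omega, "*")   // P = 0.5
```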
1.5 Conditional probability

In this section, we define the conditional distribution function and the conditional
probability. We analyze this definition in the particular situation of the uniform
distribution.

In some situations, we want to consider the probability of an event A given that
an event B has occurred. In this case, we consider the set B as a new sample space,
and update the definition of the distribution function accordingly.

Definition 1.10. (Conditional distribution function) Assume that Ω is a sample
space and that f is a distribution function on Ω. Assume that A is a nonempty
subset of Ω. The function f (x|A) defined by

f (x|A) = f (x) / Σ_{x∈A} f (x), if x ∈ A,
f (x|A) = 0, if x ∉ A, (43)

is the conditional distribution function of x given A.

The hypothesis that A is nonempty implies P (A) = Σ_{x∈A} f (x) > 0, so that the
equality 43 is well defined.

The figure 4 presents the situation where an event A is considered for a conditional
distribution. The distribution function f (x) is with respect to the sample
space Ω, while the conditional distribution function f (x|A) is with respect to the
set A.
The conditional distribution function f (x|A) is a distribution function on Ω.
Indeed, we have

Σ_{x∈Ω} f (x|A) = Σ_{x∈A} f (x|A) + Σ_{x∉A} f (x|A) (45)
               = Σ_{x∈A} f (x) / Σ_{x∈A} f (x) (46)
               = 1, (47)

since f (x|A) = 0 when x ∉ A.

Proposition 1.11. (Conditional probability) Assume that Ω is a sample space and
that f is a distribution function on Ω. Assume that A and B are subsets of Ω, where
B is nonempty. The conditional probability of the event A given the event B is

P (A|B) = P (A ∩ B) / P (B). (49)

The figure 5 presents the situation where we consider the event A|B. The
probability P (A) is with respect to Ω, while the probability P (A|B) is with respect to
B.
Proof. Assume that A and B are subsets of the sample space Ω. The conditional
distribution function f (x|B) can be used to compute the probability of the event A
given the event B. Indeed, we have

P (A|B) = Σ_{x∈A} f (x|B) (50)
        = Σ_{x∈A∩B} f (x|B), (51)

since f (x|B) = 0 if x ∉ B. Hence,

P (A|B) = Σ_{x∈A∩B} [ f (x) / Σ_{x∈B} f (x) ] (52)
        = Σ_{x∈A∩B} f (x) / Σ_{x∈B} f (x) (53)
        = P (A ∩ B) / P (B), (54)

which concludes the proof.

Figure 5: The conditional probability P (A|B) measures the probability of the set
A ∩ B with respect to the set B.
In the particular case of the uniform distribution, the proposition 1.9 implies

P (A|B) = (#(A ∩ B)/#(Ω)) / (#(B)/#(Ω)) (55)
        = #(A ∩ B) / #(B). (56)

We notice that

(#(A ∩ B)/#(B)) × (#(B)/#(Ω)) = #(A ∩ B)/#(Ω), (57)

that is,

P (A|B)P (B) = P (A ∩ B), (58)

for all A, B ⊂ Ω with B nonempty. The previous equation could have been directly
found based on the equation 49.
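With the uniform distribution, the equation 56 reduces the conditional probability to a counting problem, which can be sketched in Scilab as follows; the event B below is an assumption chosen for illustration.

```scilab
// P(A|B) = #(A n B)/#(B) under the uniform distribution.
Omega = 1:6;
A = [2 4 6];          // the roll is even
B = [4 5 6];          // the roll is greater than 3
PAB = size(intersect(A, B), "*") / size(B, "*")   // PAB = 2/3
```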
Age Group    Male     Female
<1           100000   100000
1-4           99276    99391
5-9           99156    99292
10-14         99085    99232
15-19         98989    99164
20-24         98573    98991
25-29         97887    98758
30-34         97223    98484
35-39         96526    98133
40-44         95665    97621
45-49         94396    96823
50-54         92487    95603
55-59         89643    93850
60-64         85726    91384
65-69         80364    87726
70-74         72889    82275
75-79         62860    74398
80-84         49846    63218
85-89         34096    48086
90-94         18315    30289
95-99          7198    14523
100+           1940     4804

Figure 6: Life table for the USA in 2009.
1.6 Life table

In this section, we analyze a life table and compute conditional probabilities
with Scilab.

The World Health Organization gives life tables for many countries in the world
[14]. The table 6 gathers data compiled in the USA in 2009. The first line counts
100,000 males and females born alive, with decreasing values as the age increases.

From this table, we can easily deduce that 91384/100000 = 91.384 % of the
females live to age 60 and that 63218/100000 = 63.218 % of the females live to age 80.
We consider a woman who is 60. What is the probability that she lives to age 80?

Let us denote by A = {a ≥ 60} the event that a woman lives to age 60, and let
us denote by B = {a ≥ 80} the event that a woman lives to age 80. We want to
compute the conditional probability P ({a ≥ 80}|{a ≥ 60}). By the proposition
1.11, we have

P ({a ≥ 80}|{a ≥ 60}) = P ({a ≥ 60} ∩ {a ≥ 80}) / P ({a ≥ 60}) (59)
                      = P ({a ≥ 80}) / P ({a ≥ 60}) (60)
                      = 0.63218 / 0.91384 (61)
                      = 0.6918, (62)

with 4 significant digits. In other words, a woman who is already 60 has a 69.18 %
chance of living to 80.
It is easy to gather the data into Scilab variables, as in the following Scilab script.
The ages variable contains the age classes. It is made of 22 entries, where the class
#k is from age ages(k-1)+1 to ages(k). The number of male survivors in the class
#k is males(k), while the number of female survivors is females(k).

ages = [0;4;9;14;19;24;29;34;39;44;49;54; ..
59;64;69;74;79;84;89;94;99;100];
males = [100000;99276;99156;99085;98989;98573; ..
97887;97223;96526;95665;94396;92487;89643;85726; ..
80364;72889;62860;49846;34096;18315;7198;1940];
females = [100000;99391;99292;99232;99164;98991;98758; ..
98484;98133;97621;96823;95603;93850;91384; ..
87726;82275;74398;63218;48086;30289;14523;4804];
The following lifeprint function prints the data and displays a table similar to
the figure 6.

function lifeprint ( ages , males , females )
    nc = size(ages, "*")
    mprintf("   <1    %6d %6d\n", males(1), females(1))
    for k = 2 : nc
        amin = ages(k-1) + 1
        amax = ages(k)
        mprintf("%3d - %3d %6d %6d\n", ..
            amin, amax, males(k), females(k))
    end
endfunction
We are now interested in computing the required probabilities with Scilab. The
following lifeproba function returns the probability that a person lives to age
a, given the data in the tables ages and survivors. In practice, the survivors
variable will be either equal to males or to females. The algorithm first searches
in the ages table for the class k which contains the age a. We compute the index k
such that a is contained in the age class from ages(k-1)+1 to ages(k). Then the
probability of living to age a is computed by using the number of survivors associated
to the class k.

function p = lifeproba ( a , ages , survivors )
    nc = size(ages, "*")
    // Search for the index k of the age class which contains a
    found = %f
    for k = 2 : nc
        if ( a >= ages(k-1) + 1 & a <= ages(k) ) then
            found = %t
            break
        end
    end
    if ( ~found ) then
        error("Age not found in table")
    end
    p = survivors(k) / survivors(1)
endfunction

Although the previous algorithm is rather naive (it does not make use of vectorized
statements), this is sufficient for small life tables such as in our case.
The following session shows how the lifeproba function computes the probability of living to age 60.
--> pa = lifeproba (60 , ages , females )
pa =
0.91384
The following lifecondproba function returns the probability that a person of age a
lives to age b, given the data in the tables ages and survivors. We assume
that a < b. The algorithm is a straightforward application of the proposition 1.11.

function p = lifecondproba ( a , b , ages , survivors )
    if ( a >= b ) then
        p = 1
        return
    end
    pa = lifeproba(a, ages, survivors)
    pb = lifeproba(b, ages, survivors)
    p = pb / pa
endfunction
In the following session, we compute the probability that a woman lives to age 80,
given that she is 60.
--> pab = lifecondproba (60 , 80 , ages , females )
pab =
0.6917841
It is now easy to compute the probability that a female lives to various ages, given
that she is 40. This is done in the following script, which produces the figure 7.

bages = floor(linspace(41, 99, 20));
for k = 1 : 20
    pab(k) = lifecondproba(40, bages(k), ages, females);
end
plot(bages, pab, "bo-");
xtitle("Probability of living to age B, for a woman of age 40.", ..
    "B", "Probability");
1.7 Bayes formula

In this section, we present Bayes' formula, which allows us to compute the posterior
conditional probability, given a set of hypothesis probabilities.

Proposition 1.12. (Bayes' formula) Assume that the sample space Ω can be
decomposed in a sequence of events, which are called hypotheses. Let us denote by
Hi, with i = 1, . . . , m, a sequence of pairwise disjoint sets such that

Ω = H1 ∪ H2 ∪ . . . ∪ Hm . (63)

For any event E ⊂ Ω with P (E) > 0, we have

P (Hi |E) = P (E|Hi )P (Hi ) / Σ_{i=1,m} P (E|Hi )P (Hi ). (64)

Figure 7: Probability that a female lives to various ages, given that she is 40.

In order to use Bayes' formula, assume that the probabilities of each hypothesis
are known, that is, assume that the probabilities P (Hi ) are given for 1 ≤ i ≤ m.
Assume also that the probabilities P (E|Hi ) are known for 1 ≤ i ≤ m. Therefore, we
are able to compute P (Hi |E), that is, if the event E has occurred, we are able to
compute the probability of each hypothesis Hi . In practice, we consider the most
likely hypothesis Hi , for which the probability P (Hi |E) is maximum over i = 1, . . . , m.

Proof. By definition of the conditional probability, we have

P (Hi |E) = P (Hi ∩ E) / P (E). (65)
Moreover, the definition of the conditional probability also implies

P (Hi ∩ E) = P (E|Hi )P (Hi ). (66)

Since the hypotheses Hi decompose Ω into pairwise disjoint sets, we have

P (E) = Σ_{i=1,m} P (E ∩ Hi ) (68)
      = Σ_{i=1,m} P (E|Hi )P (Hi ). (69)

We can now plug 66 and 69 into 65, which concludes the proof.
The following example is presented in [17], in chapter 1, "Elements of probability".

Example 1.9 Consider the situation where an insurance company tries to compute
the probability of having an accident. This company makes the assumption that
people either are or are not accident prone, and considers the probability of
having an accident during the 1 year period following the insurance policy purchase.
They assume that an accident-prone person will have an accident with probability
0.4. For a non-accident-prone person, the probability of having an accident is 0.2.
Assume that 30 % of the population is accident prone. Assume that a person
has an accident. What is the probability that this person is accident prone?

Let us denote by E the event that the person will have an accident within the
year of purchase and denote by H1 the event that the person is accident prone.
By hypothesis, the sample space Ω of all the persons can be decomposed with the
pairwise disjoint sets H1 and H2 = H1^c, which are, in this particular situation, the
hypotheses. By hypothesis, we know that P (E|H1 ) = 0.4 and P (E|H2 ) = 0.2. We
also know that P (H1 ) = 0.3, which implies that P (H2 ) = 1 − P (H1 ) = 0.7. We
want to compute P (H1 |E). By Bayes' formula 64, we have

P (H1 |E) = P (E|H1 )P (H1 ) / (P (E|H1 )P (H1 ) + P (E|H2 )P (H2 )) (70)
          = (0.4 × 0.3) / (0.4 × 0.3 + 0.2 × 0.7) (71)
          ≈ 0.4615. (72)
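The computation of example 1.9 can be reproduced in Scilab; the two vectors below gather the hypothesis probabilities given in the text.

```scilab
// Bayes formula 64 for the insurance example.
PH  = [0.3 0.7];    // P(H1), P(H2)
PEH = [0.4 0.2];    // P(E|H1), P(E|H2)
PHE = PEH .* PH / sum(PEH .* PH);
mprintf("P(H1|E) = %.4f\n", PHE(1))   // prints 0.4615
```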
1.8 Independent events

In this section, we define independent events and give examples of such events.

Definition 1.13. (Independent events) Assume that Ω is a finite sample space. Two
events A, B are independent if P (A) > 0 and P (B) > 0 and

P (A|B) = P (A). (73)

In the previous definition, the roles of A and B can be reversed, which leads to
the following result. If two events A, B are independent, then

P (B|A) = P (B). (74)

Proposition 1.14. (Independent events) Two events A, B ⊂ Ω with P (A) > 0 and
P (B) > 0 are independent if and only if

P (A ∩ B) = P (A)P (B). (75)
Proof. Assume that the two events A, B satisfy P (A) > 0 and P (B) > 0.

In the first part of this proof, assume that A and B are independent, and let us
prove that 75 is satisfied. By definition of the conditional probability 1.11, we have
P (A ∩ B) = P (A|B)P (B). Since A and B are independent, we have P (A|B) = P (A),
which leads to P (A ∩ B) = P (A)P (B) and concludes the first part.

In the second part, let us assume that 75 is satisfied, and let us prove that the
events A and B are independent. By definition of the conditional probability 1.11,
we have

P (A|B) = P (A ∩ B) / P (B) (76)
        = P (A)P (B) / P (B)
        = P (A), (77)

and, similarly,

P (B|A) = P (B)P (A) / P (A)
        = P (B),

which proves that the events A and B are independent and concludes the proof.
The proposition 1.14 has an important and rather subtle consequence, which is
presented in the following example. This example is based on the historical remarks
of section 1.2 in [7], which present the results collected by Gerolamo Cardano
(1501-1576).
Example 1.10 (Die with 6 faces) Assume that a fair die with 6 faces is rolled twice
(instead of once) and consider the problem of choosing the correct sample space. The
correct sample space takes the order of the rolls into account and is

Ω = {(i, j) / i, j = 1, . . . , 6} . (78)

The set Ω has 6 × 6 = 36 elements. The wrong (for our purpose), unordered, sample
space is

Ω2 = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
      (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
      (3, 3), (3, 4), (3, 5), (3, 6), (4, 4), (4, 5), (4, 6),
      (5, 5), (5, 6), (6, 6)}. (79)

The set Ω2 has 21 elements. The fact that the set Ω2 is the wrong sample space for
this experiment is linked to the fact that the two rolls are independent. Therefore,
the equality 75 can be applied, stating that the probability of two independent
events is the product of the probabilities. Therefore, the probability of any event
(i, j) is

P ((i, j)) = P (i)P (j) (80)
           = 1/36, (81)

for all i, j = 1, . . . , 6. The only sample space which makes this probability consistent
with the uniform distribution of a finite space, given by the equality 38, is Ω. If the
sample space Ω2 was chosen, the equality 75 would be violated, so that the two events
would become dependent.
1.9 Notes and references

The material for section 1.1 is presented in [11], chapter 1, "Sets, spaces and
measures". The same presentation is given in [7], section 1.2, "Discrete probability
distributions". The example 1.3 is given in [7].

The equations 21, 18 and 23 are at the basis of probability theory, so that in
[17] these properties are stated as axioms.

In some statistics books, such as [8] for example, the union of the sets A and B
is denoted by the sum A + B, and the intersection of sets is denoted by the product
AB. We did not use these notations in this document.

The section 1.6 on life tables is inspired by an example presented in [7], in section
4.1, "Discrete conditional probability". We have updated the data from 1990 to 2009
and introduced the associated Scilab functions.
1.10 Exercises

Exercise 1.1 (Head and tail) Assume that we have a coin which is tossed twice. We record the
outcomes so that the order matters, i.e. the sample space is {HH, HT, TH, TT}. Assume that the
distribution function is uniform, i.e. the head and the tail have an equal probability.
1. What is the probability of the event A = {HH, HT, TH}?
2. What is the probability of the event A = {HH, HT}?

Exercise 1.2 (Two dice) Assume that we are rolling a pair of dice. Assume that each face has
an equal probability.
1. What is the probability of getting a sum of 7?
2. What is the probability of getting a sum of 11?
3. What is the probability of getting a double one, i.e. snake eyes?
Exercise 1.3 (de Méré's experiments) This exercise is presented in [7], in the historical remarks
of section 1.2, "Discrete Probability Distributions". Famous letters between Pascal and Fermat
were instigated by a request for help from a French nobleman and gambler, the Chevalier de Méré.
It is said that de Méré had been betting that, in four rolls of a die, at least one six would turn up
(event A). He was winning consistently and, to get more people to play, he changed the game to
bet that, in 24 rolls of two dice, a pair of sixes would turn up (event B). It is claimed that de Méré
lost with 24 and felt that 25 rolls were necessary to make the game favorable (event C). What is the
probability of the three events A, B and C? Can you compute with Scilab the probability of
event A for a number of rolls equal to 1, 2, 3 or 4? Can you compute with Scilab the probability
of the event B or C for a number of rolls equal to 10, 20, 24, 25, 30?
Exercise 1.4 (Independent events) Assume that Ω is a finite sample space. Assume that the
two events A, B are independent. Prove that P (B|A) = P (B).

Exercise 1.5 (Boole's inequality) Assume that Ω is a finite sample space. Let (Ei )i=1,n be a
sequence of finite sets included in Ω and n > 0. Prove Boole's inequality:

P ( ∪_{i=1,n} Ei ) ≤ Σ_{i=1,n} P (Ei ). (82)
2 Combinatorics

Figure 8: Tree diagram - The task is made of 3 steps. There are 2 choices for
step #1, 3 choices for step #2 and 2 choices for step #3. The total number of ways
to perform the full sequence of steps is N = 2 × 3 × 2 = 12.
2.1 Tree diagrams

In this section, we present the general method which allows us to count the total number
of ways that a task can be performed. We illustrate that method with tree diagrams.

Assume that a task is carried out in a sequence of n steps. The first step can be
performed by making one choice among m1 possible choices. Similarly, there are m2
possible ways to perform the second step, and so forth. The complete sequence of
steps can therefore be performed in N = m1 × m2 × . . . × mn different ways.

To illustrate the sequence of steps, the associated tree can be drawn. An example
of such a tree diagram is given in the figure 8. Each node in the tree corresponds to
one step in the sequence. The number of children of a parent node is equal to the
number of possible choices for the step. At the bottom of the tree, there are N leaves,
where each path, i.e. each sequence of nodes from the root to a leaf, corresponds
to a particular sequence of choices.

We can think of the tree as representing a random experiment, where the final
state is the outcome of the experiment. In this context, each choice is performed
at random, depending on the probability associated with each branch. We will
review tree diagrams throughout this section and especially in the section devoted
to Bernoulli trials.
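The counting rule above is a simple product, which can be sketched in Scilab for the task of figure 8.

```scilab
// Number of choices at each of the three steps of figure 8.
m = [2 3 2];
N = prod(m)   // N = 12
```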
2.2 Permutations

In this section, we present permutations, which are ordered subsets of a given set.

Definition 2.1. (Permutation) Assume that A is a finite set. A permutation of A
is a one-to-one mapping of A onto itself.

Without loss of generality, we can assume that the finite set A can be ordered
and numbered from 1 to n = #(A), so that we can write A = {1, 2, . . . , n}. To define
a particular permutation, one can write a matrix with 2 rows and n columns which
represents the mapping. One example of a permutation on the set A = {a1, a2, a3, a4}
is

( 1 2 3 4 )
( 2 1 4 3 ) , (83)

which signifies that the mapping is:

a1 → a2,  a2 → a1,  a3 → a4,  a4 → a3.

Since the first row is always the same, there is no additional information provided
by this row. This is why the permutation can be written by uniquely defining the
second row. This way, the previous mapping can be written as

( 2 1 4 3 ) . (84)

Figure 9: Tree diagram for the computation of permutations of the set A = {1, 2, 3}.
We can try to count the number of possible permutations of a given set A with
n elements.

The tree diagram associated with the computation of the number of permutations
for n = 3 is presented in figure 9. In the first step, we decide which number to place
at index 1. For this index, we have 3 possibilities, that is, the numbers 1, 2 and
3. In the second step, we decide which number to place at index 2. At this index,
we have 2 possibilities left, where the exact numbers depend on the branch. In the
third step, we decide which number to place at index 3. At this last index, we only
have one number left.

This leads to the following proposition, which defines the factorial function.

Proposition 2.2. (Factorial) The number of permutations of a set A of n elements
is the factorial of n, defined by

n! = n × (n − 1) × . . . × 2 × 1. (85)

By convention, we set

0! = 1. (86)
Example 2.1 Let us compute the number of permutations of the set A = {1, 2, 3}.
By the equation 85, we have 3! = 3 × 2 × 1 = 6 permutations of the set A. These
permutations are:

(1 2 3)
(1 3 2)
(2 1 3)
(2 3 1)
(3 1 2)
(3 2 1) (87)

The previous permutations can also be directly read from the tree diagram 9, from
the root of the tree to each of the 6 leaves.
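The permutations of example 2.1 can be generated in Scilab with the perms function, which returns one permutation per row.

```scilab
// All permutations of {1,2,3}: a 6-by-3 matrix.
P = perms([1 2 3]);
n = size(P, "r")   // n = 6, that is, 3!
```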
In some situations, not all the elements in the set A are involved in the
permutation. Assume that j is a positive integer, so that 0 ≤ j ≤ n. A j-permutation is
a permutation of a subset of j elements of A. The general counting method used
for the previous proposition allows us to count the total number of j-permutations of
a given set A.

Proposition 2.3. (j-permutations) Assume that j is a positive integer. The number
of j-permutations of a set A of n elements is

(n)j = n × (n − 1) × . . . × (n − j + 1). (88)

Proof. The element at index 1 can be chosen among n possible values, so that there are
n ways to set the element at index 1. Once the element at index 1 is placed, there are
n − 1 ways to set the element at index 2. The element at index j can only be set to
one of the remaining n − j + 1 values. The total number of j-permutations is therefore
n × (n − 1) × . . . × (n − j + 1), which concludes the proof.

We notice that the number of j-permutations can be expressed with the factorial
function, since

(n)j = n! / (n − j)!. (89)
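The number of j-permutations can be sketched in Scilab, either as a direct product or with the factorial function; the values n = 4 and j = 2 of the next example are used here.

```scilab
// (n)_j = n*(n-1)*...*(n-j+1) = n!/(n-j)!
n = 4; j = 2;
d = prod(n-j+1 : n)                 // d = 12
f = factorial(n) / factorial(n-j)   // f = 12
```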
Example 2.2 Let us compute the number of 2-permutations of the set A = {1, 2, 3, 4}.
By the equation 88, we have (4)2 = 4 × 3 = 12 2-permutations of the set A. These
permutations are:

(1 2) (1 3) (1 4)
(2 1) (2 3) (2 4)
(3 1) (3 2) (3 4)
(4 1) (4 2) (4 3) (91)
2.3 The gamma function

In this section, we present the gamma function, which is closely related to the
factorial function. The gamma function was first introduced by the Swiss mathematician
Leonhard Euler in his goal to generalize the factorial to non integer values [18].
Efficient implementations of the factorial function are based on the gamma function,
and this is why this function will be analyzed in detail. The practical computation
of the factorial function will be analyzed in the next section.

Definition 2.4. (Gamma function) Let x be a real with x > 0. The gamma function
is defined by

Γ(x) = ∫_0^1 (− log(t))^(x−1) dt. (92)

The previous definition is not the usual form of the gamma function, but the
following proposition allows us to get it.

Proposition 2.5. (Gamma function) Let x be a real with x > 0. The gamma
function satisfies

Γ(x) = ∫_0^∞ t^(x−1) e^(−t) dt. (93)
Proof. In the definition 92, we use the change of variable t = e^(−u), which implies
dt = −e^(−u) du. When t goes from 0 to 1, the variable u = − log(t) goes from ∞
to 0. Therefore,

Γ(x) = ∫_∞^0 u^(x−1) (−e^(−u)) du. (94)

For any continuously differentiable function f and any real numbers a and b,

∫_b^a f (x)dx = − ∫_a^b f (x)dx. (95)

We reverse the bounds of the integral in the equality 94 and get the result.

The gamma function satisfies

Γ(1) = ∫_0^∞ e^(−t) dt = [−e^(−t)]_0^∞ = (0 + e^0) = 1. (96)
The following proposition makes the link between the gamma and the factorial
functions.

Proposition 2.6. (Gamma and factorial) Let x be a real with x > 0. The gamma
function satisfies

Γ(x + 1) = xΓ(x), (97)

and, for any nonnegative integer n,

Γ(n + 1) = n!. (98)

Proof. By the proposition 2.5, we have

Γ(x + 1) = ∫_0^∞ t^x e^(−t) dt. (99)

The proof is based on the integration by parts formula. For any continuously
differentiable functions f and g and any real numbers a and b, we have

∫_a^b f (t)g'(t)dt = [f (t)g(t)]_a^b − ∫_a^b f'(t)g(t)dt. (100)

Let us define f (t) = t^x and g'(t) = e^(−t). We have f'(t) = x t^(x−1) and
g(t) = −e^(−t). By the integration by parts formula 100, the equation 99 becomes

Γ(x + 1) = [−t^x e^(−t)]_0^∞ + ∫_0^∞ x t^(x−1) e^(−t) dt. (101)

Let us introduce the function h(t) = t^x e^(−t). We have h(0) = 0 and
lim_{t→∞} h(t) = 0, for any x > 0. Hence,

Γ(x + 1) = x ∫_0^∞ t^(x−1) e^(−t) dt, (102)

which proves the equality 97. Since Γ(1) = 1, the equality 98 follows by induction
on n.

The gamma function is not the only function f which satisfies f (n) = n!. But the
Bohr-Mollerup theorem proves that the gamma function is the unique function f
which satisfies the equalities f (1) = 1 and f (x + 1) = xf (x), and such that log(f (x))
is convex [2].
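The link between the gamma and the factorial functions can be checked numerically in Scilab.

```scilab
// Equation 98: gamma(n+1) is equal to n! for integer n.
n = 5;
gamma(n + 1)   // 120
factorial(n)   // 120
```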
It is possible to extend this function to negative values by inverting the equation
97, which implies

Γ(x) = Γ(x + 1)/x. (103)

Applying the same equality to x + 1 yields

Γ(x + 1) = Γ(x + 2)/(x + 1), (104)

which leads to

Γ(x) = Γ(x + 2)/(x(x + 1)). (105)

More generally, we have the following proposition.

Proposition 2.7. (Gamma function for negative arguments) For any noninteger x
and any positive integer n such that x + n > 0, we have

Γ(x) = Γ(x + n) / (x(x + 1) . . . (x + n − 1)). (106)
Proof. The proof is by induction on n. The equation 103 proves that the equality
is true for n = 1. Assume that the equality 106 is true for n and let us prove that it
also holds for n + 1. By the equation 103 applied to x + n, we have

Γ(x + n) = Γ(x + n + 1)/(x + n). (107)

Therefore, we have

Γ(x) = Γ(x + n + 1) / (x(x + 1) . . . (x + n − 1)(x + n)), (108)

which proves that the statement holds for n + 1 and concludes the proof.
The gamma function is singular for negative integer values of its argument, as
stated in the following proposition.

Proposition 2.8. (Gamma function for negative integer arguments) For any
nonnegative integer n,

Γ(−n + h) ≈ (−1)^n / (n! h), (109)

when h is small.
factorial    returns n!
gamma        returns Γ(x)
gammaln      returns log(Γ(x))

Figure 10: Scilab commands for permutations.
Proof. Consider the equation 106 with x = −n + h. We have

    Γ(−n + h) = Γ(h) / ((h − n)(h − n + 1) ... (h − 1)).         (110)

But Γ(h) = Γ(h + 1) / h, which leads to

    Γ(−n + h) = Γ(h + 1) / ((h − n)(h − n + 1) ... (h − 1) h).   (111)

When h is small, we have Γ(h + 1) ≈ Γ(1) = 1 and
(h − n)(h − n + 1) ... (h − 1) ≈ (−n)(−n + 1) ... (−1) = (−1)^n n!, which leads
to the equation 109.
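The singular behavior stated by the proposition 2.8 can be checked numerically.
The following Python sketch uses the illustrative values n = 3 and h = 10^−6:

```python
import math

# gamma(-n + h) ~ (-1)^n / (n! * h) for small h (equation 109)
n, h = 3, 1e-6
approx = (-1) ** n / (math.factorial(n) * h)
# the relative departure from the asymptotic value is of the order of h
assert abs(math.gamma(-n + h) / approx - 1) < 1e-3
```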
2.4 Overview of functions in Scilab

The figure 10 presents the Scilab commands which are used in the following
sections.

2.5 The gamma function in Scilab

The gamma function computes Γ(x) for a real input argument x. The mathematical
function Γ(x) can be extended to complex arguments, but this has not been
implemented in Scilab.

The following script plots the gamma function for x ∈ [−4, 4].
x = linspace ( -4 , 4 , 1001 );
y = gamma ( x );
plot ( x , y );
[Figure 11: the gamma function for x ∈ [−4, 4]; the function is singular at
x = 0, −1, −2, −3.]
Notice that the two floating point signed zeros +0 and -0 are associated with
the function values −∞ and +∞. This is consistent with the value of the limit
of the function from either side of the singular point. This contrasts with
the value of the gamma function at negative integer points, where the function
value is %nan. This is consistent with the fact that, at these singular
points, the function is equal to −∞ on one side and +∞ on the other side.
Therefore, since the argument x has one single floating point representation
when it is a negative nonzero integer, the only result consistent with the
IEEE 754 standard is %nan.
Notice that we used 1001 points to plot the gamma function. This allows us to
get a smooth curve, even near the singular points.

[Figure 12: the factorial function n! for n = 1, 2, ..., 10.]
2.6 The factorial and log-factorial functions
In the following script, we plot the factorial function for values of n from 1 to 10.
f = factorial (1:10)
plot ( 1:10 , f , "b-o" )
The result is presented in figure 12. We see that the growth rate of the factorial
function is large.
The largest value of n such that n! is representable as a double precision
floating point number is n = 170. In the following session, we check that 171!
is not representable as a Scilab double.

--> factorial (170)
 ans =
    7.257D+306
--> factorial (171)
 ans =
    Inf
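The same limit exists in any environment using IEEE 754 doubles. The following
Python sketch reproduces it (Python's math.factorial is exact on integers, so
we force the conversion to a double):

```python
import math

# 170! is the largest factorial representable as a double
assert math.isfinite(float(math.factorial(170)))   # about 7.257e306
try:
    float(math.factorial(171))                     # exceeds the double range
    overflowed = False
except OverflowError:
    overflowed = True
assert overflowed
```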
The log-factorial function is defined by

    f_ln(n) = log(n!).                                           (112)

[Figure 13: the logarithm of n!, for n from 0 to 180.]
2.7 Computing factorial and log-factorial with Scilab

In this section, we present how to compute the factorial function in Scilab,
focusing on accuracy and efficiency.
The factorial function returns the factorial number associated with the given
n. It has the following syntax:
f = factorial ( n )
The statement n(n==0)=1 sets all zero entries of the matrix n to one, so that
the next statements do not have to manage the special case 0! = 1. Then, we
use the cumprod function in order to compute a column vector containing
cumulated products, up to the maximum entry of n. The use of cumprod produces
all the results in one call, but it also produces unnecessary values of the
factorial function. In order to get just what is needed, the statement
v = t(n(:)) extracts the required values. Finally, the statement
f = matrix(v,size(n)) reshapes the matrix of values so that the shape of the
output argument is the same as the shape of the input argument.
The following factorial_naive function computes n! based on the prod function,
which computes the product of its input argument.
function f = factorial_naive ( n )
f = prod (1: n )
endfunction
The factorial_naive function has two drawbacks. The first one is that it cannot manage matrix input arguments. Furthermore, it requires more memory than
necessary.
In practice, the factorial function can be computed based on the gamma function.
The following implementation of the factorial function is based on the equality 98.
function f = myfactorial ( n )
    if ( or ( n (:) < 0 ) | or ( n (:) <> round ( n (:) ) ) ) then
        error ( " myfactorial : n must all be nonnegative integers " );
    end
    f = gamma ( n + 1 )
endfunction
The myfactorial function also checks that the input argument n is nonnegative.
It also checks that n is an integer by using the condition
n(:) <> round(n(:)). Indeed, if the value of n is different from the value of
round(n), this means that the input argument n is not an integer.
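The same idea can be sketched in Python; my_factorial is a hypothetical name
mirroring the myfactorial function above:

```python
import math

def my_factorial(n):
    # reject negative or non-integer arguments, as myfactorial does
    if n < 0 or n != round(n):
        raise ValueError("n must be a nonnegative integer")
    # the factorial is computed through the gamma function (equation 98)
    return math.gamma(n + 1)

assert my_factorial(5) == 120.0
assert my_factorial(0) == 1.0
```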
The main drawback of the factorial function provided by Scilab is that it uses
more memory than necessary. It may fail to produce a result when it is given a
large input argument. In the following session, we use the factorial function
with a very large input integer. In this particular case, it is obvious that
the correct result is Inf. This should have been the result of the function,
which should not have generated an error. On the other hand, the myfactorial
function works perfectly.

--> factorial (1. e10 )
     ! - - error 17
We now consider the computation of the log-factorial function fln . We can use
the gammaln function which directly provides the correct result.
function flog = factoriallog ( n )
flog = gammaln ( n +1)
endfunction
The advantage of this method is that matrix input arguments can be managed by
the factoriallog function.
There is another possible implementation for the log-factorial function, based
on the logarithm function. We have

    log(n!) = log(1 · 2 · 3 ··· n)                               (113)
            = log(1) + log(2) + log(3) + ... + log(n).           (114)

The previous equation can be simplified since log(1) = 0. This leads to the
following implementation.

function flog = factoriallog_lessnaive ( n )
    flog = sum ( log (2: n ))
endfunction
The previous function has several drawbacks. The first problem is that this
function may require a large array if n is large. Hence, using this function
is limited to relatively small values of n. Moreover, it requires evaluating
the log function at least n − 1 times, which leads to a performance issue.
Finally, it is not possible to directly let the variable n be a matrix of
doubles. All these issues make the factoriallog_lessnaive function a less than
perfect implementation of the log-factorial function.
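The two approaches can be compared numerically. The following Python sketch
checks the log-gamma method against the sum of logarithms, for the
illustrative value n = 50:

```python
import math

n = 50
flog_gamma = math.lgamma(n + 1)                         # log(n!) via log-gamma
flog_sum = sum(math.log(k) for k in range(2, n + 1))    # log(2) + ... + log(n)
assert abs(flog_gamma - flog_sum) < 1e-9
```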
2.8 Stirling's formula

In this section, we present Stirling's formula [2, 19], which gives an
asymptotic equivalent of the gamma function.

Let us recall the definition of the asymptotic symbol ∼.

Definition 2.9. (Asymptotic) Let {a_n}_{n≥0} and {b_n}_{n≥0} be two real
sequences. We say that a_n is asymptotically equal to b_n, and write
a_n ∼ b_n, if

    lim_{n→∞} a_n / b_n = 1.                                     (115)

Proposition 2.10. (Stirling's formula for the gamma function) When x → ∞,

    Γ(x) ∼ √(2π/x) (x/e)^x.                                      (116)
We shall not give the proof of this formula here (see [2, 19] for a complete
derivation).

The previous proposition directly leads to an asymptotic equivalent of the
factorial function.
Proposition 2.11. (Stirling's formula) For any positive integer n,

    n! ∼ (n/e)^n √(2πn).                                         (117)

Proof. By the equation 98 and the proposition 2.10, we have

    n! = Γ(n + 1)                                                (118)
       ∼ √(2π/(n + 1)) ((n + 1)/e)^(n+1)                         (119)
       = √(2π(n + 1)) (n + 1)^n e^(−n−1).                        (120)

We can simplify the previous expression for large values of n. Indeed, since
(1 + 1/n)^n → e, we have
(n + 1)^n e^(−n−1) = n^n (1 + 1/n)^n e^(−n−1) ∼ n^n e^(−n), and
√(2π(n + 1)) ∼ √(2πn), which concludes the proof.
In the following script, we compare Stirling's formula and the factorial
function for various values of n. Moreover, we compute the number of
significant digits produced by Stirling's formula, by using the equation
d = −log10(|n! − S_n| / n!), where S_n is Stirling's approximation given by
the equation 117.
-->n = (1:20:185)';
-->f = factorial ( n );
-->s = sqrt (2* %pi .* n ).*( n ./ %e ).^ n ;
-->d = - log10 ( abs (f - s )./ f );
-->[ n f s d ]
 ans =
    1.      1.           0.9221370    1.1086689
    21.     5.109D+19    5.089D+19    2.4022947
    41.     3.345D+49    3.338D+49    2.692415
    61.     5.076D+83    5.069D+83    2.8648116
    81.     5.79D+120    5.79D+120    2.9878919
    101.    9.42D+159    9.41D+159    3.0836832
    121.    8.09D+200    8.08D+200    3.1621171
    141.    1.89D+243    1.89D+243    3.2285294
    161.    7.59D+286    7.58D+286    3.2861201
    181.    Inf          Inf          Nan
We see that the number of significant digits increases with n, reaching more than 3
when n is close to its upper limit.
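The number of significant digits can be cross-checked in Python; the stirling
helper below is illustrative:

```python
import math

def stirling(n):
    # Stirling's approximation (n/e)^n * sqrt(2*pi*n) (equation 117)
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

n = 21
d = -math.log10(abs(math.factorial(n) - stirling(n)) / math.factorial(n))
assert 2.3 < d < 2.5    # about 2.4 significant digits for n = 21
```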
It is not straightforward to prove Stirling's formula. Still, we can easily
prove the following proposition, which focuses on the log-factorial function.

Proposition 2.12. (Log-factorial for large n) For any positive integer n,

    log(n!) ∼ n log(n) − n.                                      (121)

Proof. We have

    log(n!) = log(1) + log(2) + ... + log(n)                     (122)
            = log(2) + ... + log(n),                             (123)

since log(1) = 0. The log function is a nondecreasing function of x for x > 0.
Therefore, for any integer k ≥ 1,

    ∫_{k−1}^{k} log(x) dx < log(k) < ∫_{k}^{k+1} log(x) dx.      (124)

Summing these inequalities for k = 1, 2, ..., n leads to

    ∫_0^n log(x) dx < log(n!) < ∫_1^{n+1} log(x) dx.             (125)
We must now compute the two integrals which appear in the previous
inequalities. We recall that the anti-derivative of the log(x) function is
x log(x) − x, since (x log(x) − x)' = log(x) + x · (1/x) − 1 = log(x).
Furthermore, we recall that the limit of the function x log(x) is zero when x
converges to zero. Therefore,

    ∫_0^n log(x) dx = [x log(x) − x]_0^n                         (126)
                    = n log(n) − n                               (127)

and

    ∫_1^{n+1} log(x) dx = [x log(x) − x]_1^{n+1}                 (128)
                        = (n + 1) log(n + 1) − (n + 1) + 1       (129)
                        = (n + 1) log(n + 1) − n.                (130)

We plug the two previous results into the equation 125 and get

    n log(n) − n < log(n!) < (n + 1) log(n + 1) − n.             (131)
Since the ratio of both bounds to n log(n) − n converges to 1 when n → ∞, this
implies

    lim_{n→∞} log(n!) / (n log(n) − n) = 1,                      (132)

which proves the equation 121.

Notice that the equation 121 can also be derived from Stirling's formula 117.
Indeed, taking the logarithm of the equation 117 leads to

    log(n!) ∼ log( (n/e)^n √(2πn) )                              (133)
            = n log(n/e) + log(√(2π)) + log(√n)                  (134)
            = n log(n) − n log(e) + (1/2) log(2π) + (1/2) log(n) (135)
            = n log(n) − n + (1/2) log(2π) + (1/2) log(n),       (136)

since log(e) = 1. This immediately implies the equation 121.
In the following session, we compare the asymptotic equation 121 with the
gammaln function. The session displays the columns [n f s d], where f is
computed from the gammaln function, s is computed from the asymptotic equation
121 and d is the number of significant digits of the asymptotic equation. We
see that, for n greater than 10^10, we have more than 10 significant digits in
the asymptotic equation.
-->n = 10 .^ (1:10)';
-->f = gammaln ( n + 1 );
-->s = n .* log ( n ) - n ;
-->d = - log10 ( abs ( f - s ) ./ f );
-->[ n f s d ]
 ans =
    10.         15.104413    13.025851    0.8613409
    100.        363.73938    360.51702    2.0526167
    1000.       5912.1282    5907.7553    3.1309743
    10000.      82108.928    82103.404    4.1721275
    100000.     1051299.2    1051292.5    5.1972489
    1000000.    12815518.    12815511.    6.2141578
    10000000.   1.512D+08    1.512D+08    7.2263182
    1.000D+08   1.742D+09    1.742D+09    8.2354866
    1.000D+09   1.972D+10    1.972D+10    9.2426477
    1.000D+10   2.203D+11    2.203D+11    10.248397

2.9 Computing permutations and log-permutations with Scilab
The number of j-permutations of a set with n elements is

    (n)_j = n (n − 1) ... (n − j + 1)                            (137)
          = n! / (n − j)!,                                       (138)

and its logarithm is

    log((n)_j) = f_ln(n) − f_ln(n − j).                          (139)
In the following session, we see that the previous function works for small
values of n and j.

-->n = [5 5 5 5 5 5]';
-->j = [0 1 2 3 4 5]';
-->[ n j permutations_verynaive (n , j )]
 ans =
    5.    0.    1.
    5.    1.    5.
    5.    2.    20.
    5.    3.    60.
    5.    4.    120.
    5.    5.    120.
In the following session, we check the values of the function (n)_j for n = 5
and j = 0, 1, ..., 5.

-->n = 5;
--> for j = 0 : 5
-->     p = permutations_naive ( n , j );
-->     disp ([ n j p ]);
--> end
    5.    0.    1.
    5.    1.    5.
    5.    2.    20.
    5.    3.    60.
    5.    4.    120.
    5.    5.    120.
2.10 The birthday problem

In this section, we consider a practical computation of a probability, based
on permutations. We present in this example the Scilab functions which are
used to perform the computations.
Assume that n > 0 persons are gathered in a room. Can we compute the
probability that two persons in the room have the same birthday?

To perform this computation, we assume that the year is made of 365 days, and
that each day has the same probability of being a birthday.

Let us denote by Ω the sample space, which is the space of all the possible
combinations of birthdays for n persons. It is defined by

    Ω = {(i_1, ..., i_n) / i_1, ..., i_n = 1, 2, ..., 365}.      (143)
The event E we seek is the event that two persons have the same birthday. To
compute its probability more easily, we can compute the probability of the
complementary event E^c, i.e., the event that all persons have distinct
birthdays. Indeed, once we have computed P(E^c), we can deduce the searched
probability with P(E) = 1 − P(E^c).

By hypothesis, all days in the year have the same probability, so that we can
apply the proposition 1.9 for uniform discrete distributions. Therefore,
P(E^c) = #(E^c) / #(Ω). The birthday of each person can be chosen from 365
days. Since the birthdays of the persons are independent, this leads to

    #(Ω) = 365^n.                                                (144)
Let us now compute the size of the complementary event E^c. The birthday of
the first person can be chosen among 365 possible days. Once chosen, the
birthday of the second person can be chosen among 365 − 1 = 364 days, so that
the two birthdays are different. By repeating this process for the n persons,
we get the following size for the event E^c:

    #(E^c) = 365 · 364 ··· (365 − n + 1) = (365)_n.              (145)

Let us define the probability Q(E) as the probability of the complementary
event:

    Q(E) = P(E^c).                                               (146)

This leads to

    Q(E) = (365)_n / 365^n                                       (147)

and

    P(E) = 1 − (365)_n / 365^n.                                  (148)
We are going to use the previous function with d = 365. In the following
session, we see that the twobirthday_verynaive function is not able to compute
any result, whatever the value of n.

-->n = (1:5)';
-->[ n twobirthday_verynaive (n ,365)]
 ans =
    1.    Nan
    2.    Nan
    3.    Nan
    4.    Nan
    5.    Nan
In the following session, we compute the probability that two persons have the
same birthday, for n going from 1 to 5.

-->n = (1:5)';
-->[ n twobirthday_naive (n ,365)]
 ans =
    1.    0.
    2.    0.0027397260273972490197
    3.    0.0082041658847813447863
    4.    0.0163559124665503263785
    5.    0.0271355736996392593596
We can explore larger values of n and see when the probability p breaks the
p = 0.5 threshold. The following Scilab session shows how to compute the
required probability for n = 20, 21, ..., 25.

-->n = (20:25)';
-->[ n twobirthday_naive (n ,365)]
 ans =
    20.    0.4114383835806317835093
    21.    0.4436883351652584073221
    22.    0.4756953076625709542213
    23.    0.5072972343239776638057
    24.    0.5383442579144532835755
    25.    0.5686997039695230737877

This shows that, if 23 or more persons are gathered, it is a favorable bet
that two persons have the same birthday.
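The whole computation can be condensed in a short Python sketch;
p_shared_birthday is a hypothetical helper which mirrors the twobirthday
computation, using logarithms to avoid overflow:

```python
import math

def p_shared_birthday(n, d=365):
    # P(E) = 1 - (d)_n / d^n, with (d)_n computed through the log-gamma function
    log_q = math.lgamma(d + 1) - math.lgamma(d - n + 1) - n * math.log(d)
    return 1 - math.exp(log_q)

# 23 persons make the bet favorable, 22 do not
assert p_shared_birthday(22) < 0.5 < p_shared_birthday(23)
```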
2.11 A modified birthday problem

We now consider a modified birthday problem. We have computed that 23 persons
are sufficient to make a favorable bet that two persons are born the same day.
We are now interested in the event that two persons are born on the same day,
at the same hour. More precisely, we would like to compute the number of
persons in the group so that the probability of having two persons with the
same birth day and hour is greater than 0.5.
We assume that a year is made of 365 days and that a day is made of 24 hours.
This computation should be easy to perform: it suffices to use the
twobirthday_naive function with d = 365 · 24. In the following session, we
explore the values of P(E) for n = 20, 21, ..., 25.

-->n = (20:25)';
-->[ n twobirthday_naive (n ,365*24)]
 ans =
    20.    0.0214717378650828294440
    21.    0.0237058206494269452236
    22.    0.0260462518925431707473
    23.    0.0284922544755111806225
    24.    0.0310430168113396964813
    25.    0.0336976934779864567560
We see that the probability is much lower than previously. The problem is
that, for larger values of n, the previous function fails, as can be seen in
the following session.

--> twobirthday_naive (100 ,365*24)
 ans =
    Nan

Indeed, the number of permutations (365 · 24)_100 is too large to be
represented by a double precision floating point number, and evaluates to Inf.
The same issue occurs for (365 · 24)^100. This leads to the ratio Inf/Inf,
which is equal to the IEEE Nan. In order to solve this issue, we can use the
logarithms of the intermediate results. We have

    log(Q(E)) = log( (365)_n / 365^n )                           (149)
              = log( (365)_n ) − log( 365^n )                    (150)
              = log( (365)_n ) − n log(365).                     (151)
This leads to the following twobirthday_lessnaive function, which uses the
permutationslog function.

function p = twobirthday_lessnaive ( n , d )
    q = exp ( permutationslog ( d , n ) - n * log ( d ))
    p = 1 - q
endfunction
We check in the following session that our less naive function is able to
compute the required probability for n = 100.

--> twobirthday_lessnaive (100 ,365*24)
 ans =
    0.4329003041011145747063
In order to search for the number of persons which makes the probability
greater than p = 0.5, we perform a while loop. We quit this loop when the
probability breaks the threshold.

n = 1;
while ( %t )
    p = twobirthday_lessnaive ( n , 365*24 );
    if ( p > 0.5 ) then
        mprintf ( " n = %d , p = %e \ n " , n , p )
        break
    end
    n = n + 1;
end
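The same search can be sketched in Python (p_shared is an illustrative helper
mirroring the logarithm-based computation):

```python
import math

def p_shared(n, d):
    # P(E) = 1 - (d)_n / d^n, computed through the log-gamma function
    log_q = math.lgamma(d + 1) - math.lgamma(d - n + 1) - n * math.log(d)
    return 1 - math.exp(log_q)

n = 1
while p_shared(n, 365 * 24) <= 0.5:
    n += 1
print("n =", n)   # the first group size whose probability exceeds 0.5
```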
We are now interested in computing the probability of getting two persons with
the same birth day and hour in a group of 500 persons. The following session
shows the result of calling the twobirthday_lessnaive function with n = 500.

-->p = twobirthday_lessnaive ( 500 , 365*24 )
 p =
    0.9999995
This probability is very close to 1. In order to get more significant digits, we use
the format function in the following session.
--> format (25)
-->p
p =
0.9999995054082023715480
This allows us to get a larger number of significant digits. But the result is
accurate to at most 17 significant digits. Since 6 of these digits are 9s,
there are only 17 − 6 = 11 digits available for the required probability p.
The reason for this inaccuracy is that q is very close to zero, which makes
p = 1 − q very close to 1. This implies that, because of our way of computing
the probability p, we can at best expect 11 accurate digits for p. We
emphasize that this is independent of the actual accuracy of the intermediate
computations: it is only a consequence of the way the solution of the problem
is represented with double precision floating point numbers.
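The loss of digits in p = 1 − q can be demonstrated in a few lines of Python
(the value of q below is illustrative):

```python
q = 4.945917976446e-07    # a probability close to zero (illustrative value)
p = 1 - q                 # p is close to 1, where the spacing of doubles is ~1.1e-16
q_back = 1 - p            # recovering q from p loses its low-order digits
# only about 10 of q's digits survive the round trip through p
assert abs(q_back - q) / q < 1e-9
```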
One possible solution is to compute q instead of p. Indeed, the q value is
close to zero, where floating point numbers can be represented with limited
but sufficient accuracy. The following twobirthday function computes both p
and q.

function [ p , q ] = twobirthday ( n , d )
    q = exp ( permutationslog ( d , n ) - n * log ( d ))
    p = 1 - q
endfunction

In the following session, we display the result of the computation and use the
mprintf function with the %.17e format in order to display 17 digits after the
decimal point.
We now have all the available digits for q, and we know that the probability
of having two persons with the same birth day and hour in a group of n = 500
persons is close to

    p = 1 − 4.94591797644640870 × 10^−7.
But this does not imply that all these digits are exact. Indeed, floating
point evaluations of elementary operators like +, -, *, / and of elementary
and special functions like exp, log and gamma are associated with rounding
errors and various approximations.

We have computed the probability for n = 500 with the symbolic computation
system Wolfram Alpha [16], using the expression

    (365*24)!/(365*24 - 500)!/((365*24)^500)
rounded to 17 digits. In the following session, we compute the relative error
between the computed and the exact probabilities.

-->z = 4.9459179764020772 e -7
 z =
    0.0000005
--> abs (q - z )/ z
 ans =
    8.963D-12
-->- log10 ( abs (q - z )/ z )
 ans =
    11.047534

We see that there are approximately 11 accurate digits in this case, which is
sufficiently accurate for our purpose.
2.12
Combinations
In this section, we present combinations, which are unordered subsets of a
given set. The number of distinct subsets with j elements which can be chosen
from a set A with n elements is the binomial coefficient, and is denoted by
C(n, j). The following proposition gives an explicit formula for this number.

Proposition 2.13. (Binomial) The number of distinct subsets with j elements
which can be chosen from a set A with n elements is the binomial coefficient,
defined by

    C(n, j) = n (n − 1) ... (n − j + 1) / (1 · 2 ··· j).         (152)
The following proof is based on the fact that subsets are unordered, while
permutations are based on the order.

Proof. Assume that the set A has n elements and consider subsets with j > 0
elements. By proposition 2.3, the number of j-permutations of the set A is
(n)_j = n (n − 1) ... (n − j + 1). Notice that the order does not matter in
creating the subsets, so that the number of subsets is lower than the number
of permutations. This is why each subset is associated with one or more
permutations. By proposition 2.2, there are j! ways to order a set with j
elements. Therefore, the number of subsets with j elements is given by
C(n, j) = n (n − 1) ... (n − j + 1) / (1 · 2 ··· j), which concludes the
proof.
The expression for the binomial coefficient can be simplified if we use the
number of j-permutations and the factorial number, which leads to

    C(n, j) = (n)_j / j!.                                        (153)

The equality (n)_j = n! / (n − j)! leads to

    C(n, j) = n! / ((n − j)! j!).                                (154)

The previous equation implies the symmetry property

    C(n, j) = C(n, n − j).                                       (155)
2.13
function c = nchoosek ( n , j )
    c = exp ( gammaln ( n +1) - gammaln ( j +1) - gammaln (n - j +1))
    if ( and ( round ( n )== n ) & and ( round ( j )== j ) ) then
        c = round ( c )
    end
endfunction
In the following session, we compute the value of the binomial coefficients
for n = 0, 1, ..., 5. The values in this table are known as Pascal's triangle.

--> for n = 0 : 5
-->     for j = 0 : n
-->         c = nchoosek ( n , j );
-->         mprintf ( " %2d " , c );
-->     end
-->     mprintf ( " \ n " );
--> end
 1
 1  1
 1  2  1
 1  3  3  1
 1  4  6  4  1
 1  5 10 10  5  1
We now explain why we chose to use the exp and gammaln functions to perform
the computation in the nchoosek function. Indeed, we could have used a more
naive method, based on the prod function, as in the following example:

function c = nchoosek_naive ( n , j )
    c = prod ( n : -1 : n - j +1 )/ prod (1: j )
endfunction
For small integer values of n, the two previous functions produce the same
result. Unfortunately, even for moderate values of n, the naive method fails.
In the following session, we compute the value of C(n, j) with n = 10000 and
j = 134.

--> nchoosek ( 10000 , 134 )
 ans =
    2.050D+307
--> nchoosek_naive ( 10000 , 134 )
 ans =
    Inf
The reason why the naive computation fails is that the products involved in
the intermediate variables of the naive method generate an overflow: the
intermediate values are too large to be stored in a double precision floating
point variable. This is a pity, since the final result can be stored in a
double precision floating point variable. The nchoosek function, on the other
hand, first computes the logarithm of the factorial numbers. This logarithm
cannot overflow, because if x is a double precision floating point number,
then log(x) can always be represented, since its exponent is always smaller
than the exponent of x. In the end, the combination of the exp and gammaln
functions computes the result accurately, in the sense that, if the result is
representable as a double precision floating point number, then nchoosek will
produce a result as accurate as possible.
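The same technique can be sketched in Python with the lgamma function;
nchoosek_log is a hypothetical name:

```python
import math

def nchoosek_log(n, j):
    # logarithm of the binomial coefficient, through the log-gamma function
    return math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)

# The value of C(10000, 134) is representable as a double, even though the
# naive intermediate products overflow:
c = math.exp(nchoosek_log(10000, 134))
assert math.isfinite(c) and c > 1e307
```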
Notice that we use the round function in our implementation of the nchoosek
function. This is because the nchoosek function manages, in fact, real double
precision floating point input arguments. Consider the example where n = 4 and
j = 1, and let us compute the associated binomial coefficient C(n, j). In the
following Scilab session, we use the format function so that we display at
least 15 significant digits.

--> format (20);
-->n = 4;
-->j = 1;
-->c = exp ( gammaln ( n +1) - gammaln ( j +1) - gammaln (n - j +1))
 c =
    3.9999999999999822

We see that there are 15 significant digits, which is the best that can be
expected from the exp and gammaln functions. But the result is not an integer
anymore: it is very close to the integer 4, but not exactly equal to it. This
is why, in the nchoosek function, if n and j are both integers, we round the
number c to the nearest integer with a call to the round function.
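The rounding step can be sketched in Python as well; this simplified version
assumes integer inputs:

```python
import math

def nchoosek(n, j):
    # exp/lgamma returns a value only close to the integer result: round it back
    c = math.exp(math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1))
    return round(c)

assert nchoosek(4, 1) == 4    # exp/lgamma alone gives roughly 3.9999999999999996
assert nchoosek(6, 3) == 20
```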
Finally, notice that our implementation of the nchoosek function uses the
function and. This allows the input variables to be arrays of integers. In the
following session, we compute C(5, j), for j = 0, 1, ..., 5, in one single
call. This is a consequence of the fact that the exp and gammaln functions
both accept matrix input arguments.

-->n = 5 * ones (6 ,1);
-->j = (0:5)';
-->c = nchoosek ( n , j );
-->[ n j c ]
 ans =
    5.    0.    1.
    5.    1.    5.
    5.    2.    10.
    5.    3.    10.
    5.    4.    5.
    5.    5.    1.
As we will see later in this document, the number of combinations appears in
several probability computations. For example, this function is used as an
intermediate computation in the hypergeometric distribution function, which
will be presented later in this document. We will see that the numerical issue
associated with the use of floating point numbers is solved by the use of the
logarithm of the number of combinations, and this is why we now focus on this
computation.

Let us introduce the function c_log as the logarithm of the binomial
coefficient:

    c_log(n, j) = log( C(n, j) ).                                (159)

The functions f_ln and c_log are related by the equation

    C(n, j) = n! / ((n − j)! j!).                                (160)
[Figure 14: Cards of a 52-card deck: each of the ranks 1, 2, ..., 10, J, Q, K
appears in each of the four suits. J stands for Jack, Q stands for Queen and K
stands for King.]
Name             Description                                       Example
no pair          none of the below combinations                    7 3 6 3 1
pair             two cards of the same rank                        Q Q 2 3 1
double pair      two times two cards of the same rank              2 2 Q Q
three of a kind  three cards of the same rank                      2 2 2 3 1
straight         five cards in a sequence, not all the same suit   3 4 5 6 7
flush            five cards in a single suit                       2 3 7 J K
full house       one pair and one triple, each of the same rank    2 2 2 Q Q
four of a kind   four cards of the same rank                       5 5 5 5 2
straight flush   five in a sequence in a single suit               2 3 4 5 6
royal flush      10, J, Q, K, 1 in a single suit                   10 J Q K 1

Figure 15: The combinations of the poker game.
This leads to

    c_log(n, j) = f_ln(n) − f_ln(n − j) − f_ln(j).               (161)

The following nchooseklog function computes c_log(n, j) from the equation 161.

function c = nchooseklog ( n , k )
    c = gammaln ( n + 1 ) - gammaln ( k + 1) - gammaln ( n - k + 1)
endfunction
2.14
In the following example, we use Scilab to compute the probabilities of poker
hands. The poker game is based on a 52-card deck, which is presented in figure
14. Each card has one of the 13 available ranks, from 1 to K, and one of the 4
available suits: spades, hearts, diamonds and clubs. Each player receives 5
cards randomly chosen in the deck. Each player tries to combine the cards to
form a well-known combination of cards, as presented in the figure 15.
Depending on the combination, the player can beat, or be defeated by, another
player. The winning combination is the rarest, that is, the one which has the
lowest probability. In figure 15, the combinations are presented in decreasing
order of probability.

Even if winning at this game requires some understanding of human psychology,
understanding probabilities can help. Why does the four of a kind beat the
full house?
To answer this question, we will compute the probability of each event. Since
the order of the cards can be changed by the player, we are interested in
combinations (and not in permutations). We make the assumption that the
process of choosing the cards is really random, so that all combinations of 5
cards have the same probability, i.e. the distribution function is uniform.
Since the order of the cards does not matter, the sample space Ω is the set of
all combinations of 5 cards chosen from 52 cards. Therefore, the size of Ω is

    #(Ω) = C(52, 5) = 2598960.                                   (162)

The probability of a four of a kind is computed as follows. In a 52-card deck,
there are 13 different four of a kind combinations. Since the 5-th card is
chosen at random from the 48 remaining cards, there are 13 · 48 different four
of a kind hands. The probability of a four of a kind is therefore

    P(four of a kind) = 13 · 48 / C(52, 5) = 624 / 2598960 ≈ 0.0002401.  (163)
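Both numbers are easy to cross-check in Python, whose math.comb function
computes binomial coefficients exactly:

```python
import math

# Size of the sample space: all 5-card hands from a 52-card deck
assert math.comb(52, 5) == 2598960

# Four of a kind: 13 ranks for the quadruple, 48 choices for the fifth card
p_four = 13 * 48 / math.comb(52, 5)
assert abs(p_four - 0.0002401) < 1e-7
```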
2.15
Bernoulli trials
In this section, we present Bernoulli trials and the binomial discrete
distribution function. We use a coin tossed several times as an example of
such a process.

Definition 2.15. A Bernoulli trials process is a sequence of n > 0 experiments
with the following rules.
[Figure 16: A Bernoulli process with 3 trials. The letter S indicates success
and the letter F indicates failure; each success has probability p and each
failure has probability q.]
1. Each experiment has two possible outcomes, which we may call success and
failure.

2. The probability p ∈ [0, 1] of success is the same for each experiment.

In a Bernoulli process, the probability p of success is not changed by any
knowledge of previous outcomes. For each experiment, the probability q of
failure is

    q = 1 − p.                                                   (164)

It is possible to represent a Bernoulli process with a tree diagram, such as
the one in figure 16.
A complete experiment is a sequence of successes and failures, which can be
represented by a sequence of S's and F's. Therefore, the size of the sample
space is #(Ω) = 2^n, which is equal to 2^3 = 8 in our particular case of 3
trials. By definition, the result of each trial is independent of the results
of the previous trials. Therefore, the probability of an outcome is the
product of the probabilities of its successive trials.

Consider the outcome x = SFS, for example. The value of the distribution
function f for this outcome is

    f(x = SFS) = p q p = p^2 q.                                  (165)
The table 17 presents the value of the distribution function for each outcome
x ∈ Ω. We can check that the sum of the probabilities of all outcomes is equal
to 1. Indeed,

    Σ_{i=1,...,8} f(x_i)
        = p^3 + p^2 q + p^2 q + p q^2 + p^2 q + p q^2 + p q^2 + q^3   (166)
        = p^3 + 3 p^2 q + 3 p q^2 + q^3                               (167)
        = (p + q)^3                                                   (168)
        = 1.                                                          (169)
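The identity above can be verified numerically. The following Python sketch
enumerates the 8 outcomes for the illustrative values p = 0.3, q = 0.7:

```python
import math
from itertools import product

p, q = 0.3, 0.7
# sum the probability of every sequence of successes (S) and failures (F)
total = sum(
    math.prod(p if c == "S" else q for c in outcome)
    for outcome in product("SF", repeat=3)
)
assert math.isclose(total, 1.0)   # (p + q)^3 = 1
```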
    x     f(x)
    SSS   p^3
    SSF   p^2 q
    SFS   p^2 q
    SFF   p q^2
    FSS   p^2 q
    FSF   p q^2
    FFS   p q^2
    FFF   q^3

Figure 17: The distribution function of a Bernoulli process with 3 trials.

Grouping the outcomes which have the same probability leads to

    f(SSS)                   = p^3                               (170)
    f(SSF) + f(SFS) + f(FSS) = 3 p^2 q                           (171)
    f(SFF) + f(FSF) + f(FFS) = 3 p q^2                           (172)
    f(FFF)                   = q^3.                              (173)
The following proposition extends the previous analysis to the general case.

Proposition 2.16. (Binomial probability) In a Bernoulli process with n > 0
trials with success probability p ∈ [0, 1], the probability of exactly j
successes is

    b(n, p, j) = C(n, j) p^j q^(n−j),                            (174)

where 0 ≤ j ≤ n and q = 1 − p.

Proof. We denote by A the event that one process is associated with exactly j
successes. By definition, the probability of the event A is

    b(n, p, j) = P(A) = Σ_{x ∈ A} f(x).                          (175)

Each outcome x ∈ A contains exactly j successes and n − j failures, so that
f(x) = p^j q^(n−j). Therefore,

    b(n, p, j) = #(A) p^j q^(n−j).                               (176)

The size of the set A is the number of subsets with j elements in a set of
size n. Indeed, the order does not matter, since we only require that, during
the whole process, the total number of successes is exactly j, no matter the
order of the successes and failures. The number of outcomes with exactly j
successes is therefore #(A) = C(n, j), which, combined with the equation 176,
concludes the proof.
Example 2.3 A fair coin is tossed six times. What is the probability that
exactly 3 heads turn up? This process is a Bernoulli process with n = 6
trials. Since the coin is fair, the probability of success at each trial is
p = 1/2. We can apply the proposition 2.16 with j = 3 and get

    b(6, 1/2, 3) = C(6, 3) (1/2)^3 (1/2)^3 = 0.3125,             (177)

so that the probability of having exactly 3 heads is 0.3125.
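The formula of the proposition 2.16 is easy to sketch in Python; binom_pdf is
a hypothetical helper:

```python
import math

def binom_pdf(j, n, p):
    # b(n, p, j) = C(n, j) * p^j * q^(n-j), with q = 1 - p (equation 174)
    return math.comb(n, j) * p ** j * (1 - p) ** (n - j)

assert binom_pdf(3, 6, 0.5) == 0.3125
```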
2.16
We can check that our implementation is correct by comparing it against two
simple examples.

Example 2.4 A fair coin is tossed six times. What is the probability that
exactly 3 heads turn up?

-->n = 6; pb = 0.5;
-->p = binopdf_naive ( 3 , n , pb )
 p =
    0.3125
Example 2.5 Assume that we work in a factory producing n = 200 puffins each
day. The probability of producing a defective puffin is 2%. What is the
probability that exactly 0 defective puffins are produced?

-->n = 200; pb = 2/100;
-->p = binopdf_naive ( 0 , n , pb )
 p =
    0.0175879
We can now consider an example where the success of each Bernoulli trial is
far more likely. Assume that the probability of success is p = 1 − 10^−20,
i.e. there is an extremely strong probability of success. Assume that we make
n = 100 trials. What is the probability of having 90 successes?

-->n = 100; pb = 1 - 1. e -20;
--> binopdf_naive ( 90 , n , pb )
 ans =
    0.
In order to check our computation, we can use Wolfram Alpha [16], with the
expression:

    nchoosek(100,90) * (1-10^-20)^90 * (10^-20)^(100-90)

The exact result is

    p = 1.73103094564399999... × 10^−187.                        (178)
This number is small, but it is representable by the double precision floating
point numbers used in Scilab. The reason for the failure of the naive
implementation is that the probability of success is so close to 1 that it has
been rounded to one by Scilab, as shown in the following session.

--> format ( " e " ,25)
--> pb = 1 - 1. e -20
 pb =
    1.0000000000000000000D+00

Hence the probability of failure qb is represented by the floating point
number zero, leading to an inaccurate computation. In fact, any failure
probability smaller than the machine epsilon, which is approximately 10^−16,
would lead to the same issue.
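This rounding is easy to reproduce in any IEEE 754 environment, for example in
Python:

```python
# A success probability within machine epsilon of 1 is rounded to exactly 1.0,
# so the complementary probability computed as 1 - pb is exactly 0:
pb = 1 - 1e-20
assert pb == 1.0
assert 1 - pb == 0.0
```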
The following binopdf_lessnaive function is a more accurate implementation
of the Binomial distribution function. Instead of computing the complementary
probability qb from pb, it takes it directly as an input argument.
function p = binopdf_lessnaive ( x , n , pb , qb )
p = nchoosek (n , x ) .* pb ^ x .* qb ^( n - x )
endfunction
In the previous example, there is an obvious difference between the naive and
the accurate implementations. But there are cases where the difference between
the two functions is less obvious, leading to the false feeling that the naive
implementation is accurate. This happens in particular when the complementary
probability q is close to zero, but larger than the machine precision, that
is, larger than approximately 10^−16 in the context of Scilab. In this case,
the computed value of qb = 1 - pb is nonzero, but does not have full accuracy:
its digits are mainly driven by rounding errors.
There is still something wrong with our less naive implementation. Indeed,
consider the following case, where we use a probability pb close to one and a
particularly chosen value x.

-->n = 1. e9 ; pb = 1 - 1. e -14; qb = 1. e -14;
-->x = 1. e9 - 48
 x =
    999999952.
--> binopdf_lessnaive ( x , n , pb , qb )
 ans =
    Nan
We can compute the exact result with Wolfram Alpha and get
e = 8.0553864299160721e-302
The previous number is close to the lower limit of the double precision normalized
floating point numbers. We can analyze what happens here by computing the
intermediate terms which appear in the computation, as in the following session.
--> nchoosek (n , x )
ans =
Inf
--> pb ^ x
ans =
0.9999900080431765037048
--> qb ^( n - x )
ans =
0.
In the following session, we check that our implementation gives an accurate result.
-->n = 1. e9 ; pb = 1 - 1. e -14; qb = 1. e -14;
-->x = 1. e9 - 48
x =
999999952.
--> binopdf ( x , n , pb , qb )
ans =
    8.055383937519171497D-302
By looking more closely at the previous result, we see that the order of magnitude
is good, but that not all digits are exact. In the following session, we compute the
number of significant decimal digits from the formula d = -log10(|p - e|/e), where
e is the exact result and p is the computed result.
-->d = - log10 ( abs (p - e )/ e )
 d =
    6.5094691877632762100347
We see that the accurate implementation has about 6 significant digits, which is
far less than the maximum achievable precision. Indeed, the maximum achievable
precision is 15 significant digits for Scilab doubles. In order to measure the sensitivity
of the output argument with respect to the input argument, we compute the
condition number of the binomial distribution function for this particular value of x.
In the following session, we compute the probability p1 of a slightly modified input
argument x + 2 * %eps * x. Then we compute the relative difference of the input
rx and the relative difference of the output ry. The condition number is defined as
the ratio between these two relative differences.
--> p1 = binopdf ( x + 2 * %eps * x , n , pb , qb )
 p1 =
    8.055461212394521478D-302
--> rx = 2* %eps
 rx =
    4.440892098500626162D-16
--> ry = abs ( p1 - p )/ p
 ry =
    9.817532328886864651D-07
-->c = ry / rx
 c =
    2.210711746903634071D+09
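The same measurement can be packaged as a generic helper; here is a Python sketch, where the test function f(x) = x^10 and the point x = 2 are arbitrary illustrations (the exact condition number of x^10 is |x f'(x)/f(x)| = 10).

```python
import sys

def condition_number(f, x):
    """Estimate the relative condition number of f at x: the ratio of
    the relative change of the output to the relative change of the input."""
    rx = 2.0 * sys.float_info.epsilon            # relative perturbation of x
    ry = abs(f(x * (1.0 + rx)) - f(x)) / abs(f(x))
    return ry / rx

# For f(x) = x**10, the estimate should be close to 10.
cond10 = condition_number(lambda x: x ** 10, 2.0)
print(cond10)
```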
We see that the condition number is close to 2 × 10^9. This means that a relatively
small variation of the input argument x generates a relatively large change in the
output probability p. In our particular case, a very small variation of x can make
the probability vary suddenly from values close to zero to values close to 1.
Hence, the computation is numerically difficult and this explains why the accuracy of the binopdf function is sometimes less than maximum. We emphasize
that this is not a problem which is specific to the binopdf function: it is, indeed, a
problem which comes from the behaviour of the function itself, which varies greatly
for particular values of the input argument x.
2.17
2.18
We can see that the naive implementation fails to compute h(x = 200, m =
1030, k = 500, n = 515), whose exact value is approximately 1.65570 × 10^-10. This example is
used by Yalta in [21] to prove that Excel 2007 is inaccurate with respect to the
hypergeometric function.
-->m =1030; k =500; n =515;
--> hygepdf_naive ( 200 , m , k , n )
ans =
0.
The reason for this failure is that some intermediate term involved in the
computation of h(x, m, k, n) is too large to be represented in a double precision
floating point number. Indeed, the computation of h(x = 200, m = 1030, k =
500, n = 515) involves the computation of the intermediate terms \binom{500}{200},
\binom{530}{315} and \binom{1030}{515}. The following session shows that the last term is represented as the
Infinity number of the IEEE 754 standard.
--> nchoosek (1030 ,515)
ans =
Inf
We can solve the problem by computing first the logarithm of the probability
and then exponentiating the result. Let us introduce the function hlog defined by

hlog(x, m, k, n) = log(h(x, m, k, n))    (182)
= log(\binom{k}{x}) + log(\binom{m-k}{n-x}) - log(\binom{m}{n}),    (183)

so that h(x, m, k, n) = exp(hlog(x, m, k, n)). The computation of the log-combination
function, i.e. the function clog(n, k) = log(\binom{n}{k}), has been presented in the section 2.13.
The following hygepdf function finally computes the hypergeometric function
from the equations 182 and 183.
function p = hygepdf ( x , m , k , n )
    c1 = nchooseklog ( k , x )
    c2 = nchooseklog ( m - k , n - x )
    c3 = nchooseklog ( m , n )
    p_log = c1 + c2 - c3
    p = exp ( p_log )
endfunction
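The same log-space trick can be sketched in Python with math.lgamma; the function names mirror the Scilab ones, but the code below is an illustration, not the original implementation.

```python
from math import exp, lgamma

def nchooseklog(n, k):
    """log of the binomial coefficient C(n, k), computed without
    forming the (possibly huge) coefficient itself."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def hygepdf(x, m, k, n):
    """Hypergeometric pdf h(x, m, k, n) in the Matlab parameter order:
    m balls in the urn, k marked balls, n balls drawn."""
    p_log = nchooseklog(k, x) + nchooseklog(m - k, n - x) - nchooseklog(m, n)
    return exp(p_log)

# Small case checked by hand: C(4,1)*C(6,2)/C(10,3) = 4*15/120 = 0.5
print(hygepdf(1, 10, 4, 3))
# Yalta's test case, about 1.6557e-10, no longer overflows:
print(hygepdf(200, 1030, 500, 515))
```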
2.19
Notes and references
Combinatorics topics presented in section 2 can be found in [7], chapter 3, Combinatorics. The example of Poker hands is presented in [7], while the probabilities of
all Poker hands can be found in Wikipedia [20].
Parts of the section 2.8, dedicated to Stirling's formula, are based on [7], chapter
3, Combinatorics.
There is some confusion related to the hypergeometric distribution function,
since many texts use various symbols for the distribution function, its parameters
and the order of these parameters. The notation for the hypergeometric distribution
function h presented in section 2.17 is taken from Matlab. The letters chosen for
the parameters and the order of the arguments are the ones chosen in Matlab.
In an earlier version of this document, we considered [7], section 5.1, Important
Distributions. Indeed, Grinstead and Snell chose to denote the number of items
(or balls) in the urn by the capital letter N, which may lead to bugs in Scilab
implementations, because of the variable n, denoting the number of samples (number
of balls selected). Also, in Matlab, the total number of balls in the urn (i.e. m)
comes as the first parameter of the distribution function, while it is the last one in
Grinstead and Snell. For practical reasons, we chose to keep Matlab's choice.
The gamma function presented in section 2.3 is covered in many textbooks, as in
[1]. An in-depth presentation of the gamma function is done in [18].
2.20
Exercises
Exercise 2.1 (Recurrence relation of binomial) Prove the proposition 2.14, which is the
following. For integers n > 0 and 0 < j < n, the binomial coefficients satisfy

\binom{n}{j} = \binom{n-1}{j} + \binom{n-1}{j-1}.    (184)
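The recurrence of equation 184 can be checked numerically, for example in Python with the exact integer arithmetic of math.comb (an illustrative check, not a proof):

```python
from math import comb

# Check binom(n, j) = binom(n-1, j) + binom(n-1, j-1)
# exhaustively for small n.
for n in range(1, 20):
    for j in range(1, n):
        assert comb(n, j) == comb(n - 1, j) + comb(n - 1, j - 1)
print("recurrence verified for n < 20")
```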
Exercise 2.2 (Number of subsets) Assume that Ω is a finite set with n ≥ 0 elements. Prove
that there are 2^n subsets of Ω. Consider that ∅ and Ω are subsets of Ω.
Exercise 2.3 (Probabilities of Poker hands) Why does computing the probability of a straight
flush force us to take into account the probability of the royal flush ? Explain other possible conflicts
between Poker hands. Compute the probabilities for all Poker hands in figure 15.
This exercise is partly given in [7], in section 3.2, Combinations.
Exercise 2.4 (Bernoulli trial for a die experiment) A die is rolled n = 4 times. What is the
probability that we obtain exactly one 6 ? What is the probability for n = 1, 2, . . . , 12 ?
Exercise 2.5 (Probability of a flight crash) Assume that there are 20 000 flights of airplanes
each day in the world. Assume that there is one accident every 500 000 flights. What is the
probability of getting exactly 5 crashes in 22 days ? What is the probability of getting at least 5
crashes in 22 days ? What is the probability of getting exactly 3 crashes in 42 days ? What is the
probability of getting at least 3 crashes in 42 days ? Consider now that the year is a sequence of
16 periods of 22 days (ignoring the 13 days left in the year). What is the probability of having one
period in the year which contains at least 5 crashes ?
This exercise is presented in La loi des series noires, by Janvresse and de la Rue [9].
Exercise 2.6 (Binomial function maximum) Consider the discrete distribution function of a
Bernoulli process, as defined by 2.16. Show that

b(n, p, j) = (p/q) · ((n - j + 1)/j) · b(n, p, j - 1),    (185)

for j ≥ 1. Compute j_m ≥ 1 so that b(n, p, j) is maximum, i.e. so that

b(n, p, j_m) ≥ b(n, p, j),  1 ≤ j ≤ n.    (186)
Consider the experiment presented in section 3.4, which consists in tossing a coin 10 times and
counting the number of heads. With a Scilab simulation, can you find the number of heads which
is the most likely to occur ?
This exercise is given in [7], in the exercise part of section 3.2, Combinations.
57
Exercise 2.7 (Binomial coefficients and Pascals triangle) This exercise is given in [7], in
chapter 3, Combinatorics. Let a, b be two real numbers and let n be a positive integer. Prove
the binomial theorem which states that
(a + b)^n = \sum_{j=0,n} \binom{n}{j} a^j b^{n-j}.    (187)

The binomial coefficients \binom{n}{j} can be written in a triangle, where each line corresponds to n and
each position in the line corresponds to j, as in the following array

A =
1.
1. 1.
1. 2. 1.
1. 3. 3. 1.
1. 4. 6. 4. 1.    (188)
Use the binomial theorem in order to prove that the sum of the terms in the n-th row is 2n . Prove
that if the terms are added with alternating signs, then the sum is zero.
Binomial coefficients can also be represented in a matrix called Pascals matrix, where the
binomial coefficients are stored in the diagonals of the matrix.
A =
1.  1.  1.  1.  1.
1.  2.  3.  4.  5.
1.  3.  6. 10. 15.    (189)

Exercise 2.8 Prove that

\sum_{j=0,n} \binom{n}{j}^2 = \binom{2n}{n}.    (190)

To help to prove this result, consider a set with 2n elements, where n elements are red and n
elements are blue. Compute the number of ways to choose n elements in this set.
This exercise is given in [7], in chapter 3, Combinatorics.
Exercise 2.9 (Earthquakes and predictions) Assume that a person predicts the dates of
major earthquakes (with magnitude larger than 6.5 or with a large number of deaths, etc...) in
the world during 3 years, i.e. in a period of 1096 days. Assume that the specialist predicts 169
earthquakes. Assume that, during the same period, 196 major earthquakes really occur, so that 33
earthquakes were correctly predicted by the specialist. What is the probability that earthquakes
are predicted by chance ?
This exercise is presented by Charpak and Broch in [4].
Exercise 2.10 (Log-factorial function) There is another possible implementation of the
log-factorial function. Indeed, we have, by definition,

n! = 1 · 2 · . . . · n,    (191)

which implies

log(n!) = log(1) + log(2) + . . . + log(n),    (192)

i.e.

fln(n) = \sum_{i=1,n} log(i).    (193)
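The log-factorial sum of exercise 2.10 can be checked numerically; in Python, math.lgamma(n + 1) provides a reference value for log(n!) (an illustrative check, not part of the exercise):

```python
from math import lgamma, log

def fln(n):
    # log-factorial as the sum log(1) + log(2) + ... + log(n)
    return sum(log(i) for i in range(1, n + 1))

# Compare with the gamma-function identity log(n!) = lgamma(n + 1).
for n in (1, 5, 10, 100):
    assert abs(fln(n) - lgamma(n + 1)) < 1e-9 * max(1.0, lgamma(n + 1))
print("log-factorial identity verified")
```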
3 Simulation of random processes
In this section, we present how to simulate random events with Scilab. The problem
of generating random numbers is more complex and will not be detailed in this
chapter. We begin with a brief overview of random number generation and detail
the random number generator used in the rand function. Then we analyze how to
generate random numbers in the interval [0, 1] with the rand function. We present
how to generate random integers in a given interval [0, m - 1] or [m1, m2]. In the
final part, we present a practical simulation of a game based on tossing a coin.
3.1
Overview
Most uniform random number generators produce real numbers of the form

u_n = x_n / m,    (194)

where m is a large integer and x_n is a positive integer so that 0 < x_n < m. In many
random number generators, the integer x_{n+1} is computed from the previous element
in the sequence x_n.
The linear congruential generators [10] are based on the sequence

x_{n+1} = (a x_n + c) (mod m),    (195)

where
m is the modulus, satisfying m > 0,
a is the multiplier, satisfying 0 ≤ a < m,
c is the increment, satisfying 0 ≤ c < m,
x_0 is the starting value, satisfying 0 ≤ x_0 < m.
The parameters m, a, c and x_0 should satisfy several conditions so that the
sequence x_n has good statistical properties. Indeed, naive approaches lead to poor
results in this matter. For example, consider the case where x_0 = 0, a = c = 7
and m = 10. The following sequence of numbers is produced :

0.7  0.6  0.9  0.  0.7  0.6  0.9  0. . . .    (196)
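This degenerate behaviour is easy to reproduce; a minimal Python sketch of the linear congruential recurrence:

```python
def lcg(x0, a, c, m, count):
    """Generate count values u_n = x_n / m of the linear
    congruential sequence x_{n+1} = (a*x_n + c) mod m."""
    x, out = x0, []
    for _ in range(count):
        x = (a * x + c) % m
        out.append(x / m)
    return out

# The poorly chosen parameters x0 = 0, a = c = 7, m = 10
# cycle with period 4:
print(lcg(0, 7, 7, 10, 8))   # [0.7, 0.6, 0.9, 0.0, 0.7, 0.6, 0.9, 0.0]
```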
Specific rules make it possible to design the parameters of a uniform random number generator.
As a practical example, consider the Urand generator [12] which is used by Scilab
in the rand function. Its parameters are
m = 2^31,
a = 843314861,
c = 453816693,
x_0 arbitrary.
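A Python sketch of this generator, assuming u_n = x_n / m and the starting value x_0 = 0 (an illustration, not the Scilab source):

```python
def urand(x0, count):
    """Urand linear congruential generator:
    x_{n+1} = (a*x_n + c) mod 2^31, u_n = x_n / 2^31."""
    a, c, m = 843314861, 453816693, 2 ** 31
    x, out = x0, []
    for _ in range(count):
        x = (a * x + c) % m
        out.append(x / m)
    return out

u = urand(0, 8)
# the first values match the sequence printed in the text:
# 0.2113249, 0.7560439, ...
print([round(v, 7) for v in u])
```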
The first 8 elements of the sequence are

0.2113249  0.7560439  0.0002211  0.3303271  0.6653811  0.6283918  0.8497452  0.6857310 . . .    (197)

3.2
In this section, we present some Scilab features which make it possible to simulate a discrete
random process.
We assume here that a good source of random numbers is provided.
Scilab provides two functions which generate uniform real numbers in the
interval [0, 1]. These functions are rand and grand. Globally, the grand function
provides many more features than rand. The random number generators which
are used are also of higher quality. For our purpose of presenting the simulation of
random discrete events, the rand function will be sufficient and has the additional advantage
of having a simpler syntax.
The simplest syntax of the rand function is
rand ()
Each call to the rand function produces a new random number in the interval [0, 1],
as presented in the following session.
--> rand ()
ans =
0.2113249
--> rand ()
ans =
0.7560439
--> rand ()
ans =
0.0002211
--> rand ()
 ans =
    0.6040239
--> rand ()
ans =
0.0079647
In most random processes, several random numbers are needed at the same time.
Fortunately, the rand function can generate a matrix of random numbers,
instead of a single value. The user must then provide the number of rows and
columns of the matrix to generate, as in the following syntax.
rand ( nr , nc )
The use of this feature is presented in the following session, where a 2 × 3 matrix of
random numbers is generated.
--> rand (2 ,3)
 ans =
    0.6643966    0.5321420    0.5036204
    0.9832111    0.4138784    0.6850569
3.3
In this section, we present how to use a uniform random number generator to generate integers in a given interval. Assume that, given a positive integer m, we want to
generate random integers in the interval [0, m - 1]. To do this, we can use the rand
function, and multiply the generated numbers by m. We must additionally use the
floor function, which returns the largest integer smaller than or equal to the given number.
The following function returns a matrix with size nr × nc, where the entries are random
integers in the set {0, 1, . . . , m - 1}.
function ri = generateInRange0M1 ( m , nbrows , nbcols )
ri = floor ( rand ( nbrows , nbcols ) * m )
endfunction
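The same floor-based mapping can be sketched in Python with the standard random module (an illustration, not the Scilab code):

```python
import random

def generate_in_range_0_m1(m, count):
    """Random integers in {0, 1, ..., m-1}: scale a uniform
    number in [0, 1) by m, then take the integer part."""
    return [int(random.random() * m) for _ in range(count)]

random.seed(0)
r = generate_in_range_0_m1(5, 1000)
print(min(r), max(r))   # all values stay inside {0, ..., 4}
```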
In the following session, we generate random integers in the set {0, 1, . . . , 4}.
-->r = generateInRange0M1 ( 5 , 4 , 4 )
 r =
    2.    0.    3.    0.
    1.    0.    2.    2.
    0.    1.    1.    4.
    4.    2.    4.    0.
To check that the generated integers are uniform in the interval, we compute the
distribution of 10000 integers in the set {0, 1, . . . , 4}. We use the bar function
to plot the result, which is presented in the figure 18. We check that the probability
of each integer is close to 1/5 = 0.2.
-->r = generateInRange0M1 ( 5 , 100 , 100 );
--> counter = zeros (1 ,5);
--> for i = 1:100
--> for j = 1:100
--> k = r (i , j );
--> counter ( k +1) = counter ( k +1) + 1;
--> end
--> end
--> counter / 10000
 ans =
    0.1976    . . .    0.2005
We emphasize that the previous verifications check that the empirical
distribution function is the expected one, but that does not guarantee that the
uniform random number generator is of good quality. Indeed, consider the sequence
x_n = n (mod 5). This sequence produces uniform integers in the set {0, 1, . . . , 4},
but, obviously, is far from being truly random. Testing uniform random number
generators is a much more complicated problem and will not be presented here.
It is easy to adapt the previous function to various needs. For example, the following
function returns a matrix with size nr × nc, where the entries are random integers
in the set {1, 2, . . . , m}.
function ri = generateInRange1M ( m , nr , nc )
ri = ceil ( rand ( nr , nc ) * m )
endfunction
The following function returns a matrix with size nr × nc, where the entries are random
integers in the set {m1, m1 + 1, . . . , m2}.
function ri = generateInRangeM12 ( m1 , m2 , nr , nc )
f = m2 - m1 + 1
ri = floor ( rand ( nr , nc ) * f ) + m1
endfunction
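The [m1, m2] variant uses the same idea, with f = m2 - m1 + 1 possible values; a Python sketch:

```python
import random

def generate_in_range_m12(m1, m2, count):
    """Random integers in {m1, m1+1, ..., m2}: the target
    set contains f = m2 - m1 + 1 integers."""
    f = m2 - m1 + 1
    return [int(random.random() * f) + m1 for _ in range(count)]

random.seed(0)
r2 = generate_in_range_m12(3, 7, 1000)
print(min(r2), max(r2))   # all values stay inside {3, ..., 7}
```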
3.4
Simulation of a coin
Many practical experiments are very difficult to analyze by theory and, most of the
time, very easy to experiment with a computer. In this section, we give an example
of a coin experiment which is simulated with Scilab. This experiment is simple, so
that we can check that our simulation matches the result predicted by theory. In
practice, when no theory is able to predict a probability, it is much more difficult to
assess the result of a simulation.
The following Scilab function generates a random number with the rand function
and uses the floor function in order to get a random integer, either 1, associated with Head,
or 0, associated with Tail. It prints out the result and returns the value.
// tossacoin --
//   Prints "Head" or "Tail" depending on the simulation.
//   Returns 1 for "Head", 0 for "Tail".
function face = tossacoin ( )
face = floor ( 2 * rand () );
if ( face == 1 ) then
mprintf ( " Head \ n " )
else
mprintf ( " Tail \ n " )
end
endfunction
With such a function, it is easy to simulate the toss of a coin. In the following
session, we toss a coin 4 times. The seed argument of the rand function is used so that the
seed of the uniform random number generator is initialized to 0. This allows us to get
consistent results across simulations.
rand ( " seed " ,0)
face = tossacoin
face = tossacoin
face = tossacoin
face = tossacoin
();
();
();
();
Assume that we are tossing a fair coin 10 times. What is the probability that
we get exactly 5 heads ?
This is a Bernoulli process, where the number of trials is n = 10 and the probability
of success is p = 1/2. The probability of getting exactly j = 5 heads is given by the
binomial distribution and is

P(exactly 5 heads in 10 tosses) = b(10, 1/2, 5)    (199)
= \binom{10}{5} p^5 q^{10-5}    (200)
= 252/1024 ≈ 0.2460938.    (201)
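The exact probability can be computed in Python with the exact integer arithmetic of math.comb:

```python
from math import comb

# b(10, 1/2, 5): number of 5-head outcomes over all 2^10 outcomes
p = comb(10, 5) / 2 ** 10
print(comb(10, 5), p)   # 252 favorable outcomes, probability about 0.246
```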
The following Scilab session shows how to perform the simulation. We
perform 10000 simulations of the process. The floor function is used in combination
with the rand function to generate integers in the set {0, 1}. The sum function is used to count
the number of heads in the experiment. If the number of heads is equal to 5, the
number of successes is updated accordingly.
--> rand ( " seed " ,0);
--> nb = 10000;
--> success = 0;
--> for i = 1: nb
--> faces = floor ( 2 * rand (1 ,10) );
--> nbheads = sum ( faces );
--> if ( nbheads == 5 ) then
-->    success = success + 1;
--> end
--> end
--> pc = success / nb
pc =
0.2507
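The same Monte Carlo experiment can be sketched in Python, where random.getrandbits(1) plays the role of floor(2 * rand()):

```python
import random

random.seed(0)
nb = 10000
success = 0
for _ in range(nb):
    # toss a coin 10 times and count the heads
    heads = sum(random.getrandbits(1) for _ in range(10))
    if heads == 5:
        success += 1
pc = success / nb
print(pc)   # should be close to the exact value 0.2460938
```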
3.5
In this section, we define the binomial distribution function. We give the example
of the Galton board, which is based on a Bernoulli process, and simulate this random
process. We present the Scilab functions which manage the binomial
distribution and present their use on the Galton board example.
Definition 3.1. ( Binomial distribution function) Let n be a positive integer and let
p be a real in the interval [0, 1]. Assume that the random variable B is the number of
successes in a Bernoulli process with parameters n and p. The distribution function
b(n, p, j) of B is the binomial distribution.
A Galton board is a board in which a ball is dropped at the top of the board and
deflected off a number of pins on its way down to the bottom of the board. The
location of the ball at the end of one trial is the result of random deflections either
to the right or to the left. The figure 19 presents a Galton board.
The following function simulates one fall of a ball on a Galton board.
It takes as its input argument the number n of steps in the Galton board, so that
the final number of cups (where the ball falls) is n + 1. At each stage, a random
number is generated with the rand function. If the random number is lower than 1/2, the
ball is deflected to the left. If not, the ball is deflected to the right. Each time
a ball is deflected to the right, the integer jmin is increased. After n deflections,
the variable jmin is the index of the cup into which the ball has fallen. The same
algorithm could be designed with a jmax index, with initial value n + 1. That index
would be decreased each time the ball is deflected to the left.
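The algorithm described above can be sketched in Python as follows (an illustration of the same logic, not the Scilab simulgalton function itself):

```python
import random

def simulgalton(n):
    """Simulate one ball falling through n rows of pins;
    return the index of the cup, from 1 to n+1."""
    jmin = 1
    for _ in range(n):
        if random.random() >= 0.5:   # deflected to the right
            jmin += 1
    return jmin

random.seed(0)
cups = [simulgalton(10) for _ in range(1000)]
print(min(cups), max(cups))   # every index stays between 1 and 11
```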
In the following Scilab script, we perform 10 000 experiments of the Galton board.
The cups variable stores the number of balls in each cup. For each experiment, we
update the number of balls in the cup which has been randomly generated by the
process. The bar function plots the figure.
rand ( " seed " , 0 )
n = 10
cups = zeros (1 , n +1)
nshots = 10000
for k = 1: nshots
j = simulgalton ( n );
cups ( j ) = cups ( j ) + 1;
Figure 20: Simulation of a Galton board with n = 10 stages and 100 simulations.
The line is the binomial distribution function.
end
bar (1: n +1 , cups )
The figures 20, 21 and 22 present the results of the simulation of a Galton board
with 100, 1000 and 10 000 simulations, as bar plots.
In the figure 23, we present the Scilab functions which manage the binomial
distribution. The cdfbin cumulated density function will be presented in the more
general context of cumulated density functions.
In the following session, we use the binomial function to compute the probabilities of the binomial distribution function. The plot function generates the plot
which is presented in the previous figures, where the binomial distribution function
is computed with n = 10 and p = 0.5.
pr = binomial (0.5 , n )
plot (1: n +1 , pr )
In the figures 20, 21 and 22, we see that the bar plot representing the Galton
simulations converges toward the line plot representing the binomial distribution function.
3.6
In this section, we present Scilab features which generate random permutations. This problem is similar to the problem of shuffling a deck of cards. This
can be done by using the perms and grand functions.
The figure 24 presents the Scilab functions which generate random permutations.
The perms function computes all the permutations of the given vector of indices.
If the size of the input vector is n, then the size of the output matrix is n! × n. In the
Figure 21: Simulation of a Galton board with n = 10 stages and 1000 simulations.
Figure 22: Simulation of a Galton board with n = 10 stages and 10 000 simulations.
binomial    returns a matrix containing b(n, p, j) for j = 1, 2, . . . , n
cdfbin      computes the binomial cumulated density function
Figure 23: Scilab commands for the binomial distribution function

perms                     returns all the permutations of the given vector
grand ( i , "prm" , v )   generates i random permutations of the column vector v
Figure 24: Scilab commands to generate random permutations
following session, we use the perms function to compute all the permutations of the vector
(1, 2, 3).
--> perms (1:3)
 ans =
    3.    2.    1.
    3.    1.    2.
    2.    3.    1.
    2.    1.    3.
    1.    3.    2.
    1.    2.    3.
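In Python, the analogue of perms is itertools.permutations; note that the ordering of the output differs from Scilab's:

```python
from itertools import permutations

all_perms = list(permutations((1, 2, 3)))
print(len(all_perms))   # 3! = 6 permutations
print(all_perms[0])     # (1, 2, 3) comes first in Python's ordering
```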
When the number of elements in the array is small, lower than 5 for example,
we may generate all the possible permutations and randomly choose one of them. In
the following session, we use n=4 and store in p all the possible permutations of the
vector (1, 2, 3, 4). Then we generate a random integer j in the interval
[1, n!] and use it to get the permutation stored in the j-th row of p.
-->n = 4
 n =
    4.
-->p = perms ( (1: n ) )
 p =
    4.    3.    2.    1.
    4.    3.    1.    2.
    4.    2.    3.    1.
[...]
    1.    3.    2.    4.
    1.    2.    4.    3.
    1.    2.    3.    4.
-->j = floor ( rand () * factorial ( n ) ) + 1
 j =
    4.
-->v = p (j ,:)
 v =
    4.    2.    1.    3.
The previous method is feasible, but only for very small values of n. Indeed, when
n grows, the required memory grows as fast as n!, which is impractical for even
moderate values of n.
The following randperm function returns a random permutation of the numbers
in the interval [1, n]. It is based on the grand function, which is used
to generate numbers uniform in the interval [0, 1). Then, we use the gsort
function in order to sort the array and compute the order of the integers. Combined,
these two functions generate a random permutation, at the cost of the sort
of a large number of values.
function p = randperm ( n )
[ ignore , p ] = gsort ( grand (1 ,n , " def " ) , " c " ," i " );
endfunction
For moderate values of n, the randperm function performs well, but for large values
of n, sorting the array might be expensive.
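The sort-based trick is language independent: draw one random key per index, then sort the indices by their keys. A Python sketch:

```python
import random

def randperm(n):
    """Random permutation of 1..n, obtained by sorting random
    keys and keeping the induced order of the indices."""
    keys = [random.random() for _ in range(n)]
    return sorted(range(1, n + 1), key=lambda i: keys[i - 1])

random.seed(0)
p = randperm(10)
print(p)
```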
The grand function provides an algorithm to compute a random permutation
over a given array of values. This function is presented in figure 24. In the following
session, we use the grand function three times and get three independent permutations
of the vector (1:10). A side effect of the call to this function is that it
updates the state of the uniform random number generator used by grand.
-->s = grand ( 1 , " prm " , (1:10) )
s =
5.
1.
4.
3.
8.
7.
-->s = grand ( 1 , " prm " , (1:10) )
s =
4.
2.
3.
9.
6.
8.
-->s = grand ( 1 , " prm " , (1:10) )
s =
7.
4.
8.
9.
5.
2.
2.
10.
6.
9.
10.
7.
5.
1.
10.
1.
6.
3.
The source code used by grand to produce random permutations has been implemented
by Bruno Pinon. The algorithm is presented in [10], section 3.4.2, Random
sampling and shuffling. According to Knuth, this algorithm was first published by
Moses and Oakford [13] in 1963 and by Durstenfeld [5] in 1964.
The following genprm function is a simplified version of this algorithm. We assume
that the size of the input matrix x is n. The algorithm proceeds in n steps. At step i,
we compute an integer k as a random integer uniform in the interval [i, n].
Then we exchange the values at indices i and k.
function x = genprm ( x )
n = size (x , " * " )
for i = 1: n
t = grand (1 ,1 , " unf " ,0 ,1)
k = floor ( t * ( n - i + 1)) + i
elt = x ( k )
x(k) = x(i)
x ( i ) = elt
end
endfunction
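The same exchange-based shuffle can be sketched in Python (zero-based indices, so position i is exchanged with a random position in [i, n-1]):

```python
import random

def genprm(x):
    """In-place random permutation of the list x (Knuth shuffle)."""
    n = len(x)
    for i in range(n):
        k = random.randrange(i, n)   # random index in [i, n-1]
        x[i], x[k] = x[k], x[i]      # exchange positions i and k
    return x

random.seed(0)
perm = genprm(list(range(1, 11)))
print(perm)
```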
In the following session, we use the genprm function in order to generate three
independent permutations of the vector (1:10).
--> genprm ( 1:10 )
 ans =
    10.    2.    9.    8.    5.    3.    4.    7.    6.    1.
--> genprm ( 1:10 )
 ans =
    7.    4.    2.    9.    3.    1.    6.    5.    10.    8.
--> genprm ( 1:10 )
 ans =
    5.    2.    6.    3.    4.    10.    9.    1.    7.    8.
3.7
Acknowledgments
I would like to thank John Burkardt for his comments about the numerical computation
of the permutation function. Thanks are also addressed to Samuel Gougeon,
who suggested to improve the performance of the computation of the Pascal matrix.
5 Answers to exercises
5.1
Answer of Exercise 1.1 (Head and tail) Assume that we have a coin which is tossed twice. We
record the outcomes so that the order matters, i.e. the sample space is Ω = {HH, HT, TH, TT}.
Assume that the distribution function is uniform, i.e. the head and the tail have an equal
probability. The size of the sample space is #(Ω) = 4. The distribution is uniform so that
P(x) = 1/4, for all x ∈ Ω.
1. What is the probability of the event A = {HH, HT, TH} ? The number of elements in the
event is #(A) = 3. By proposition 1.9, the probability of the event A is

P(A) = #(A)/#(Ω) = 3/4.    (202)

2. For an event A with #(A) = 2 elements, the probability is

P(A) = #(A)/#(Ω) = 2/4 = 1/2.    (203)
Answer of Exercise 1.2 (Two dice) Assume that we are rolling a pair of dice. Assume that
each face has an equal probability. The sample space is

Ω = {(i, j) / i, j = 1, . . . , 6}.    (204)

The size of the sample space is #(Ω) = 36. The distribution is uniform so that

P(x) = 1/36, for all x ∈ Ω.    (205)

1. What is the probability of getting a sum of 7 ? The corresponding event A has #(A) = 6
elements. The probability is therefore P(A) = 6/36 = 1/6.
2. What is the probability of getting a sum of 11 ? A sum of 11 corresponds to the event

A = {(5, 6), (6, 5)}.    (206)

The number of elements in the event is #(A) = 2. The probability is therefore P(A) = 2/36 = 1/18.
3. What is the probability of getting a double one, i.e. snake eyes ? The event is made of the
set A = {(1, 1)}, with #(A) = 1. Its probability is P(A) = 1/36.
Answer of Exercise 1.3 (Mere's experiments) The two proofs are based on the fact that
one event and its complementary event satisfy P(A) + P(Ac) = 1. Since P(A) is complex
to compute directly, we compute instead P(Ac) and then use P(A) = 1 - P(Ac).
The event A is that, with four rolls of a die, at least one six turns up. The fact that a die is
rolled four times corresponds to the sample space

Ω = {i / i = 1, . . . , 6}^4.    (207)

The size of the sample space is #(Ω) = 6^4. To make the computation easier, we consider the
complementary event Ac, which is

Ac = {i / i = 1, . . . , 5}^4.    (208)

The size of Ac is #(Ac) = 5^4. The probability of event A is therefore P(A) = 1 - (5/6)^4 ≈ 0.5177469.
Since P(A) > 1/2, de Mere wins consistently.
De Mere claimed that, in 24 rolls of two dice, a pair of 6 would turn up (event B). The sample space
is now

Ω = {(i, j) / i, j = 1, . . . , 6}^24.    (209)

The size of the sample space is #(Ω) = 36^24. The complementary event is

Bc = {(i, j) / i, j = 1, . . . , 5, or (6, i) / i = 1, . . . , 5, or (i, 6) / i = 1, . . . , 5}^24.    (210)

The size of Bc is #(Bc) = (25 + 5 + 5)^24 = 35^24. The probability of event B is therefore P(B) =
1 - (35/36)^24 ≈ 0.4914039 < 1/2, which explains why de Mere loses consistently.
De Mere claimed that 25 rolls were necessary to make the game favorable (event C). The same
derivation leads to the probability P(C) = 1 - (35/36)^25 ≈ 0.5055315 > 1/2.
Can you compute with Scilab the probability for event A and a number of rolls equal to 1, 2,
3 or 4 ? The following Scilab session shows how to perform the computation.
-->i = 1:4
 i =
    1.    2.    3.    4.
-->1 -(5/6)^ i
 ans =
    0.1666667    0.3055556    0.4212963    0.5177469
Can you compute with Scilab the probability for the event B or C for a number of rolls equal to
10, 20, 24, 25, 30 ?
The following Scilab session shows how to perform the computation.
-->i = [10 20 24 25 30]
 i =
    10.    20.    24.    25.    30.
-->1 -(35/36)^ i
 ans =
    0.2455066    0.4307397    0.4914039    0.5055315    0.5704969
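The de Mere probabilities can also be reproduced in Python:

```python
# At least one six in 4 rolls of one die:
p4 = 1 - (5 / 6) ** 4
# At least one double six in 24 (resp. 25) rolls of two dice:
p24 = 1 - (35 / 36) ** 24
p25 = 1 - (35 / 36) ** 25
print(round(p4, 7), round(p24, 7), round(p25, 7))
```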
Answer of Exercise 1.4 (Independent events) Assume that Ω is a finite sample space.
Assume that the two events A, B are independent. Let us prove that

P(B|A) = P(B).    (211)

By definition of the conditional probability, we have

P(B|A) = P(B ∩ A) / P(A),    (212)

and, since A and B are independent, P(B ∩ A) = P(A)P(B), which implies P(B|A) = P(B).
Let us now prove that

P(∪_{i=1,n} Ei) ≤ \sum_{i=1,n} P(Ei).    (213)

For n = 2, we have

P(E1 ∪ E2) ≤ P(E1) + P(E2).    (214)

The inequality 214 is the result of proposition 1.7, which states that P(E1 ∪ E2) = P(E1) + P(E2)
- P(E1 ∩ E2). The inequality 214 can be deduced from the fact that P(E1 ∩ E2) ≥ 0, by definition of
a probability.
The inequality 213 is therefore true for n = 2. To finish the proof, we will use induction. Let us
assume that the inequality is true for n, and let us prove that it is true for n + 1. Let us denote
by Fn the set defined by

Fn = ∪_{i=1,n} Ei.    (215)

We must prove that P(∪_{i=1,n+1} Ei) = P(F_{n+1}) satisfies the inequality. We see that
F_{n+1} = E_{n+1} ∪ Fn, so that, by the inequality 214,

P(F_{n+1}) ≤ P(E_{n+1}) + P(Fn).    (216)

By the induction hypothesis,

P(Fn) = P(∪_{i=1,n} Ei) ≤ \sum_{i=1,n} P(Ei).    (217)

Hence

P(F_{n+1}) ≤ P(E_{n+1}) + \sum_{i=1,n} P(Ei) = \sum_{i=1,n+1} P(Ei),    (218)

which concludes the proof.
P (E|D)P (D)
P (E|D)P (D) + P (E|Dc )P (Dc )
0.99 0.005
=
0.99 0.005 + 0.01 0.995
0.3322
=
(219)
(220)
(221)
with 4 significant digits. This might be surprising, since the probability of a false positive is 1 %,
which is quite small.
Consider the case where 5 % of the population has the disease. Then

P(D|E) = (0.99 × 0.05) / (0.99 × 0.05 + 0.01 × 0.95)    (222)
≈ 0.8389    (223)
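The entries of the figures 25 and 26 below can be regenerated with a short Python function implementing the Bayes formula (the function name is an arbitrary illustration):

```python
def posterior(p_true_pos, p_false_pos, p_disease):
    """P(D|E): probability of the disease given a positive test."""
    num = p_true_pos * p_disease
    den = num + p_false_pos * (1 - p_disease)
    return num / den

print(round(posterior(0.99, 0.01, 0.005), 6))   # the 0.332215 case
print(round(posterior(0.99, 0.01, 0.05), 6))    # the 0.838983 case
```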
P(E|Dc)     P(D|E)
0.100000    0.047391
0.050000    0.090494
0.010000    0.332215
0.005000    0.498741
0.001000    0.832632
0.000500    0.908674
0.000100    0.980295
Figure 25: Probabilities of having the disease, given that the test is positive, with
P(E|D) = 0.99 and P(D) = 0.005. The bold data is the data presented in the text of
the exercise, which corresponds to P(E|Dc) = 0.01. The lower the probability of a
false positive, the more reliable a positive test.
P(D)        P(D|E)
0.500000    0.990000
0.100000    0.916667
0.050000    0.838983
0.010000    0.500000
0.005000    0.332215
0.001000    0.090164
0.000500    0.047188
Figure 26: Probabilities of having the disease, given that the test is positive, with
P(E|D) = 0.99 and P(E|Dc) = 0.01. The bold data is the data presented in the text
of the exercise, which corresponds to P(D) = 0.005. The rarer the disease, the less
reliable a positive test.
with 4 significant digits. This shows that when more people have the disease (which is certainly
not desirable), the probability is higher.
Consider the case where the probability of a false positive is 0.1 % (but keep the probability
of a true positive equal to 99 %). Then

P(D|E) = (0.99 × 0.05) / (0.99 × 0.05 + 0.001 × 0.95)    (224)
≈ 0.9811    (225)
with 4 significant digits. This shows that when the probability of a false positive is lower, the
probability of having the disease, given that the test is positive, is higher.
The previous results might be a little surprising, but they are the consequence of the false
positives. These results are presented in the figures 25 and 26, which present the probability
P(D|E) computed with varying parameters P(E|Dc) and P(D). The conclusion of these
experiments is that a false positive can reduce the reliability of the test if the disease is rare,
or if the probability of a false positive is high.
To make the previous computations clearer, consider the example where the population counts
1000 persons. Since 0.5 % of the population has the disease, this makes 0.005 × 1000 = 5 persons
who have the disease and 0.995 × 1000 = 995 who do not have the disease. From the 5 persons
who have the disease, there will be 0.99 × 5 = 4.95 persons who will have a positive test. Similarly,
from the 995 persons who do not have the disease, there will be 0.01 × 995 = 9.95 persons who will
have a positive test. Therefore, given that the test is positive, the probability that the person has
74
the disease is
4.95
= 0.3322.
4.95 + 9.95
5.2
(226)
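The arithmetic above is a direct application of the Bayes formula. As a cross-check, here is a small Python sketch; Python and the variable names are ours, the document itself works in Scilab.

```python
# Bayes formula for the disease test, with the data of the exercise.
# Variable names (p_d, p_e_given_d, p_e_given_dc) are ours, not from the text.
p_d = 0.005          # P(D): probability of having the disease
p_e_given_d = 0.99   # P(E|D): probability of a positive test given the disease
p_e_given_dc = 0.01  # P(E|Dc): probability of a false positive

# Total probability of a positive test: P(E) = P(E|D)P(D) + P(E|Dc)P(Dc)
p_e = p_e_given_d * p_d + p_e_given_dc * (1 - p_d)
# Bayes formula: P(D|E) = P(E|D)P(D) / P(E)
p_d_given_e = p_e_given_d * p_d / p_e
print(round(p_d_given_e, 4))  # 0.3322
```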
Answer of Exercise 2.1 (Recurrence relation of binomial ) Let us prove proposition 2.14, i.e. that, for integers n > 0 and 0 < j < n, the binomial coefficients satisfy

\binom{n}{j} = \binom{n-1}{j} + \binom{n-1}{j-1}.   (227)

The proof is based on the expansion of the binomial formula. By definition of the binomial number 152, we have

\binom{n-1}{j} = \frac{(n-1) \cdots ((n-1)-j+1)}{j(j-1) \cdots 1}   (228)
              = \frac{(n-1) \cdots (n-j)}{j(j-1) \cdots 1},   (229)

and

\binom{n-1}{j-1} = \frac{(n-1) \cdots ((n-1)-(j-1)+1)}{(j-1)(j-2) \cdots 1}   (230)
                 = \frac{(n-1) \cdots (n-j+1)}{(j-1)(j-2) \cdots 1}.   (231)

Therefore, the sum of the two terms is

\binom{n-1}{j} + \binom{n-1}{j-1} = \frac{(n-1) \cdots (n-j+1)}{(j-1)!} \left( \frac{n-j}{j} + 1 \right)   (232)
                                  = \frac{(n-1) \cdots (n-j+1)}{(j-1)!} \cdot \frac{n}{j}   (233)
                                  = \frac{n(n-1) \cdots (n-j+1)}{j!} = \binom{n}{j},   (234)

which concludes the proof.
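The recurrence can also be verified numerically; the following Python sketch is ours, not part of the original Scilab material.

```python
from math import comb

# Check the recurrence C(n, j) = C(n-1, j) + C(n-1, j-1) for a range of n, j
for n in range(2, 30):
    for j in range(1, n):
        assert comb(n, j) == comb(n - 1, j) + comb(n - 1, j - 1)
print("recurrence verified")
```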
for j = 1, . . . , 2^n . For each j = 1, . . . , 2^n , the set B_j is a subset of the set with n + 1 elements. We have found 2^n sets (A_j) and 2^n sets (B_j) which are distinct subsets of the set with n + 1 elements. The total number of subsets is therefore 2^n + 2^n = 2^{n+1} , which concludes the proof.
Answer of Exercise 2.3 (Probabilities of Poker hands) Why does computing the probability of a straight flush force us to take into account the probability of the royal flush ?
Consider the event where the 5 cards are in sequence with a single suit. This might be a straight flush, if the last card in the sequence is not an ace. If the last card is an ace, then the hand is not a straight flush anymore, since it is a royal flush. Therefore, the event which is associated with a straight flush is a sequence of cards with a single suit, but not a royal flush. When computing the probability of a straight flush, the simplest way is to compute the probability of the event "sequence of cards with a single suit", and to remove the probability of the royal flush.
Explain other possible conflicts between Poker hands.
The following is a list of conflicts between Poker hands.
A straight is a hand with 5 cards in sequence, where at least two cards have different suits. If the cards are in sequence and all have the same suit, this is not a straight anymore, this is a straight flush.
A double pair is when there are two pairs in the hand. But the two pairs must have different ranks, since, if the four cards all have the same rank, this is not a double pair anymore, this is a four of a kind.
A flush is a hand where all the cards have the same suit. But the cards must not be in sequence, since they would not form a flush anymore, but would form a straight flush or a royal flush.
Compute the probabilities for all Poker hands in figure 15.
The probability of a pair is computed as follows. There are 13 different ranks in the deck and, once the rank is chosen, there are \binom{4}{2} different pairs of this rank. Therefore, there are 13 \binom{4}{2} different pairs in the deck. The remaining 3 cards in the hand are chosen so that they have a different rank from the current pair. If not, there would be a three of a kind or even a four of a kind. Therefore, their ranks must be chosen among the remaining 12 ranks. There are \binom{12}{3} different sets of 3 cards which have different ranks. For each card, there are 4 different suits, so that there are \binom{12}{3} \cdot 4^3 different combinations for the remaining 3 cards in the hand. The probability of a pair is therefore

P(pair) = \frac{13 \binom{4}{2} \binom{12}{3} 4^3}{\binom{52}{5}} = \frac{1098240}{2598960} \approx 0.4225690.   (237)
To compute the probability of a double pair, we must take into account the fact that the two pairs must have different ranks. If not, that would be a four of a kind, instead of a double pair. This is why we begin by choosing two different ranks in the set of 13 ranks available. Once done, each pair can take two of the four suits. Therefore, the total number of double pairs in the first 4 cards is \binom{13}{2} \binom{4}{2}^2 . The 5th card must be chosen from the 11 remaining ranks (if not, one of the pairs would become a three of a kind ). Once the rank of the 5th card is chosen, it can have one of the 4 available suits. Therefore, there are 11 \cdot 4 different choices for the 5th card. In the end, the probability of a double pair is

P(double pair) = \frac{\binom{13}{2} \binom{4}{2}^2 \cdot 11 \cdot 4}{\binom{52}{5}} = \frac{123552}{2598960} \approx 0.0475390.   (238)
The probability of a three of a kind is computed as follows. There are 13 different ranks and there are \binom{4}{3} ways to choose 3 cards in a set of 4. Therefore, there are 13 \binom{4}{3} different ways to select the 3 first cards of the same rank. The last two cards must be chosen so that they do not create a full house or a four of a kind. There are \binom{12}{2} different ways to select two ranks in the set of the remaining 12 ranks. After the ranks are chosen, each of the 2 cards can have one of the 4 suits, so that there are 4^2 ways to select the suits. All in all, there are \binom{12}{2} \cdot 4^2 different ways to select the last 2 cards. In the end, the probability of a three of a kind is

P(three of a kind) = \frac{13 \binom{4}{3} \binom{12}{2} 4^2}{\binom{52}{5}} = \frac{54912}{2598960} \approx 0.0211285.   (239)
The probabilities of a four of a kind and of a full house have already been computed in the text.
The probability of a straight flush is computed as follows. The following is a list of the 10 possible sequences:

1 2 3 4 5, 2 3 4 5 6, 3 4 5 6 7, 4 5 6 7 8, 5 6 7 8 9, 6 7 8 9 10,
7 8 9 10 J, 8 9 10 J Q, 9 10 J Q K, 10 J Q K 1,

where all the cards take the same suit among the 4 possible suits ♣, ♦, ♥ and ♠. Therefore, the total number of such hands is 4 × 10 = 40. In order for these hands to be straight flushes, and not royal flushes, we must remove the 4 royal flushes. Finally, the total number of straight flushes is 4 × 10 − 4 and the probability of this hand is

P(straight flush) = \frac{4 \times 10 - 4}{\binom{52}{5}} = \frac{36}{2598960} \approx 0.0000139.   (240)
The probability of the royal flush is easy to compute since there are only 4 such hands in the deck. The probability of the royal flush is therefore

P(royal flush) = \frac{4}{\binom{52}{5}} = \frac{4}{2598960} \approx 0.0000015.   (241)
The probability of a straight is computed as follows. The following is a list of the 10 possible sequences:

1 2 3 4 5, 2 3 4 5 6, 3 4 5 6 7, 4 5 6 7 8, 5 6 7 8 9, 6 7 8 9 10,
7 8 9 10 J, 8 9 10 J Q, 9 10 J Q K, 10 J Q K 1,

where each card can have one of the 4 suits ♣, ♦, ♥ and ♠. Therefore, the total number of such hands is 4^5 \cdot 10. But we require that a straight is neither a straight flush, nor a royal flush, so that we have to remove these higher value hands. Using the previous computation for the straight flush, we get 4^5 \cdot 10 - 4 \cdot 10 different straights. Therefore, the probability of the straight is

P(straight) = \frac{4^5 \cdot 10 - 4 \cdot 10}{\binom{52}{5}} = \frac{10200}{2598960} \approx 0.0039262.   (242)
Name               Number     Probability
total              2598960    1.
no pair            1302540    0.5011774
pair               1098240    0.4225690
double pair        123552     0.0475390
three of a kind    54912      0.0211285
straight           10200      0.0039262
flush              5108       0.0019654
full house         3744       0.0014406
four of a kind     624        0.0002401
straight flush     36         0.0000139
royal flush        4          0.0000015
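The counts in the table above can be reproduced with integer arithmetic. The following Python sketch is ours; the document itself uses Scilab and its nchoosek function.

```python
from math import comb

deck = comb(52, 5)                                      # 2598960 possible hands
pair = 13 * comb(4, 2) * comb(12, 3) * 4**3             # 1098240
double_pair = comb(13, 2) * comb(4, 2)**2 * 11 * 4      # 123552
three_of_a_kind = 13 * comb(4, 3) * comb(12, 2) * 4**2  # 54912
straight = 4**5 * 10 - 4 * 10                           # 10200
straight_flush = 4 * 10 - 4                             # 36
royal_flush = 4

print(pair / deck)  # ≈ 0.4225690
```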
Answer of Exercise 2.4 The following Scilab function computes the probability of getting exactly one "6" in n tosses of a fair die.

// one6inNtoss --
// Computes the probability of getting exactly one "6" in n tosses of a fair die.
function bnpj = one6inNtoss ( n )
    p = 1/6
    q = 1 - p
    j = 1
    bnpj = nchoosek ( n , j ) * p ^ j * q ^ ( n - j )
endfunction
In the following session, we use the function one6inNtoss to compute the probability for n =
1, 2, . . . , 12.
--> for n = 1:12
--> b = one6inNtoss ( n );
--> mprintf ( " In %d toss , p ( one six )= %f \ n " ,n , b );
--> end
In 1 toss , p ( one six )=0.166667
In 2 toss , p ( one six )=0.277778
In 3 toss , p ( one six )=0.347222
In 4 toss , p ( one six )=0.385802
In 5 toss , p ( one six )=0.401878
In 6 toss , p ( one six )=0.401878
In 7 toss , p ( one six )=0.390714
In 8 toss , p ( one six )=0.372109
In 9 toss , p ( one six )=0.348852
In 10 toss , p ( one six )=0.323011
In 11 toss , p ( one six )=0.296094
In 12 toss , p ( one six )=0.269176
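The same computation can be cross-checked outside Scilab; this Python sketch is ours, mirroring the b(n, p, 1) formula used by one6inNtoss.

```python
from math import comb

def one6(n):
    # b(n, 1/6, 1): probability of exactly one "6" in n tosses of a fair die
    p, q = 1 / 6, 5 / 6
    return comb(n, 1) * p * q ** (n - 1)

print(round(one6(4), 6))  # 0.385802
```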
Answer of Exercise 2.5 (Probability of a flight crash) Assume that there are 20000 flights of airplanes each day in the world. Assume that there is one accident every 500 000 flights. What is the probability of getting exactly 5 crashes in 22 days ?
There are several answers to this question, depending on the accuracy required.
1. Flight by flight. The first approach is based on the analysis of a Bernoulli process, where each flight has a probability of crash.
2. Time decomposition. The second approach is based on the analysis of a Bernoulli process, where each day has a probability of crash.
3. Poisson approximation. The third approach is based on the Poisson approximation of the binomial distribution function.
This answer is based on the hypothesis that the flights are independent. Therefore, the process can be considered as a Bernoulli process where each flight has a crash probability equal to p = 1/500000. The number of steps in the Bernoulli process is equal to the number of flights n. In 22 days, the number of flights is n = 22 × 20000. The probability of getting exactly 5 crashes in 22 days is

P(exactly 5 crashes in 22 days) = \binom{22 \times 20000}{5} p^5 q^{22 \times 20000 - 5} \approx 0.0018241.   (248)
The figure 28 presents the results for various numbers of crashes.
What is the probability of at least 5 crashes in 22 days ? The probability of getting at least 5 crashes can be computed from the probability of getting 0, 1, 2, 3 or 4 crashes. Therefore,

P(at least 5 crashes in 22 days) = 1 - P(less than 5 crashes in 22 days)   (249)
                                 = 1 - \sum_{j=0,4} b(n, p, j)   (250)
                                 \approx 0.0021294.   (251)
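As a cross-check of the flight-by-flight and Poisson computations, here is a Python sketch (ours, not part of the original Scilab material):

```python
from math import comb, exp, factorial

n = 22 * 20000   # number of flights in 22 days
p = 1 / 500000   # crash probability for one flight

def b(n, p, j):
    # binomial distribution function b(n, p, j)
    return comb(n, j) * p**j * (1 - p)**(n - j)

exactly5 = b(n, p, 5)                             # flight by flight, ≈ 0.00182409
at_least5 = 1 - sum(b(n, p, j) for j in range(5)) # ≈ 0.0021294
lam = n * p                                       # Poisson parameter, 0.88 here
poisson5 = lam**5 / factorial(5) * exp(-lam)      # Poisson approximation, ≈ 0.0018241
```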
Event                  Probability
0 crashes in 22 days   0.41478255
1 crash in 22 days     0.36500937
2 crashes in 22 days   0.16060408
3 crashes in 22 days   0.04711041
4 crashes in 22 days   0.01036424
5 crashes in 22 days   0.00182409
6 crashes in 22 days   0.00026753
7 crashes in 22 days   0.00003363
8 crashes in 22 days   0.00000370
9 crashes in 22 days   0.00000036
10 crashes in 22 days  0.00000003

Figure 28: Crash probabilities for 22 days with one crash every 500 000 flights and 20 000 flights each day.
Event                  Probability
0 crashes in 42 days   0.18637366
1 crash in 42 days     0.31310838
2 crashes in 42 days   0.26301125
3 crashes in 42 days   0.14728625
4 crashes in 42 days   0.06186013
5 crashes in 42 days   0.02078494
6 crashes in 42 days   0.00581976
7 crashes in 42 days   0.00139674
8 crashes in 42 days   0.00029331
9 crashes in 42 days   0.00005475
10 crashes in 42 days  0.00000920

Figure 29: Crash probabilities for 42 days with one crash every 500 000 flights and 20 000 flights each day.
What is the probability of getting exactly 3 crashes in 42 days ? What is the probability of getting at least 3 crashes in 42 days ?
The same computations can be performed for 42 days, which represent 6 weeks. The figure 29 presents the results.
The probability of having exactly 3 crashes in 42 days is

P(exactly 3 crashes in 42 days) = \binom{42 \times 20000}{3} p^3 q^{42 \times 20000 - 3}   (252)
                                \approx 0.14728625.   (253)

The probability of having at least 3 crashes in 42 days is

P(at least 3 crashes in 42 days) = 1 - P(less than 3 crashes in 42 days)   (254)
                                 = 1 - \sum_{j=0,2} b(n, p, j)   (255)
                                 \approx 0.2375067,   (256)

where n = 42 × 20000.
We now present another approach for the computation of the same problem. The method is based on counting the number of crashes during a given time unit, for example one day. We consider that the process is a Bernoulli process, where each day is associated with the probability that at least one crash occurs. This is obviously an approximation, since it is possible that more than one crash occurs during one day. Since there is one crash every 500 000 flights, the probability that one flight has no crash is q = 499999/500000. By hypothesis, all flights are independent, therefore, the probability of getting no accident in one day is

P(no crash in 1 day) = \left( \frac{499999}{500000} \right)^{20000} \approx 0.9607893.   (257)

Hence, the probability that a given day contains at least one crash is

p = 1 - \left( \frac{499999}{500000} \right)^{20000} \approx 0.0392107.   (258)

The 22 days can then be considered as a Bernoulli process with 22 steps and success probability p, so that

P(exactly 5 crashes in 22 days) \approx \binom{22}{5} p^5 (1 - p)^{17}   (259)
                                \approx 0.0012366.   (260)
We see that the result is close, but different from the previous probability, which was equal to
0.00182409. In fact, the result depends on the time unit that we consider for our calculation.
Obviously, if we consider that the time unit is the half day, the formula is changed to

P(exactly 5 crashes in 22 days) \approx \binom{22 \times 2}{5} p^5 q^{22 \times 2 - 5}   (261)
                                \approx 0.0015155,   (262)

where

p = 1 - \left( \frac{499999}{500000} \right)^{20000/2}.   (263)
If we consider the hour as the time unit, we get P(exactly 5 crashes in 22 days) \approx 0.0017973.
The third approach is based on the fact that the binomial distribution function is closely approximated, when n is large, by the Poisson distribution function, that is

b(n, p, j) \approx \frac{\lambda^j}{j!} \exp(-\lambda),   (264)

where \lambda = np. Here, the parameter is equal to \lambda = 22 \times 20000/500000 = 0.88. The result is

P(exactly 5 crashes in 22 days) \approx \frac{\lambda^5}{5!} \exp(-\lambda)   (265)
                                \approx 0.0018241.   (266)
The three approaches are presented in figure 30. We can see that the flight by flight approach gives a result which is very close to the Poisson approximation, with 6 common digits. The various approaches based on time decomposition give different results, with only 1 common digit. We can check that smaller time units lead to results which are closer to the flight by flight approach.
Consider now that the year is a sequence of 16 periods of 22 days (ignoring the 13 days left in the year). What is the probability of having one period in the year which contains at least 5 crashes ?
Approach                Probability
Flight by flight        0.00182409
Time Unit = day         0.0012366
Time Unit = 1/2 day     0.0015155
Time Unit = hour        0.0017973
Poisson                 0.0018241

Figure 30: Probability of having exactly 5 crashes in 22 days with one crash every 500 000 flights and 20 000 flights each day - Different approaches.
This decomposition corresponds to 16 × 22 = 352 days (instead of the usual 365 days of a regular year), where the remaining 13 days are ignored. We want to compute the probability that one of the 16 periods contains at least 5 crashes. We have

P(one period contains at least 5 crashes) = 1 - P(all periods contain at most 4 crashes).   (267)

Since the periods are disjoint, the crashes in all 16 periods are independent, so that

P(all periods contain at most 4 crashes) = P(one period contains at most 4 crashes)^{16}.   (268)

Now the probability of having at most 4 crashes in one period is equal to

P(at most 4 crashes in 22 days) = 1 - P(at least 5 crashes in 22 days).   (269)

It has already been computed that the probability of having at least 5 crashes in one period of 22 days is

P(at least 5 crashes in 22 days) \approx 0.0021294,   (270)

so that

P(one period contains at least 5 crashes) \approx 1 - 0.9978706^{16}   (271)
                                          \approx 0.0335316,   (272)

which is approximately 3%. This is much larger than the original probability 0.0021294 \approx 0.2% of having at least 5 crashes in 22 days.
The approach presented here is still too simplified. Indeed, we should instead consider the probability that one period in the year contains at least 5 crashes, considering all possible periods of 22 days in the year. In this problem, the periods are not disjoint anymore, so that the computations performed earlier cannot be applied. This kind of problem involves a method called scan statistics which will not be presented here (see [9, 6]).
Answer of Exercise 2.6 (Binomial function maximum) Consider the discrete distribution function of a Bernoulli process, as defined by 2.16. Let us prove that

b(n, p, j) = \frac{p}{q} \cdot \frac{n-j+1}{j} \, b(n, p, j-1).   (276)

By definition, we have

b(n, p, j-1) = \binom{n}{j-1} p^{j-1} q^{n-j+1}   (277)

and

b(n, p, j) = \binom{n}{j} p^j q^{n-j}.   (278)

Moreover, the binomial coefficients satisfy

\binom{n}{j} = \frac{n-j+1}{j} \binom{n}{j-1}.   (279)

We now plug 279 into 278, use 277, and get 276, which concludes the proof.
Let us compute j_m \geq 1 so that b(n, p, j) is maximum, i.e. so that

b(n, p, j_m) \geq b(n, p, j),  1 \leq j \leq n.   (283)

To find j_m , we consider the equality 276, which states that the growth of b(n, p, j) with j depends on the factor \frac{p}{q} \cdot \frac{n-j+1}{j} . If this term is greater than 1, then increasing values of j will produce increasing values of b(n, p, j). Similarly, if this term is lower than 1, then increasing values of j will produce decreasing values of b(n, p, j). Therefore, the index j_m is the largest solution of the inequality

\frac{p}{q} \cdot \frac{n - j_m + 1}{j_m} \geq 1.   (284)

This is equivalent to

p(n - j_m + 1) \geq q j_m ,   (285)

since q > 0 and j_m > 0. We expand the previous inequality and get

pn - p j_m + p \geq (1 - p) j_m = j_m - p j_m ,   (286)

where the term -p j_m appears on both sides of the previous inequality. This can be simplified into

pn + p \geq j_m ,   (287)

i.e.

j_m \leq p(n + 1).   (288)

Therefore, the index j which makes the discrete distribution function b(n, p, j) maximum is

j_m = [p(n + 1)],   (289)

where the [\cdot] function is so that [x] is the largest integer lower than or equal to x.
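The mode formula can be checked numerically; this Python sketch is ours, and the sample (n, p) pairs are chosen so that p(n + 1) is not an integer (the case where the maximum is unique).

```python
from math import comb, floor

def b(n, p, j):
    # binomial distribution function b(n, p, j)
    return comb(n, j) * p**j * (1 - p)**(n - j)

# The mode of b(n, p, .) is jm = floor(p (n + 1))
for n, p in [(10, 0.5), (20, 0.3), (15, 0.1)]:
    jm = max(range(n + 1), key=lambda j: b(n, p, j))
    assert jm == floor(p * (n + 1))
print("mode formula verified")
```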
Consider the experiment presented in section 3.4, which consists in tossing a coin 10 times and counting the number of heads. With a Scilab simulation, can you find the number of heads which is the most likely to occur ?
The following function returns the probability of getting j heads in 10 tosses of a coin. It is
based on the same method presented in section 3.4, replacing the number of successes (which was
equal to 5) by the variable j.
83
function p = tossingcoin ( j )
rand ( " seed " ,0)
nb = 10000;
success = 0;
for i = 1: nb
faces = floor ( 2 * rand (1 ,10) );
nbheads = sum ( faces );
if ( nbheads == j ) then
success = success + 1;
end
end
p = success / nb
endfunction
By using this function with different values of j, we can easily determine the value of j which
maximizes this probability. The following session performs a loop for j = 0, 10 and prints out the
computed probability for each value of j.
--> for j = 0:10
--> p = tossingcoin ( j );
--> mprintf ( " P ( j = %d )= %f \ n " ,j , p );
--> end
P ( j =0)=0.000800
P ( j =1)=0.009700
P ( j =2)=0.043900
P ( j =3)=0.119600
P ( j =4)=0.206800
P ( j =5)=0.250700
P ( j =6)=0.200000
P ( j =7)=0.114400
P ( j =8)=0.043700
P ( j =9)=0.008800
P ( j =10)=0.001600
We see that j = 5 maximizes the probability, which corresponds to the fact that the most probable number of heads in 10 tosses of a coin is 5. Indeed, this corresponds to the value j_m = [p(n + 1)] = [11/2] = 5 that we just found by theory.
Answer of Exercise 2.7 (Binomial coefficients and Pascals triangle) Let a, b be two real numbers and let n be a positive integer. Let us prove the binomial theorem which states that

(a + b)^n = \sum_{j=0,n} \binom{n}{j} a^j b^{n-j}.   (290)

The proof is based on the expansion of the product

(a + b)^n = (a + b)(a + b) \cdots (a + b).   (291)

The first term of this product is a^n , the second term is a^{n-1} b, and so forth, until the last term b^n . The expansion can then be written as the sum of terms a^j b^{n-j} , with j = 0, . . . , n, where each term is associated with a coefficient that we have to compute. Consider the term a^j b^{n-j} and let us count the number of times that this term will appear in the expansion. This is equivalent to choosing j elements in a set of n elements. Indeed, the order of the elements does not count, since ab = ba. Therefore, each term a^j b^{n-j} will appear \binom{n}{j} times, which concludes the proof.
The binomial coefficients \binom{n}{j} can be written in a triangle, where each line corresponds to n :

L =
1.
1. 1.
1. 2. 1.
1. 3. 3. 1.
1. 4. 6. 4. 1.   (292)
Let us use the binomial theorem in order to prove that the sum of the terms in the n-th row is 2^n . We apply the binomial theorem with a = b = 1 and get

(1 + 1)^n = \sum_{j=0,n} \binom{n}{j}   (293)
          = 2^n .   (294)
Let us prove that if the terms are added with alternating signs, then the sum is zero. We apply the binomial theorem with a = -1 and b = 1 and we get

(-1 + 1)^n = \sum_{j=0,n} \binom{n}{j} (-1)^j (1)^{n-j}   (295)
           = (-1)^0 \binom{n}{0} + (-1)^1 \binom{n}{1} + \ldots + (-1)^n \binom{n}{n}   (296)
           = 0.   (297)

This equality proves that alternating the signs of the terms, beginning with a positive sign for the first term, leads to a zero sum. Additionally, we have \binom{n}{j} = \binom{n}{n-j} so that Pascals triangle has a symmetry property. If we use this symmetry property in the binomial expansion, we have

(a + b)^n = \sum_{j=0,n} \binom{n}{n-j} a^{n-j} b^j .   (298)

With a = -1 and b = 1, this leads to

0 = (-1)^n \binom{n}{n} + (-1)^{n-1} \binom{n}{n-1} + \ldots + (-1)^0 \binom{n}{0},   (299)

which proves that alternating the signs of the terms, beginning with the last term, also leads to a zero sum.
Binomial coefficients can also be represented in a matrix called Pascals matrix, where the binomial coefficients are stored in the anti-diagonals of the matrix. The following matrix is Pascals symmetric matrix of order 5 :

S =
1.  1.  1.  1.  1.
1.  2.  3.  4.  5.
1.  3.  6.  10. 15.
1.  4.  10. 20. 35.
1.  5.  15. 35. 70.   (302)

The following function computes Pascals lower triangular matrix of order n.
function c = pascallow ( n )
c = zeros (n , n );
for i = 1: n
c (i ,1: i ) = nchoosek (i -1 ,(1: i ) -1);
end
endfunction
In the following session, we use the pascallow function to check that we get the matrix presented
in 292.
--> pascallow (5)
ans =
1.   0.   0.   0.   0.
1.   1.   0.   0.   0.
1.   2.   1.   0.   0.
1.   3.   3.   1.   0.
1.   4.   6.   4.   1.
In order to compute Pascals symmetric matrix, some algebra is required so that the matrix elements S_{ij} are filled anti-diagonal by anti-diagonal, where an anti-diagonal is associated with a constant sum i + j. The following function computes Pascals symmetric matrix of order n.
function c = pascalsym ( n )
c = zeros (n , n );
for i = 1: n
c (i ,1: n ) = nchoosek ( i +(1: n ) -2 ,i -1);
end
endfunction
The following session shows a sample use of this function in order to check that we get the same
result as presented in equation 302.
--> pascalsym (5)
ans =
1.   1.   1.   1.   1.
1.   2.   3.   4.   5.
1.   3.   6.   10.  15.
1.   4.   10.  20.  35.
1.   5.   15.  35.  70.
We can additionally define Pascals upper triangular matrix as in the following function.
function c = pascalup ( n )
c = zeros (n , n );
for i = 1: n
c (i , i : n ) = nchoosek ( ( i : n ) -1 , i -1 );
end
endfunction
In the following session, we compute Pascals upper triangular matrix of order 5.
--> pascalup ( 5 )
ans =
1.   1.   1.   1.   1.
0.   1.   2.   3.   4.
0.   0.   1.   3.   6.
0.   0.   0.   1.   4.
0.   0.   0.   0.   1.
The lower, upper and symmetric Pascal matrices are related by the equality L U = S, as shown in the following session.

--> L = pascallow ( 5 );
--> U = pascalup ( 5 );
--> S = pascalsym ( 5 );
--> L * U - S
ans =
0.   0.   0.   0.   0.
0.   0.   0.   0.   0.
0.   0.   0.   0.   0.
0.   0.   0.   0.   0.
0.   0.   0.   0.   0.
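The identity L U = S can also be verified symbolically with exact integer arithmetic; this Python sketch is ours (the matrices are built directly from binomial coefficients, rather than with the Scilab functions above).

```python
from math import comb

n = 5
# Lower, upper, and symmetric Pascal matrices of order n:
# L[i][j] = C(i, j), U[i][j] = C(j, i), S[i][j] = C(i + j, i) (0-based indices)
L = [[comb(i, j) if j <= i else 0 for j in range(n)] for i in range(n)]
U = [[comb(j, i) if i <= j else 0 for j in range(n)] for i in range(n)]
S = [[comb(i + j, i) for j in range(n)] for i in range(n)]

# Matrix product L * U, entry by entry
LU = [[sum(L[i][k] * U[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
assert LU == S
print("L * U = S verified")
```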
Answer of Exercise 2.8 Let us prove the equality

\binom{2n}{n} = \sum_{j=0,n} \binom{n}{j}^2 .   (303)

To help ourselves to prove this result, we consider a set A with 2n elements, where n elements are red and n elements are blue. Let us compute the number of ways to choose n elements in this set. For example, we consider the case n = 3 so that the set is

A = \{R_1 , R_2 , R_3 , B_1 , B_2 , B_3\} .   (304)
We are searching for subsets A_i \subset A of size 3, where i = 1, . . . , i_{max} , and i_{max} is the positive integer to be computed. To organize our computation, we order the subsets depending on the number of red balls in the subset. The following is the list of all possible subsets of size 3 with no red element.
A1 = {B1 , B2 , B3 } .
(305)
The following is the list of all possible subsets of size 3 with 1 red element (and, therefore, 2 blue
elements).
A2 = {R1 , B1 , B2 } ,
A3 = {R2 , B1 , B2 } ,
A4 = {R3 , B1 , B2 } ,
(306)
A5 = {R1 , B2 , B3 } ,
A6 = {R2 , B2 , B3 } ,
A7 = {R3 , B2 , B3 } ,
(307)
A8 = {R1 , B1 , B3 } ,
A9 = {R2 , B1 , B3 } ,
A10 = {R3 , B1 , B3 } .
(308)
The other subsets can be computed with the same method, so that we finally find that there are, indeed, \binom{6}{3} = 20 subsets. Therefore, we write the term \binom{2n}{n} as the sum over the number of red elements in the subset. We have

\binom{2n}{n} = \sum_{j=0,n} C_j ,   (309)

where C_j is the number of subsets of A with n elements where j elements are red. There are \binom{n}{j} ways to choose the j red elements and \binom{n}{n-j} ways to choose the n-j blue elements, so that C_j = \binom{n}{j} \binom{n}{n-j} . But the symmetry property of the binomial coefficients states that \binom{n}{j} = \binom{n}{n-j} , which leads to

C_j = \binom{n}{j}^2 .   (310)

Finally, the previous equality can be plugged into 309 so that the equality 303 holds true, which concludes the proof.
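The identity proved above can be confirmed numerically; this Python sketch is ours, not part of the original text.

```python
from math import comb

# sum_{j=0..n} C(n, j)^2 = C(2n, n)
for n in range(0, 20):
    assert sum(comb(n, j)**2 for j in range(n + 1)) == comb(2 * n, n)
print("identity verified")
```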
Answer of Exercise 2.9 (Earthquakes and predictions) Assume that a person predicts the dates of major earthquakes (with magnitude larger than 6.5, or with a large number of deaths, etc.) in the world during 3 years, i.e. in a period of 1096 days. Assume that the specialist predicts 169 earthquake days. Assume that, during the same period, 196 major earthquakes really occur, and that 33 earthquakes were correctly predicted by the specialist. What is the probability that the earthquakes are predicted by chance ?
We consider the set of m = 1096 days, where k = 196 days are earthquake days and m - k = 1096 - 196 = 900 are not. In this set, we pick n = 169 days, among which x = 33 days are earthquake days.
The probability of selecting x earthquake days is given by the hypergeometric distribution function defined by

P(X = x) = h(x, m, k, n) = \frac{\binom{k}{x} \binom{m-k}{n-x}}{\binom{m}{n}} .   (311)

In our particular situation, we have

P(X = 33) = \frac{\binom{196}{33} \binom{1096-196}{169-33}}{\binom{1096}{169}}   (312)
          \approx 0.0705625,   (313)

which is approximately 7 %.
To know if the prediction is based on chance, we compute, with Scilab, all the probabilities for x = 0, 1, . . . , 169. The following script computes the required probability and draws the plot which is presented in figure 31.
// The number of days in three years
m = 1096
// The number of days selected
n = 169
// The number of earthquake days in three years
k = 196
// The number of earthquake days selected
x = 33
// The probability of picking 169 days , where 33 are earthquakes .
p = hygepdf ( x , m , k , n )
// Plot the distribution
xdata = zeros (1 , n +1);
pdata = zeros (1 , n +1);
for i = 0: n
xdata ( i +1) = i ;
pdata ( i +1) = hygepdf ( i , m , k , n );
end
plot ( xdata , pdata )
f = gcf ();
f . children . title . text = " Probability of random predictions " ;
f . children . x_label . text = " Number of earthquakes " ;
f . children . y_label . text = " Probability " ;
Obviously, the prediction of the specialist appears to be compatible with a purely random choice, since the number of correctly predicted earthquake days is close to the maximum of the probability distribution.
Figure 31: Probability of having x earthquake days while choosing n = 169 days
from m = 1096 days, where k = 196 days are earthquake days.
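The hypergeometric probability can also be evaluated with exact integer arithmetic; this Python sketch is ours, and its hygepdf function only mirrors the parameterization used in the Scilab script above.

```python
from math import comb

def hygepdf(x, m, k, n):
    # hypergeometric distribution: h(x, m, k, n) = C(k, x) C(m-k, n-x) / C(m, n)
    return comb(k, x) * comb(m - k, n - x) / comb(m, n)

# parameters of the earthquake exercise
p33 = hygepdf(33, 1096, 196, 169)
print(p33)  # the text reports approximately 0.0705625
```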
Answer of Exercise 2.10 (Log-factorial function) The following Scilab implementation is adapted from a Matlab source code written by John Burkardt [3]. The following factoriallog_sum function uses the sum and log functions to perform a fast computation of fln (n) = log(n!) from equation 193.
function value = factoriallog_sum ( n )
value = sum ( log (2: n ));
endfunction
The previous implementation has the advantage of relying only on the logarithm function. But the factoriallog_sum function does not take matrix input arguments. Another issue is that more memory is required, since the vector log(2:n) must be created, making the computation intractable for values of n larger than 10^5 . Finally, the function might generate inaccurate results for large values of n, because of the accumulation of rounding errors in the sum.
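The sum-of-logarithms approach can be cross-checked against a log-gamma implementation, since log(n!) = lgamma(n + 1); this Python sketch is ours.

```python
from math import log, lgamma

def factoriallog_sum(n):
    # log(n!) computed as log(2) + log(3) + ... + log(n)
    return sum(log(k) for k in range(2, n + 1))

# Cross-check against the log-gamma function: lgamma(n + 1) = log(n!)
assert abs(factoriallog_sum(100) - lgamma(101)) < 1e-9
print("log-factorial verified")
```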
References
[1] M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions with
Formulas, Graphs, and Mathematical Tables. Dover Publications Inc., 1972.
[2] George E. Andrews, Richard Askey, and Ranjan Roy. Special Functions. Cambridge University Press, Cambridge, 1999.
[3] John Burkardt. Probability density functions. http://people.sc.fsu.edu/
~burkardt/m_src/prob/prob.html.
[4] Georges Charpak and Henri Broch. Devenez sorciers, devenez savants. Odile
Jacob, 2002.
Index
Bernoulli, 30
binomial, 25
combination, 25
combinatorics, 11
complementary, 2
conditional
distribution function, 9
probability, 9
disjoint, 2
event, 3
factorial, 13, 21
fair die, 8
gamma, 15
grand, 34
intersection, 2
outcome, 3
permutation, 12
permutations, 23
poker, 28
rand, 34
random, 3
rank, 28
sample space, 3
seed, 35
subset, 2
suit, 28
tree diagram, 12
uniform, 8
union, 2
Venn, 2