Documente Academic
Documente Profesional
Documente Cultură
Abstract
We study the probability distribution of the biggest gap between
two consecutive choices in an m-tuple of distinct integers, assuming the
m-tuple is chosen uniformly from within the range 1, . . . , n > m, or, in
other words, in the winning set of an m/n lottery game. We further
study the asymptotic behavior of its mean and variance.
1 Introduction
The game of lottery is popular the world over. In the m/n lottery game, the
player chooses m integers from among the integers 1, . . . , n > m, the order of
choice being unimportant; the lottery organizers choose publicly m numbers
uniformly at random in the same way, and if they turn out to be the same
numbers as the ones the player chose, the player wins. The specific values
of m and n vary from lottery to lottery, but m = 6 and n = 49 appears to
be a popular choice for many national lotteries. In practice, the lottery is
hardly ever a win/lose game, as various lesser winning prizes are commonly
awarded to the player according to how many chosen numbers the player and
the organizers have in common.
The media usually publish the winning set of numbers, along with (simple)
statistics on the number of times each particular number from 1 to n has
appeared in the winning set. The lottery is, however, an opportunity to carry
out much more sophisticated statistical/probabilistic studies, such as the ones
1
The author is also affiliated with the School of Electronic, Electrical & Mechanical
Engineering, University College Dublin, Ireland.
2726 K. Drakakis
2 The result
Let us choose an m-tuple (X1 , . . . , Xm ) uniformly randomly out of the m-
tuples in the range 1, . . . , n > m; without loss of generality we assume Xi < Xj
whenever i < j. For a given integer k, we define the event Xi+1 −Xi ≥ k by Ai ,
i = 1, . . . , m−1. We define the new random variable Y = max (Xi+1 −Xi ),
i=1,... ,m−1
and we seek p(k, m, n) = P(Y = k).
m−1
We first observe that there are ways to choose l out of m − 1 indices.
l
In order to compute N(Ai1 ∩ . . . ∩ Ail ), we note that, for any set of choices
satisfying the event Ai1 ∩ . . . ∩ Ail , choices i1 , . . . , il are each followed by at
least k − 1 integers, none of which lies among our choices. An equivalent way
then for making such a choice is to consider n − l(k − 1) “white balls” ordered
on a line, choose m among them at random, insert k − 1 “black balls” in the
line after each one of the choices i1 , . . . , il , and finally drop the colors and
order the balls consecutively from 1 to n. It follows that
n − l(k − 1)
N(Ai1 ∩ . . . ∩ Ail ) = , (3)
m
m−1
m − 1 n − l(k − 1)
N(∪m−1
i=1 Ai ) = l−1
(−1) (4)
l=1
l m
and that
P (k, m, n) = P(∪m−1
i=1 Ai ) =
m−1
m − 1 n − l(k − 1) n
= (−1)l−1 . (5)
l=1
l m m
n
In the expressions above we follow the usual convention that = 0 if
m
n < m. The following facts are easily verified:
• The nonzero
terms in
the sum for a given k > 1 correspond to 1 ≤ l ≤
n−m
min m − 1, and to 1 ≤ l ≤ m − 1 for k = 1.
k−1
p(k, m, n) = P (k, m, n) − P (k + 1, m, n) =
m−1
m − 1 n − l(k − 1) n − lk n
= (−1)l−1 − (6)
l=1
l m m m
(n − lk − 1)!
≈ (n − lk)m−1
+ , (9)
(n − lk − m)!
nm+1 (n − m)!
≈ n, (12)
n!
valid for n m, we finally obtain
n (−1)l−1 m − 1 n 1
m−1 m−1
μ≈ = . (13)
m + 1 l=1 l l m + 1 l=1 l
On the maximal distance between consecutive choices 2729
m m
m l n−l
m m m l n−l
(x + y) = xy =y + xy ⇔
l=0
l l=1
l
m
(x + y)m − y m m l−1 n−l
= x y . (14)
x l=1
l
2n2 m − 1 (−1)l−1
m−1
2
E(X ) ≈ =
(m + 1)(m + 2) l=1 l l2
2n2 1
m−1 i
1
= . (18)
(m + 1)(m + 2) i=1 i j=1 j
2730 K. Drakakis
The asymptotic forms for μ and σ we have obtained are still rather com-
plicated, but simpler forms can be obtained if we focus on large values of m
(namely assuming that m 1), by simplifying the discrete sums appearing in
the two expressions (note that the current expressions are accurate for all m,
as long as n m). We will make use of the well known asymptotic expansion
[4]
n
1 1 1 1
H(n) = = ln(n) + γ + − 2
+ + O(n−6 ), (20)
i=1
i 2n 12n 120n4
but, in order to simplify σ 2 , we also need an asymptotic expression for the sum
H(i)
m−1
:
i=1
i
m−1
H(i) ln(i) γ
m−1
1 1 1
= + + 2− + + ... ≈
i=1
i i=1
i i 2i 12i3 120i5
m−1
ln(i) ln(i)
m−1
1
≈ + γH(m − 1) + α ≈ + γ(γ + ln(m − 1)) + α, (22)
i=1
i i=2
i 2
where
1 1
α = ζ(2) − ζ(3) + ζ(5) + . . . (23)
6 60
On the maximal distance between consecutive choices 2731
m−1
ln(i) ln(i + 1)
m−1 m−2
ln(x + 1) 1 ln(m)
= ≈ dx − =
i=2
i i=1
i+1 0 x+1 2 m
1 2 ln(m)
= ln (m) − , (24)
2 m
n2 n2
σ2 ≈ (α + γ 2
) ≈ 1.795 . (27)
(m + 1)2 (m + 1)2
To sum up,
• the asymptotic behavior of its mean and variance is given by (13) and
(19), respectively, assuming n m;
Figure 1 shows how the mean and the standard deviation of Y , along with
their asymptotic estimates (21) and (27), vary with m and n. For fixed m and
varying n, the estimates are linear in n and are very good approximations of
the actual quantities, only off by a small constant. For fixed n and varying
m, however, we verify that the estimates are accurate only as long as m n;
beyond this, the curves diverge.
2732 K. Drakakis
60 18
Mean value Standrad deviation
Estimation 16 Estimation
50
14
40 12
10
30
8
20 6
4
10
2
0 0
0 20 40 60 80 100 120 140 160 0 20 40 60 80 100 120 140 160
n n
20 12
Mean value Standard deviation
18 Estimation Estimation
10
16
14
8
12
10 6
8
4
6
4
2
2
0 0
0 10 20 30 40 50 0 10 20 30 40 50
m m
Figure 1: Mean value (left) and variance (right) of the probability distribution
of Y along with the estimates (13) and (19) for m = 6 and 6 ≤ n ≤ 150 (top
row) and for n = 50 and 2 ≤ m ≤ 50 (bottom row).
which grows less and less accurate as the product kl approaches n, we obtain
min(m−1,n/k)
m l−1 m − 1 −lk(m−1)/n
p(k, m, n) ≈ (−1) l e . (29)
n l=0
l
On the maximal distance between consecutive choices 2733
In order to sum this in closed form, we use m − 1 as the upper limit in the
sum independently of k (which is a rather extreme approximation step):
min(m − 1, n/k) ≈ m − 1 for all k. (30)
Revisiting the binomial identity in (14), using m − 1 in place of m and differ-
entiating with respect to x yields
m − 1
m−1
m−2
(m − 1)(x + y) = lxl−1 y m−1−l . (31)
l=1
l
The estimate for p agrees with the cruder previous estimate (37), with the
exception of the factor involving the harmonic numbers.
Figure 2 compares the various distributions for Y we have proposed so far,
namely the exact distribution (6), the approximation (33), and the negative
binomial fit (39), for two sets of parameters: for the first set n = 150 and m = 6
we see that the binomial fit is almost identical to the exact distribution, while
for the second set n = 50 and m = 10 we see that the binomial fit is a good
but not perfect approximation. In both cases, the approximation (33) is very
far from the exact distribution, though its peak (most likely value) is quite
accurate, as we saw above.
On the maximal distance between consecutive choices 2735
0.02 0.1
0.08
0.015
0.06
0.01
0.04
0.005
0.02
0.04
n=90, m=5
Histogram
0.035
0.03
0.025
0.02
0.015
0.01
0.005
0
0 10 20 30 40 50 60 70 80 90
k
Figure 3: Comparison of the histogram of Y over the winning sets of the Italian
lottery, which uses n = 90 and m = 5, versus the theoretical distribution of Y
for this parameters as given by (6).
P(Y ≥ K, Z ≥ k) =
m−1
m − 1 n − (m − 1)(k − 1) − l(K − k) n
= (−1)l−1 , (43)
l m m
l=1
P(Y ≥ K, Z ≥ k, X1 ≥ s, Xm ≤ S) =
= P(Y ≥ K, Z ≥ k|X1 ≥ s, Xm ≤ S)P(X1 ≥ s, Xm ≤ S). (46)
On the maximal distance between consecutive choices 2737
P(Y ≥ K, Z ≥ k|X1 ≥ s, Xm ≤ S) =
m−1
m−1 S −s+1−(m−1)(k−1)−l(K −k) n
= (−1)l−1 . (47)
l=1
l m m
6 Conclusion
Assuming (X1 , . . . , Xm ) is an integer m-tuple without repeated entries chosen
uniformly from within the range 1, . . . , n, so that n ≥ m, and so that Xi < Xj
whenever i < j, what is the probability distribution of Y = max (Xi+1 −
i=1,... ,m−1
Xi )? We found the exact form of this distribution in the form of a sum involv-
ing binomial coefficients, which we were unable to simplify further. In terms
of the popular game of chance known as the lottery, we seek the maximal dis-
tance between two consecutive choices in the winning set of numbers of the
lottery (assuming the choices are sorted increasingly and that the winning set
is chosen uniformly randomly).
We then turned our attention to the calculation/estimation of the mean
value and standard deviation of this probability distribution. We were suc-
cessful in finding asymptotic formulas valid for n m, both linear in n,
which we simplified even further under the assumption that n m 1. We
also attempted to find a simplified expression approximating the probability
distribution, and some asymptotic manipulation suggested that the standard
negative binomial distribution might be a good candidate: fitting the negative
binomial with the correct mean and standard deviation, we observed that the
fit is indeed good as long as n m.
We tested our result against real lottery data collected from the website
of the Italian national lottery: the resulting histogram matched perfectly the
2738 K. Drakakis
distribution we derived. This result can be used to test for tampering with
lottery results, based on the law of large numbers, along the lines we followed
in [3]. We finally formulated some joint probability distributions involving Y
and Z = min (Xi+1 − Xi ).
i=1,... ,m−1
References
[1] Ch. Charalambides. “Enumerative combinatorics.” Chapman &
Hall/CRC, 2002.
[6] N. Henze and H. Riedwyl. “How to win more: Strategies for increasing a
lottery win.” A.K. Peters, Natick, Massachusetts 1998.
[7] L. Holst. “On discrete spacings and the Bose-Einstein distribution.” Con-
tributions to probability and statistics, Hon. G. Blom, 1985, pp. 169–177.