Confidence interval is abbreviated as CI. In this new article (part of our series on robust techniques for automated data science) we describe an implementation both in Excel and Perl, and discuss our popular model-free confidence interval technique, shared as part of our (open source) intellectual property. This technique has the following advantages: it is model-free, it works well even when your data is not normally distributed, and it is simple enough to be computed and understood by non-experts.
This is part of our series on data science techniques suitable for automation,
usable by non-experts. The next one to be detailed (with source code) will be
our Hidden Decision Trees.
Figure 1: Confidence bands based on our CI (bold red and blue curves) - Comparison with traditional normal model (light red and blue curves)
Figure 1 is based on simulated data that does not follow a normal distribution: see section 2 and Figure 2 in this article. Classical CI's are based on just two parameters: the mean and the variance. With the classical model, all data sets with the same mean and the same variance have the same CI's. By contrast, our CI's are based on k parameters - average values computed on k different bins - see the next section for details. In short, they are better predictive indicators when your data is not normal. Yet they are so easy to understand and compute that you don't even need Probability 101 to get started. The attached spreadsheet and Perl scripts have all computations done for you.
1. General Framework
We assume that we have n observations from a continuous or discrete variable. Re-shuffle the observations (see the reshuffling step below), split them into k bins of equal size, compute the average in each bin, and sort these k bin averages in increasing order, denoting them p(1), ..., p(k). Pick an integer m (1 ≤ m ≤ k/2, with p(1) being the minimum average). Then our CI is defined as follows: the lower bound is p(m), the upper bound is p(k+1-m), and the confidence level is the proportion of bin averages falling between these two bounds. For instance, with k = 100 and m = 3, the CI runs from the third smallest to the third largest bin average. With probability equal to the confidence level, a new bin average (computed on a new batch of observations from the same data set) will be between the lower and upper bounds of the CI. Note that this method produces asymmetrical CI's. It is equivalent to designing percentile-based confidence intervals.
If you can't find m and k to satisfy level = 0.95 (say), then compute a few CI's (with different values of m) with confidence levels close to 0.95, and interpolate or extrapolate the lower and upper bounds to get a CI with a 0.95 confidence level. The concept is easy to visualize if you look at Figure 1. Also, do proper cross-validation: split your data in two; compute CI's using the first half, and test them on the other half, to check whether they still hold (same confidence level, and so on).
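As a small illustration of this interpolation step, here is a minimal Perl sketch; the two candidate CI's, their levels and their bounds are made-up placeholders, not values from the article or spreadsheet.

use strict;
use warnings;

# Two hypothetical CI's obtained with different values of m (placeholder numbers).
my %ci_a = (level => 0.92, lower => 11.8, upper => 13.4);
my %ci_b = (level => 0.96, lower => 11.5, upper => 13.9);

# Linearly interpolate the bounds to reach the target 0.95 level.
my $target = 0.95;
my $w = ($target - $ci_a{level}) / ($ci_b{level} - $ci_a{level});
my $lower = $ci_a{lower} + $w * ($ci_b{lower} - $ci_a{lower});
my $upper = $ci_a{upper} + $w * ($ci_b{upper} - $ci_a{upper});

printf "Interpolated %.2f-level CI: [%.3f, %.3f]\n", $target, $lower, $upper;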
CI's are extensively used in quality control, to check whether a batch of new products (say, batteries) has failure rates, lifetimes or other performance metrics that are acceptable. Or to check whether wine advertised with a 12.5% alcohol content has an actual alcohol content reasonably close to 12.5% in each bottle. A CI attached to a predictive score also gives an indication about how accurate the score is. Narrow CI's correspond to data that is well understood, with all sources of variance well explained. Conversely, wide CI's mean lots of noise and high individual variance in the data. Finally, if your data is stratified in multiple heterogeneous segments, compute separate CI's for each stratum.
The rest of this article illustrates this CI concept, as well as the concept of hypothesis testing (derived from CI's) explained below in section 3.
When Big Data is Useful
If you look closely at Figure 1, it's clear that you can't compute accurate CI's with a high (above 0.99) level with just a small sample and (say) k=100 bins. The higher the level, the more volatile the CI. Typically, a 0.999-level CI requires a much larger sample and many more bins. Such high-level CI's are needed especially in the context of assessing failure rates, food quality, fraud detection or sound statistical litigation. There are ways to work with much smaller samples by combining two tests; see section 3.
An advantage of big data is that you can create many different combinations of
k bins (that is, test many values of m and k) to look at how the confidence
bands in Figure 1 change depending on the bin selection - even allowing you to
create CI's for these confidence bands, just like you could do with Bayesian
models.
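As a rough, self-contained illustration of this idea (the simulated data and the (k, m) values below are placeholders, not the article's actual experiment):

use strict;
use warnings;
use List::Util qw(shuffle);

# Placeholder data: 10,000 values from an arbitrary non-normal distribution.
my @data = map { rand() < 0.5 ? rand(10) : 10 + rand(30) } (1 .. 10_000);

# Try several (k, m) combinations and watch how the CI bounds move.
for my $k (50, 100, 200) {
    for my $m (1, 2, 5) {
        my @shuffled = shuffle(@data);
        my (@sum, @count);
        for my $i (0 .. $#shuffled) {
            $sum[$i % $k]   += $shuffled[$i];
            $count[$i % $k] += 1;
        }
        my @avg = sort { $a <=> $b } map { $sum[$_] / $count[$_] } (0 .. $k - 1);
        printf "k=%3d  m=%d  lower=%.3f  upper=%.3f\n", $k, $m, $avg[$m - 1], $avg[$k - $m];
    }
}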
The first step is to re-shuffle your data to make sure that your observations are in perfect random order: read the A New Big Data Theorem section in this article for an explanation of why reshuffling is necessary (look at the second theorem). In short, you want to create bins that have the same mix of values: if the first half of your data set consisted of negative values, and the second half of positive values, you might end up with bins filled with only positive or only negative values. You don't want that; you want each bin to be well balanced.
Reshuffling Step
Unless you know that your data is in an arbitrary order (this is the case most of the time), you should re-shuffle it: attach a random number to each observation, then sort the observations by that random number. If you do this in Excel, transform the output of RAND() to make sure that all random numbers are integers with the same number of digits, so that an accidental alphabetical sort still yields the correct numerical order. Sorting numbers alphabetically (without knowing it) is a source of many bugs in software engineering; this little trick helps you avoid this problem. If the order in your data set is very important, just add a column that has the original rank attached to each observation (in your initial data set), and keep it through the re-shuffling process (after each observation has been assigned to a bin), so that you can always recover the original order if necessary, by sorting back according to this extra column.
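A minimal Perl sketch of this reshuffling step (the file names and the one-value-per-line layout are assumptions made for illustration); it attaches the original rank to each observation so the initial order can always be recovered:

use strict;
use warnings;
use List::Util qw(shuffle);

# Read one observation per line from a hypothetical input file.
open(my $in, '<', 'data.txt') or die "Cannot open data.txt: $!";
chomp(my @data = <$in>);
close($in);

# Pair each observation with its original rank, then shuffle the pairs.
my @records = map { { rank => $_ + 1, value => $data[$_] } } (0 .. $#data);
@records = shuffle(@records);

# Write the reshuffled data, keeping the original rank as an extra column,
# so the initial order can be recovered by sorting on that column.
open(my $out, '>', 'data_shuffled.txt') or die "Cannot write data_shuffled.txt: $!";
print $out "$_->{rank}\t$_->{value}\n" for @records;
close($out);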
The Spreadsheet
Download the Excel spreadsheet. Figures 1 and 2 are in the spreadsheet, as well as all CI computations, and more. The spreadsheet illustrates several not so well known but useful analytic Excel functions, such as FREQUENCY, PERCENTILE and RANK. All computations are in the Confidence Intervals tab. You can modify the data in column B, and all CI's will automatically be re-computed. Beware if you change the number of bins (cell F2): this can screw up the RANK function in column J (some ranks will be missing) and then screw up the CI's. For other examples of great spreadsheets (from a tutorial point of view), check the Excel section in our data science cheat sheet.
Simulated Data
The simulated data in our Excel spreadsheet (see the data simulation tab) represents a mixture of two uniform distributions, driven by the parameters in the orange cells F2, F3 and H2. The 1,000 original simulated values (see Figure 2) were stored in column D, and were subsequently hard-copied into column B in the Confidence Interval (results) tab (they still reside there), because otherwise, each time you modify the spreadsheet, new deviates produced by the RAND Excel function are automatically generated, changing everything and making the results impossible to reproduce. This data provides a great example of data that causes big problems with traditional statistical science, as described in the following subsection.
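Outside of Excel, a similar data set can be simulated in a few lines of Perl; the mixture weight and the ranges of the two uniform components below are placeholders, not the actual values stored in cells F2, F3 and H2.

use strict;
use warnings;

# Placeholder parameters for a mixture of two uniform distributions.
my $p      = 0.7;        # probability of drawing from the first component
my @range1 = (0, 10);    # first component:  U(0, 10)
my @range2 = (50, 100);  # second component: U(50, 100)

# Generate 1,000 simulated values, one per line (like column D in the spreadsheet).
for (1 .. 1000) {
    my ($lo, $hi) = rand() < $p ? @range1 : @range2;
    print $lo + rand($hi - $lo), "\n";
}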
The traditional approach relies only on the mean and the variance in your data sets: you assume that the bin averages follow a Gaussian distribution, and then compute intervals that contain 99%, 95%, or 90% of all the scaled averages: these are your standard Gaussian CI's. Interestingly, applying this classical framework to our simulated data (see the formulas and comments in all cells in columns S and T) leads to similar CI's. Indeed, traditional CI's have been designed for the mean, while ours are designed for bin averages (that is, batch averages in quality control), or even individual values.
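For comparison, here is a sketch of the textbook normal-approximation CI for a bin average; this is the standard formula, not necessarily the exact computation performed in columns S and T of the spreadsheet.

use strict;
use warnings;

# Classical Gaussian CI for a bin average: mean +/- z * sd / sqrt(bin size).
sub gaussian_ci {
    my ($data, $bin_size, $z) = @_;    # z = 1.645, 1.96 or 2.576 for 90%, 95%, 99%
    my $n    = scalar @$data;
    my $mean = 0; $mean += $_ / $n for @$data;
    my $var  = 0; $var  += ($_ - $mean) ** 2 / ($n - 1) for @$data;
    my $se   = sqrt($var / $bin_size); # standard deviation of one bin average
    return ($mean - $z * $se, $mean + $z * $se);
}

my @sample = map { rand(100) } (1 .. 1000);   # placeholder data
my ($lo, $hi) = gaussian_ci(\@sample, 10, 1.96);
printf "Classical 95%% CI for an average of 10 values: [%.3f, %.3f]\n", $lo, $hi;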
Perl Code
Here's some simple source code to compute CI for given m and k:
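The original script is not reproduced here; the sketch below is a minimal reconstruction under the assumptions stated in section 1: data read from a file with one numeric observation per line, observations re-shuffled and assigned to k bins, and the CI taken as [p(m), p(k+1-m)] where p(1) ≤ ... ≤ p(k) are the sorted bin averages.

#!/usr/bin/perl
# Usage: perl ci.pl <data_file> <k> <m>
use strict;
use warnings;
use List::Util qw(shuffle);

my ($file, $k, $m) = @ARGV;
die "Usage: perl ci.pl <data_file> <k> <m>\n" unless defined $m;

open(my $fh, '<', $file) or die "Cannot open $file: $!";
chomp(my @data = grep { /\S/ } <$fh>);
close($fh);

# Reshuffling step: put the observations in random order.
@data = shuffle(@data);

# Assign observations to the k bins (round-robin) and compute the bin averages.
my (@sum, @count);
for my $i (0 .. $#data) {
    $sum[$i % $k]   += $data[$i];
    $count[$i % $k] += 1;
}
my @avg = sort { $a <=> $b } map { $sum[$_] / $count[$_] } (0 .. $k - 1);

# CI bounds: p(m) and p(k+1-m) among the sorted bin averages.
my $lower = $avg[$m - 1];
my $upper = $avg[$k - $m];

# Approximate level, taken here as the fraction of bin averages strictly
# inside the bounds (a convention assumed for this sketch).
my $level = ($k - 2 * $m) / $k;

printf "Lower bound: %f\nUpper bound: %f\nApprox. level: %f\n", $lower, $upper, $level;

For example, perl ci.pl data.txt 100 3 would, under these assumptions, print the CI based on the third smallest and third largest of 100 bin averages.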
3. Hypothesis Testing
If you don't like traditional statistical tests (some of them bound to become extinct) that nobody but statisticians understands, here is an easy way to perform statistical tests. The method below is part of what we call rebel statistical science.
Let's say that you want to test, with 99.5% confidence (level = 0.995), whether a wine producer lies about the alcohol content of its wine (advertising 12.5% when indeed it is 13%), maybe to save some money. The test to perform is as follows: check out 100 bottles from various batches, and compute a 0.995-level CI for alcohol content. Is 12.5% between the upper and lower bounds? Note that you might not be able to get an exact 0.995-level CI if your sample size n is too small (say n=100); you will have to extrapolate from lower-level CI's. The reason to use a high confidence level here is to give the defendant the benefit of the doubt, rather than wrongly accusing him based on a too small confidence level. If 12.5% is found inside even a small 0.50-level CI (which will be the case if the wine is truly 12.5% alcohol), then a fortiori it will be inside a 0.995-level CI, because these CI's are nested (see Figure 1 to understand these ideas). Likewise, if the wine truly has a 13% alcohol content, a tiny 0.03-level CI containing the value 13% will be enough to prove it.
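A minimal Perl sketch of this test (the measured alcohol contents are simulated placeholders, and the CI is computed with the same bin-average construction as in the Perl Code section):

use strict;
use warnings;
use List::Util qw(shuffle);

# Placeholder measurements: alcohol content (in %) of 100 bottles.
my @alcohol = map { 12.9 + 0.2 * (rand() - 0.5) } (1 .. 100);

# Bin-average CI, as in the Perl Code section: k bins, bounds p(m) and p(k+1-m).
my ($k, $m) = (20, 1);
my @shuffled = shuffle(@alcohol);
my (@sum, @count);
for my $i (0 .. $#shuffled) {
    $sum[$i % $k]   += $shuffled[$i];
    $count[$i % $k] += 1;
}
my @avg = sort { $a <=> $b } map { $sum[$_] / $count[$_] } (0 .. $k - 1);
my ($lower, $upper) = ($avg[$m - 1], $avg[$k - $m]);

# The advertised content is suspicious if it falls outside the CI.
my $advertised = 12.5;
printf "CI: [%.3f, %.3f] -> advertised %.1f%% is %s the interval\n",
    $lower, $upper, $advertised,
    ($advertised >= $lower && $advertised <= $upper) ? "inside" : "outside";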
One way to better answer these statistical tests (when your high-level CI's can only be obtained by extrapolation from a small sample) is to combine two such tests: one involving the advertised 12.5% value, the other involving the suspected 13% value, each requiring only a moderate confidence level.
4. Miscellaneous
We include two figures in this section. The first one is about the data used in
our test and Excel spreadsheet, to produce our confidence intervals. And the
other figure shows the theorem that justifies the construction of our confidence intervals.
Figure 2: Simulated data used to compute CI's: asymmetric mixture of non-normal distributions