
Statistics Research Letters (SRL) Volume 4 2015 www.srljournal.org
doi:10.14355/srl.2015.04.003

Gap Measure Tests with Applications to Data Integrity Verification
Truc Le*1, Jeffrey Uhlmann2
Department of Computer Science, University of Missouri-Columbia, 201 EBW, Columbia, MO, USA
*1tdlxqb@mail.missouri.edu; 2uhlmannj@missouri.edu

Abstract
In this paper we propose and examine gap statistics for assessing uniform distribution hypotheses. We provide examples
relevant to data integrity testing for which max-gap statistics provide greater sensitivity than chi-square ($\chi^2$), thus
allowing the new test to be used in place of or as a complement to $\chi^2$ testing for purposes of distinguishing a larger
class of deviations from uniformity. We establish that the proposed max-gap test has the same sequential and parallel
computational complexity as $\chi^2$ and thus is applicable for Big Data analytics and integrity verification.
Keywords
Hypothesis Testing; Distribution Testing; Chi-Square Testing; Data Integrity; Big Data; Gap Statistics; Max-Gap; Min-Gap;
Gonzalez Algorithm; Closest Pair

Introduction
Distribution testing is a fundamental statistical problem that arises in a wide range of practical applications. At its
core, the problem is to assess whether a dataset that is assumed to comprise samples from a known probability
distribution is in fact consistent with that assumption. For example, if the end state of a computer simulation of a
physical system is a set of points with an expected physics-prescribed distribution, then any detected deviation
from that expected distribution could undermine confidence in the results obtained and possibly in the integrity of
the simulation system itself.
Data integrity verification is a related application for distribution testing in which the objective is to detect
evidence of tampering, e.g., human-altered data. For example, many sources of numerical data produce numbers
with first digits conforming to the Benford-Newcomb first-digit distribution (this phenomenon is often referred to
as Benford's Law) [1, 2], while digits other than the first and last are uniformly sampled from {0, ..., 9} [3]. Digits in
human-created numbers, by contrast, tend to exhibit high regularity, with all elements of {0, ..., 9} represented with
nearly equal cardinality. Statistically identified deviations of this kind have been used to uncover acts of scientific
misconduct and accounting fraud [4, 5, 6, 7, 8, 9], but there is an increasing need for higher-sensitivity tests.
There is of course no way to make an unequivocal binary assessment of whether a dataset of samples conforms to a
given distribution assumption, but it is possible to devise statistical tests which can assign a rigorous likelihood
estimate to the hypothesis that the dataset does (or does not) represent samples from the assumed distribution. In
this paper we briefly review the most widely used method for distribution testing, the chi-square ($\chi^2$) test, and
then develop alternative tests based on the statistics of gap widths between data items of consecutive rank. Our
principal contribution is a max-gap test which is shown to provide superior sensitivity to regularity deviations
from a uniform distribution that are relevant to data integrity testing [10, 11, 12]. We show that this test can be
evaluated with the same optimal computational complexity (serial and parallel) as the conventional $\chi^2$ test and is
therefore suitable for extremely large-scale datasets.
Chi-square Test
The $\chi^2$ test is a statistical measure that can be applied to a discrete dataset to assess the hypothesis that its
elements were sampled from a particular distribution. More specifically, it is a histogram-based method to measure
the goodness-of-fit between the observed frequency distribution and the expected (theoretical) frequency
distribution. The general procedure of the test includes the following steps:
1. Calculate the chi-square statistic, $\chi^2$, which is a normalized sum of squared differences (deviations)
   between observed and expected frequencies.

2. Determine the degrees of freedom, df, of that statistic, which is essentially the number of frequencies
   reduced by the number of parameters of the fitted distribution.

3. Compare $\chi^2$ with the critical value for the chi-square distribution with df degrees of freedom.

FIG. 1 COMPLEMENT OF THE CUMULATIVE DISTRIBUTION FUNCTION OF THE $\chi^2$ DISTRIBUTION, SHOWING $\chi^2$ ON THE x-AXIS
AND P-VALUE ON THE y-AXIS [13]

An example of the complement of the cumulative distribution function of the $\chi^2$ distribution is shown in Fig. 1
for different degrees-of-freedom values. For uniformity testing, the procedure can be expressed as follows:
1. Given $N$ observations, construct an $N$-bin histogram. Let $b_i$ be the bin count for the $i$-th bin ($i = 1, \ldots, N$),
   which is the observed frequency distribution. As we are testing for uniformity, the expected frequency
   distribution is $e_i = 1$, $i = 1, \ldots, N$.

2. Compute the chi-square test statistic:

   $$\chi^2 = \sum_{i=1}^{N} \frac{(b_i - e_i)^2}{e_i} = \sum_{i=1}^{N} (b_i - 1)^2 \tag{1}$$

3. The number of degrees of freedom, df, is $N-1$ for this case because the counts for $N-1$ bins uniquely
   determine the count for the remaining bin.

4. Compute the complement of the cumulative distribution function of the $\chi^2$ distribution with the $\chi^2$ and df
   obtained from the previous steps. Compare this value with the significance level $\alpha$ for the test result.
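
The four steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `chi_square_uniformity` is hypothetical, the samples are assumed to lie in [0, 1], and the p-value uses the Wilson-Hilferty normal approximation to the $\chi^2$ tail (an assumption of this sketch, chosen so that only the standard library is needed).

```python
import math

def chi_square_uniformity(samples):
    """Return (chi2, df, approximate p-value) for samples assumed to lie in [0, 1].

    N samples are binned into N equal-width bins, so every expected count is 1.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    counts = [0] * n
    for x in samples:
        counts[min(int(x * n), n - 1)] += 1  # clamp x == 1.0 into the last bin
    chi2 = float(sum((b - 1) ** 2 for b in counts))  # sum of (b_i - 1)^2
    df = n - 1
    # Wilson-Hilferty: (chi2/df)^(1/3) is roughly normal with
    # mean 1 - 2/(9 df) and variance 2/(9 df).
    var = 2.0 / (9.0 * df)
    z = ((chi2 / df) ** (1.0 / 3.0) - (1.0 - var)) / math.sqrt(var)
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail probability
    return chi2, df, p_value

# Usage: one sample per bin gives chi2 = 0; all samples in one bin gives N(N-1).
print(chi_square_uniformity([(i + 0.5) / 100 for i in range(100)]))
print(chi_square_uniformity([0.001] * 100))
```

If SciPy is available, `scipy.stats.chi2.sf(chi2, df)` could replace the approximation in the last step.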

Despite being the de facto standard for assessing dataset consistency with respect to a given distribution
assumption, the $\chi^2$ test is not optimally sensitive to the types of deviation from uniformity that arise in many data
integrity applications. One example involves narrowband missing data resulting from a corrupted sensor or
measurement process. Another example involves data that are generated from a nonrandom process and exhibit
a higher degree of regularity than is expected for a uniform distribution [14, 15]. Datasets of the latter kind are
typical of artificial and human-generated data, e.g., as in a forged dataset that has been tailored to include
deviations that qualitatively resemble (to humans) uniform random deviates. In the following section we
demonstrate the advantage of the proposed max-gap test over $\chi^2$ for narrowband and high-regularity deviations
from uniformity.
Max-gap Test
The maximum gap, or max-gap, for a dataset of real values is defined as the maximum difference between
elements of consecutive rank, which can be determined from a sorted ordering of the dataset. The distribution of
spacings between consecutive-rank items in a dataset has been examined in the literature [16, 17, 18, 19], and we
summarize here some of the results relevant to gap analysis. Assume we are given $N-1$ observations on the open
unit interval (0, 1) which divide the interval into $N$ intervals whose lengths in ascending order are denoted by
$S_{(1)} \le S_{(2)} \le \cdots \le S_{(N)}$. For uniformity testing, we are interested in $S_{(N)}$, as it is the max-gap of the observations. The
exact distribution of $S_{(N)}$ is [19]:
$$P(S_{(N)} \le x) = \sum_{v=0}^{N} (-1)^v \binom{N}{v} (1 - vx)_+^{N-1}, \tag{2}$$

where $(a)_+ = \max(a, 0)$.
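
For small $N$, the exact distribution can be evaluated directly. The sketch below assumes the alternating binomial sum $\sum_{v=0}^{N} (-1)^v \binom{N}{v} (1 - vx)_+^{N-1}$ and uses exact rational arithmetic, because the alternating terms cancel badly in floating point. The function name `max_gap_cdf_exact` is illustrative and does not come from the paper.

```python
from fractions import Fraction
from math import comb

def max_gap_cdf_exact(x, n_intervals):
    """P(S_(N) <= x) for N = n_intervals spacings (N - 1 uniform observations)."""
    if n_intervals < 2:
        raise ValueError("need at least two intervals")
    x = Fraction(x)
    if x < 0:
        return Fraction(0)
    total = Fraction(0)
    for v in range(n_intervals + 1):
        base = 1 - v * x
        if base <= 0:
            break  # (a)_+ = max(a, 0), so this and every later term is zero
        total += (-1) ** v * comb(n_intervals, v) * base ** (n_intervals - 1)
    return total

# Usage: with one observation U (two spacings), max(U, 1 - U) <= 3/4
# exactly when 1/4 <= U <= 3/4.
print(max_gap_cdf_exact(Fraction(3, 4), 2))
```

The cost of exact arithmetic grows quickly with $N$, so this form is only practical for small datasets.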
From the p-value of the max-gap $S_{(N)}$, denoted by $p$, we can perform a max-gap test for uniformity by checking
the condition $p \ge \alpha$ for the one-sided test, or $\alpha/2 \le p \le 1 - \alpha/2$ for the two-sided test, where $\alpha$ is the
significance level.
When $N$ is large, we may replace computation of the exact cumulative distribution of the max-gap in Eqn. 2 with
the following asymptotic result [19]:

$$P(S_{(N)} \le x) \approx e^{-e^{\ln N - Nx}}, \tag{3}$$

where the expected value of $S_{(N)}$ is

$$E[S_{(N)}] \approx \frac{\ln N + \gamma}{N}, \tag{4}$$

where $\gamma$ is Euler's constant.
An efficient max-gap test for uniformity can then be formalized as follows: Given $N-1$ observations $x_i$ and a
significance level $\alpha$, compute the max-gap $S_{(N)}$ of $\{0, 1\} \cup \{x_i\}$. Next, the p-value of the statistic is calculated as:

$$p = 1 - e^{-e^{\ln N - N S_{(N)}}} \tag{5}$$

If the p-value satisfies $p \ge \alpha$ for the one-sided test, or $\alpha/2 \le p \le 1 - \alpha/2$ for the two-sided test, the observations are
deemed to pass the test. Otherwise, the set of observations is assessed to be inconsistent with a uniform sampling
hypothesis and fails the test.
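
The procedure just described can be sketched as follows. The function names and the default $\alpha = 0.05$ are illustrative assumptions, the pass conditions are read as $p \ge \alpha$ (one-sided) and $\alpha/2 \le p \le 1 - \alpha/2$ (two-sided), and the max-gap is found by sorting for brevity.

```python
import math

EULER_GAMMA = 0.5772156649015329

def max_gap(observations):
    """Max-gap S_(N) of the observations together with the endpoints 0 and 1."""
    pts = sorted([0.0, 1.0, *observations])
    return max(b - a for a, b in zip(pts, pts[1:]))

def max_gap_p_value(observations):
    """Asymptotic p-value p = 1 - exp(-exp(ln N - N * S_(N))), N = number of intervals."""
    n = len(observations) + 1
    s = max_gap(observations)
    # -expm1(-t) equals 1 - exp(-t) but stays accurate when t is tiny.
    return -math.expm1(-math.exp(math.log(n) - n * s))

def passes_max_gap_test(observations, alpha=0.05, two_sided=False):
    p = max_gap_p_value(observations)
    if two_sided:
        return alpha / 2 <= p <= 1 - alpha / 2
    return p >= alpha

def expected_max_gap(n_intervals):
    """Approximate expected max-gap, (ln N + gamma) / N."""
    return (math.log(n_intervals) + EULER_GAMMA) / n_intervals

# Usage: evenly spaced points have an unusually small max-gap (p near 1),
# while points confined to [0, 0.5] leave a very wide gap (p near 0).
evenly_spaced = [i / 1000 for i in range(1, 1000)]
half_empty = [0.5 * i / 1000 for i in range(1, 1000)]
print(max_gap_p_value(evenly_spaced), max_gap_p_value(half_empty))
```

Under this reading, a p-value near 1 is only rejected by the two-sided form, since the one-sided condition bounds $p$ from below only.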

In the next section, we present results of experiments comparing the relative sensitivities of the $\chi^2$ test and the
max-gap test for, e.g., identifying anomalous regularity in a presumed uniform distribution.
Experiments
In this section, we compare the max-gap test with the best-known and most commonly used test, the $\chi^2$ test. We
conducted four experiments involving datasets of N = 10,000 samples, with the result for each experiment obtained
as an average of one million independent tests. Sensitivity is assessed by comparing the respective p-values for the
one-sided forms of the two tests, where smaller values indicate greater sensitivity. The first experiment was
performed using a dataset of samples from a true uniform distribution. As expected, the dataset passed both tests
for uniformity with p = 0.5.
The second experiment examined sensitivity to the difference between a uniform distribution and a normal
distribution with standard deviation $\sigma$ sampled within a fixed interval (0, 1). The distinctive shape of the normal
distribution is realized within the interval when $\sigma$ is small but flattens with increasing values and approaches
uniformity. Both tests are equally sensitive for small $\sigma$, and both approach p = 0.5 for large $\sigma$, but the $\chi^2$ test
exhibits higher sensitivity for intermediate values (see Fig. 2). The latter is not surprising because the $\chi^2$ test is
ideally sensitive to deviations from normality.


FIG. 2 THE P-VALUES OF THE $\chi^2$ TEST AND THE MAX-GAP TEST OF A NORMAL DISTRIBUTION SAMPLED WITHIN A FIXED
INTERVAL. WHEN THE STANDARD DEVIATION ($\sigma$) IS SMALL, BOTH TESTS EASILY IDENTIFY THE DATA'S NONUNIFORMITY. AS $\sigma$
INCREASES, THE DATA DISTRIBUTION APPROACHES UNIFORMITY WITHIN THE SAMPLE INTERVAL AND HENCE THE P-VALUES
CONVERGE TO 0.5. THIS IS AN EXAMPLE IN WHICH THE $\chi^2$ TEST PROVIDES INHERENTLY GREATER SENSITIVITY THAN
THE MAX-GAP TEST

The third experiment examined sensitivity of the two tests to a uniform distribution with a narrowband exclusion
(Fig. 3). This of course is a problem for which the max-gap test is ideally suited, and Eqn. 4 reveals that superior
sensitivity. What is possibly most interesting about the results is that the $\chi^2$ test provides only modest
sensitivity even as the exclusion width approaches one percent of the distribution window.
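
To make the scale of this experiment concrete, the sketch below generates uniform samples with a narrowband exclusion and compares the resulting max-gap with the approximate expected max-gap $(\ln N + \gamma)/N$. The paper does not state how the exclusion was generated, so the rejection-sampling approach, the band position 0.4, and the function names are all assumptions made for illustration.

```python
import math
import random

def uniform_with_exclusion(n, start, width, seed=0):
    """Draw n uniform samples on [0, 1) that avoid the band [start, start + width)."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        x = rng.random()
        if not (start <= x < start + width):  # reject anything inside the band
            samples.append(x)
    return samples

def max_gap(observations):
    """Largest spacing once the endpoints 0 and 1 are included."""
    pts = sorted([0.0, 1.0, *observations])
    return max(b - a for a, b in zip(pts, pts[1:]))

n_samples = 10_000
samples = uniform_with_exclusion(n_samples, start=0.4, width=0.01)
n_intervals = n_samples + 1
expected = (math.log(n_intervals) + 0.5772156649015329) / n_intervals
print("observed max-gap:", max_gap(samples))
print("approximate expected max-gap:", expected)
```

For 10,000 samples the expected max-gap is a little under 0.001, so a one-percent exclusion band is roughly ten times wider than the largest gap a uniform sample would typically contain.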

FIG. 3 P-VALUES OF THE $\chi^2$ TEST AND THE MAX-GAP TEST FOR NARROWBAND MISSING DATA. IN THIS CASE THE
MAX-GAP TEST PROVIDES INHERENTLY GREATER SENSITIVITY

FIG. 4 P-VALUES OF THE $\chi^2$ TEST AND THE MAX-GAP TEST FOR HIGH-REGULARITY DATA. REGULARITY FOR A DATASET
OF SIZE N IS PARAMETERIZED BY A NUMBER OF BINS K WITH N/K UNIFORM SAMPLES WITHIN EACH OF K EQUAL BINS.
THUS K = 1 GENERATES A UNIFORM DISTRIBUTION AND INCREASING K APPROACHES REGULAR SPACING. THE MAX-GAP
TEST AGAIN DEMONSTRATES GREATER SENSITIVITY

The fourth experiment is the most relevant to data integrity applications. It examined sensitivity to regularity in
sample spacing. Anomalous distribution regularity is a common characteristic of human-altered data because
people typically underestimate the degree of natural "clustering" that is present in data sampled from a truly
uniform distribution. As a consequence, human-created or human-altered data tend to have higher regularity, i.e.,
tend to be "more evenly distributed," than what is expected for uniformly distributed data. More generally, high-
regularity deviations from uniformity can arise from the unanticipated influence of a structured or nonrandom
process, e.g., frequency-combing effects from a physical sensor or simulation artifacts resulting from a low-quality
pseudorandom number generator.


A regularity parameter $1 \le k \le N$ was used for this experiment by uniformly distributing N/k samples within each
of k equal-width subdivisions of the distribution interval. Thus, k = 1 represents a uniform sampling over the entire
interval and produces a uniform distribution; and as k increases to N, the spacing between samples becomes
increasingly regular. Although uniform and high-regularity distributions are difficult for humans to distinguish
visually, Fig. 4 shows that the max-gap test provides significantly higher sensitivity than $\chi^2$ to subtle regularity
deviations from uniformity.
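
The regularity parameter can be illustrated with a short generator. This is a sketch under the assumption that k divides N evenly; the function name `regular_samples` and the seeds are hypothetical.

```python
import random

def regular_samples(n, k, seed=0):
    """Place n/k uniform samples inside each of k equal-width bins of [0, 1)."""
    if not 1 <= k <= n or n % k != 0:
        raise ValueError("k must satisfy 1 <= k <= n and divide n")
    rng = random.Random(seed)
    per_bin = n // k
    width = 1.0 / k
    # Bin b covers [b * width, (b + 1) * width).
    return [(b + rng.random()) * width for b in range(k) for _ in range(per_bin)]

def max_gap(observations):
    """Largest spacing once the endpoints 0 and 1 are included."""
    pts = sorted([0.0, 1.0, *observations])
    return max(b - a for a, b in zip(pts, pts[1:]))

# Usage: the max-gap shrinks as k grows, because no gap can span more than
# two adjacent bins once every bin is occupied.
for k in (1, 100, 10_000):
    print(k, max_gap(regular_samples(10_000, k)))
```

With k = N every bin holds exactly one sample, so the max-gap cannot exceed 2/N, which is well below the roughly (ln N)/N expected for truly uniform data.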
Min-Gap Test
The one-sided variants of the max-gap and $\chi^2$ tests were used because they provide a practical balance between
high sensitivity and low false alarm rates, but the one-sided or two-sided form of either test may provide the optimal
tradeoff for the needs of a particular application. In some applications, the optimal tradeoff might be
obtained from a min-gap, $S_{(1)}$, test. The min-gap approximated distribution is given by [19]
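$$P(S_{(1)} \le x) \approx e^{-e^{\ln N - Nx}} \sum_{v=0}^{N-1} \frac{e^{(\ln N - Nx)v}}{v!}, \tag{6}$$

and its expected value is [19]

$$E[S_{(1)}] \approx \frac{1}{N}\left(\ln N + \gamma - \sum_{i=1}^{N-1} \frac{1}{i}\right). \tag{7}$$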

A min-gap test can be defined and performed analogously to the max-gap test and would be ideally suited for
detecting spuriously replicated data items. However, several simpler nonstatistical methods can be applied to
detect replicated data, so the potential applications of the min-gap test may be somewhat more limited than the
max-gap test.
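
A min-gap check for replicated items might look like the sketch below. Instead of the approximation cited above, it uses the standard exact result for uniform spacings, $P(S_{(1)} \le s) = 1 - (1 - Ns)_+^{N-1}$; that substitution and the function names are assumptions of this illustration, not the paper's formulation.

```python
def min_gap(observations):
    """Min-gap S_(1) of the observations together with the endpoints 0 and 1."""
    pts = sorted([0.0, 1.0, *observations])
    return min(b - a for a, b in zip(pts, pts[1:]))

def min_gap_lower_tail(observations):
    """Exact P(S_(1) <= s) for N uniform spacings, evaluated at the observed min-gap s."""
    n = len(observations) + 1  # number of intervals
    s = min_gap(observations)
    return 1.0 - max(1.0 - n * s, 0.0) ** (n - 1)

# Usage: a replicated item produces a min-gap of exactly zero, so the
# lower-tail probability is zero and the dataset is flagged.
print(min_gap_lower_tail([0.1, 0.3, 0.3, 0.7]))
print(min_gap_lower_tail([0.2, 0.4, 0.6, 0.8]))
```

A lower-tail probability near 0 points to spuriously close or duplicated items, while a value near 1 indicates spacing that is more even than uniform sampling would usually produce.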
Computational Considerations
In terms of computational complexity, both $\chi^2$ and max-gap tests can be evaluated in optimal O(N) time and O(N)
space. This complexity is achieved for max-gap by the use of the Gonzalez algorithm [20, 21] to determine the max-
gap in linear time without sorting. The Gonzalez algorithm performs a special binning which guarantees by the
pigeonhole principle that the max-gap data items will be found as the maximum and minimum values, respectively,
in consecutive nonempty bins. This algorithm allows the max-gap test to be evaluated in optimal O(N) time and
space, i.e., the same as $\chi^2$, and is as efficiently parallelizable as the $\chi^2$ test (the max-gap and $\chi^2$ tests are both
highly amenable to parallelization with O(N/P) time complexity on P processors).
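
The pigeonhole binning idea can be sketched as follows. This is an illustrative reconstruction of the general technique, with a hypothetical function name, and is not claimed to match the details of [20, 21]: with n values and bins as wide as the average gap, two values in the same bin are closer than the average gap, so the max-gap must separate the maximum of one nonempty bin from the minimum of the next nonempty bin.

```python
import random

def max_gap_linear(values):
    """Largest difference between consecutive-rank values, found without sorting."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    lo, hi = min(values), max(values)
    if lo == hi:
        return 0.0
    width = (hi - lo) / (n - 1)  # average gap; the max-gap is at least this large
    bin_min = [None] * n
    bin_max = [None] * n
    for x in values:
        b = min(int((x - lo) / width), n - 1)  # hi lands in one of the last bins
        if bin_min[b] is None or x < bin_min[b]:
            bin_min[b] = x
        if bin_max[b] is None or x > bin_max[b]:
            bin_max[b] = x
    best = 0.0
    prev_max = bin_max[0]  # bin 0 always contains lo
    for b in range(1, n):
        if bin_min[b] is None:
            continue  # skip empty bins
        best = max(best, bin_min[b] - prev_max)
        prev_max = bin_max[b]
    return best

# Usage: prepend 0 and append 1 to obtain the max-gap used by the test.
rng = random.Random(0)
data = [0.0, 1.0] + [rng.random() for _ in range(1000)]
print(max_gap_linear(data))
```

Each value is touched a constant number of times, so the running time and extra space are both linear in the number of values.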
The min-gap pair needed to implement a min-gap test can be identified in optimal expected O(N) time and
space using Rabin's randomized closest-pair algorithm [22, 23]. Unlike the Gonzalez algorithm for max-gap, Rabin's
algorithm generalizes efficiently to higher dimensions.
Discussion and Future Work
We have defined and developed a max-gap test for distinguishing deviations from uniformity in a 1D dataset of size
N. By using Gonzalez's algorithm, we have shown that this test can be performed with commensurate efficiency,
both serially and in parallel, with the conventional $\chi^2$ test. Our experiments demonstrate that the max-gap test
provides improved sensitivity in two particular applications of relevance to data integrity verification. More
generally, the proposed max-gap and min-gap tests are of potential value as alternatives to, or complements of,
$\chi^2$ for distribution testing and discrimination.
There are many statistical tests for equality of distributions beyond the $\chi^2$ test, such as the Kolmogorov-Smirnov test
[24, 25, 26] and the Cramer-von Mises test [27]. Of course there can be no test that is uniformly superior to all others
for all possible distributions, but it appears that most of the standard tests examined in the literature would be
challenged similarly to the $\chi^2$ test to distinguish uniform from regularly distributed data.
Potential future work could consider tests which jointly combine gap and $\chi^2$ statistics into a more sophisticated
single test [28] which allows greater flexibility to optimize the sensitivity and false alarm tradeoff for problems of
high practical interest, e.g., big data analytics and integrity verification. On the algorithmic side, we have pointed
out that the Gonzalez algorithm does not generalize to higher dimensions; however, relatively efficient subquadratic
algorithms do exist for solving the largest empty circle and largest empty rectangle problems in two dimensions,
such as the algorithms in [29, 30]. Tests on 2D distributions could also potentially exploit information about the
largest empty region of a Voronoi decomposition or the distribution of nearest-neighbor distances from a Delaunay
triangulation. In d > 2 dimensions, it may be possible to devise gap-related statistical tests based on results from
efficient algorithms for identifying approximations to the largest empty d-sphere or d-rectangle, but this is purely
speculative. In higher dimensions, it may be better to abandon gap-type statistics and focus on statistics gleaned
from efficiently computable k-d and orthant (quad, octant, etc.) tree decompositions of point sets.
If computational efficiency is less of a concern, a perhaps more fruitful direction for highly sensitive distribution
testing in high dimensions is to examine the length of the Euclidean minimum spanning tree (EMST) for a dataset.
The expected length of the EMST of uniformly distributed points can be determined using analysis similar to what
has been described in this paper for estimating the expected values for the max and min gaps in 1D, and we
conjecture that EMST length is likely to be more sensitive to many practically important types of deviations from
uniformity than the conventional $\chi^2$ test. Such an EMST test would be computationally expensive (though
subquadratic), but this cost could be justified in applications for which subtle deviations are critically important, e.g.,
high-fidelity physics simulations.
REFERENCES

[1] M. Nigrini and J. Wells, Benford's Law: Applications for Forensic Accounting, Auditing, and Fraud Detection, ser. Wiley
Corporate F&A. Wiley, 2012. [Online]. Available: http://books.google.com/books?id=FdRPh787I7oC
[2] C. Winter, M. Schneider, and Y. Yannikos, "Model-based digit analysis for fraud detection overcomes limitations of
Benford analysis," Seventh International Conference on Availability, Reliability and Security, vol. 0, pp. 255-261, 2012.
[3] S. Dlugosz and U. Müller-Funk, "The value of the last digit: Statistical fraud detection with digit analysis," Advances in
Data Analysis and Classification, vol. 3, no. 3, pp. 281-290, 2009. [Online]. Available:
http://dx.doi.org/10.1007/s11634-009-0048-5
[4] R. Pirracchio, M. Resche-Rigon, S. Chevret, and D. Journois, "Do simple screening statistical tools help to detect reporting
bias?" Annals of Intensive Care, vol. 3, no. 1, 2013. [Online]. Available: http://dx.doi.org/10.1186/2110-5820-3-29
[5] R. J. Bolton and D. J. Hand, "Statistical fraud detection: A review," Statistical Science, vol. 17, no. 3, pp. 235-255, August
2002. [Online]. Available: http://dx.doi.org/10.1214/ss/1042727940
[6] N. Kingston and A. Clark, Test Fraud: Statistical Detection and Methodology, ser. Routledge Research in Education. Taylor
& Francis, 2014. [Online]. Available: http://books.google.com/books?id=3fzpAwAAQBAJ
[7] M. Nigrini, Forensic Analytics: Methods and Techniques for Forensic Accounting Investigations, ser. Wiley Corporate
F&A. Wiley, 2011. [Online]. Available: http://books.google.com/books?id=ct9CB4eJCXYC
[8] A. Diekmann, Methodological Artefacts, Data Manipulation and Fraud in Economics and Social Science, ser. Jahrbucher
Nationalokonomie und Statistik. Lucius & Lucius, 2011. [Online]. Available:
http://books.google.com/books?id=vzJlczjAz4sC
[9] L. Leemann and D. Bochsler, "A systematic approach to study electoral fraud," Electoral Studies, vol. 35, no. 0, pp. 33-47,
2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0261379414000390
[10] Y. Fujii, "The analysis of 168 randomised controlled trials to test data integrity," Anaesthesia, vol. 67, no. 6, pp. 669-670,
2012. [Online]. Available: http://dx.doi.org/10.1111/j.1365-2044.2012.07189.x


[11] S. Al-Marzouki, S. Evans, T. Marshall, and I. Roberts, "Are these data real? Statistical methods for the detection of data
fabrication in clinical trials," BMJ, vol. 331, no. 7511, pp. 267-270, 2005.
[12] U. Simonsohn, "Just post it: The lesson from two cases of fabricated data detected by statistics alone," Psychological Science,
vol. 24, no. 10, pp. 1875-1888, 2013.
[13] M. Haggstrom. (2010) Complement of chi-square cumulative distribution. [Online]. Available:
http://en.wikipedia.org/wiki/File:Chi-square_distributionCDF-English.png
[14] J. H. Pitt and H. Z. Hill, "Statistical detection of potentially fabricated data: A case study," ArXiv e-prints, November 2013.
[15] H. Z. Hill and J. H. Pitt, "Failure to replicate: A sign of scientific misconduct?" Publications, vol. 2, no. 3, pp. 71-82, 2014.
[Online]. Available: http://www.mdpi.com/2304-6775/2/3/71
[16] D. A. Darling, "On a class of problems related to the random division of an interval," The Annals of Mathematical
Statistics, vol. 24, no. 2, pp. 239-253, June 1953.
[17] R. Pyke, "Spacings," Journal of the Royal Statistical Society, pp. 395-449, 1965.
[18] R. Pyke, "Spacings revisited," pp. 417-427, 1972.
[19] L. Holst, "On the lengths of the pieces of a stick broken at random," Journal of Applied Probability, pp. 623-634,
September 1980.
[20] T. Gonzalez, "Algorithms on sets and related problems," Department of Computer Science, University of Oklahoma,
Norman, OK, Tech. Rep., 1975.
[21] T. Gonzalez, "Clustering to minimize the maximum intercluster distance," Theoretical Computer Science, vol. 38, pp. 293-
306, 1985.
[22] M. Golin, R. Raman, C. Schwarz, and M. Smid, "Simple randomized algorithms for closest pair problems," Nordic Journal
of Computing, vol. 2, no. 1, pp. 3-27, March 1995.
[23] R. Lipton, "Rabin flips a coin," in The P=NP Question and Gödel's Lost Letter. Springer US, 2010, pp. 77-80.
[24] Z. W. Birnbaum and F. H. Tingey, "One-sided confidence contours for probability distribution functions," The Annals of
Mathematical Statistics, vol. 22, no. 4, pp. 592-596, December 1951. [Online]. Available:
http://dx.doi.org/10.1214/aoms/1177729550
[25] W. Conover, Practical nonparametric statistics, ser. Wiley series in probability and statistics: Applied probability and
statistics. Wiley, 1999. [Online]. Available: https://books.google.com/books?id=dYEpAQAAMAAJ
[26] G. Marsaglia, W. W. Tsang, and J. Wang, "Evaluating Kolmogorov's distribution," Journal of Statistical Software, vol. 8, no.
18, pp. 1-4, 2003. [Online]. Available: http://www.jstatsoft.org/index.php/jss/article/view/v008i18
[27] H. Cramer, "On the composition of elementary errors," Scandinavian Actuarial Journal, vol. 1928, no. 1, pp. 13-74, 1928.
[Online]. Available: http://dx.doi.org/10.1080/03461238.1928.10416862
[28] D. Maynes, "Combining statistical evidence for increased power in detecting cheating," in Presentation given at the annual
meeting of the National Council of Measurement in Education, San Diego, CA, April, 2009.
[29] B. Chazelle, R. L. Drysdale, and D. T. Lee, "Computing the largest empty rectangle," SIAM Journal of Computing, vol. 15,
no. 1, pp. 300-315, February 1986.
[30] A. Naamad, D. Lee, and W. Hsu, "On the maximum empty rectangle problem," Discrete Applied Mathematics, vol. 8, no.
3, pp. 267-277, 1984.
