Statistics Research Letters (SRL), Volume 4, 2015
www.srljournal.org
doi:10.14355/srl.2015.04.003
Gap-Measure Tests with Applications to Data Integrity Verification
Truc Le*1, Jeffrey Uhlmann2
Department of Computer Science, University of Missouri-Columbia, 201 EBW, Columbia, MO, USA
*1tdlxqb@mail.missouri.edu; 2uhlmannj@missouri.edu
Abstract
In this paper we propose and examine gap statistics for assessing uniform distribution hypotheses. We provide examples relevant to data integrity testing for which max-gap statistics provide greater sensitivity than chi-square ($\chi^2$), thus allowing the new test to be used in place of, or as a complement to, $\chi^2$ testing for purposes of distinguishing a larger class of deviations from uniformity. We establish that the proposed max-gap test has the same sequential and parallel computational complexity as $\chi^2$ and thus is applicable for Big Data analytics and integrity verification.
Keywords
Hypothesis Testing; Distribution Testing; Chi-Square Testing; Data Integrity; Big Data; Gap Statistics; Max-Gap; Min-Gap; Gonzalez Algorithm; Closest Pair
Introduction
Distribution testing is a fundamental statistical problem that arises in a wide range of practical applications. At its core, the problem is to assess whether a dataset that is assumed to comprise samples from a known probability distribution is in fact consistent with that assumption. For example, if the end state of a computer simulation of a physical system is a set of points with an expected physics-prescribed distribution, then any detected deviation from that expected distribution could undermine confidence in the results obtained and possibly in the integrity of the simulation system itself.
Data integrity verification is a related application for distribution testing in which the objective is to detect evidence of tampering, e.g., human-altered data. For example, many sources of numerical data produce numbers with first digits conforming to the Benford-Newcomb first-digit distribution (this phenomenon is often referred to as Benford's Law) [1, 2], while digits other than the first and last are uniformly sampled from {0, ..., 9} [3]. Digits in human-created numbers, by contrast, tend to exhibit high regularity, with all elements of {0, ..., 9} represented with nearly equal cardinality. Statistically identified deviations of this kind have been used to uncover acts of scientific misconduct and accounting fraud [4, 5, 6, 7, 8, 9], but there is an increasing need for higher-sensitivity tests.
There is of course no way to make an unequivocal binary assessment of whether a dataset of samples conforms to a given distribution assumption, but it is possible to devise statistical tests which can assign a rigorous likelihood estimate to the hypothesis that the dataset does (or does not) represent samples from the assumed distribution. In this paper we briefly review the most widely used method for distribution testing, the chi-square ($\chi^2$) test, and then develop alternative tests based on the statistics of gap widths between data items of consecutive rank. Our principal contribution is a max-gap test which is shown to provide superior sensitivity to regularity deviations from a uniform distribution that are relevant to data integrity testing [10, 11, 12]. We show that this test can be evaluated with the same optimal computational complexity (serial and parallel) as the conventional $\chi^2$ test and is therefore suitable for extremely large-scale datasets.
Chi-square Test
The $\chi^2$ test is a statistical measure that can be applied to a discrete dataset to assess the hypothesis that its elements were sampled from a particular distribution. More specifically, it is a histogram-based method to measure the goodness-of-fit between the observed frequency distribution and the expected (theoretical) frequency distribution. The general procedure of the test includes the following steps:
1. Calculate the chi-square statistic, $\chi^2$, which is a normalized sum of squared differences (deviations) between observed and expected frequencies.
2. Determine the degrees of freedom, df, of that statistic, which is essentially the number of frequencies reduced by the number of parameters of the fitted distribution.
3. Compare $\chi^2$ with the critical value for the chi-square distribution with df degrees of freedom.
An example of the complement of the cumulative distribution function of the $\chi^2$ distribution is shown in Fig. 1 with different degrees-of-freedom values. For uniformity testing, the procedure can be expressed as follows:
1. Given N observations, construct an N-bin histogram. Let $b_i$ be the bin count for the ith bin ($i = 1, \ldots, N$), which gives the observed frequency distribution. As we are testing for uniformity, the expected frequency distribution is $e_i = 1$, $i = 1, \ldots, N$.
2. Compute the chi-square test statistic:
$$\chi^2 = \sum_{i=1}^{N} \frac{(b_i - e_i)^2}{e_i} = \sum_{i=1}^{N} (b_i - 1)^2 \qquad (1)$$
3. The number of degrees of freedom, df, is $N - 1$ for this case because the counts for $N - 1$ bins uniquely determine the count for the remaining bin.
4. Compute the complement of the cumulative distribution function of the $\chi^2$ distribution with the $\chi^2$ and df obtained from the previous steps. Compare this value with the significance level $\alpha$ for the test result.
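The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the samples are taken to lie in [0, 1), the function names are hypothetical, and the complement of the $\chi^2$ cumulative distribution is replaced by the Wilson-Hilferty normal approximation (a substitute chosen here so that the sketch needs only the standard library; it is not taken from the paper).

```python
import math
import random

def chi_square_uniformity(samples):
    """Chi-square statistic and df for samples in [0, 1), using N bins for N samples."""
    n = len(samples)
    counts = [0] * n
    for x in samples:
        counts[min(int(x * n), n - 1)] += 1    # bin index, clamped in case x rounds up to 1
    stat = sum((b - 1) ** 2 for b in counts)   # Eqn. 1 with expected count e_i = 1
    return stat, n - 1

def chi_square_sf_approx(stat, df):
    """Approximate P(chi2_df >= stat) via the Wilson-Hilferty normal approximation."""
    scale = 2.0 / (9.0 * df)
    z = ((stat / df) ** (1.0 / 3.0) - (1.0 - scale)) / math.sqrt(scale)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

rng = random.Random(0)                         # fixed seed, for repeatability only
data = [rng.random() for _ in range(10_000)]
stat, df = chi_square_uniformity(data)
p_value = chi_square_sf_approx(stat, df)
print(stat, df, p_value)
```

The resulting value would then be compared with the chosen significance level, as in step 4.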
Despite being the de facto standard for assessing dataset consistency with respect to a given distribution assumption, the $\chi^2$ test is not optimally sensitive to the types of deviation from uniformity that arise in many data integrity applications. One example involves narrow-band missing data resulting from a corrupted sensor or measurement process. Another example involves data that are generated from a nonrandom process and exhibit a higher degree of regularity than is expected for a uniform distribution [14, 15]. Datasets of the latter kind are typical of artificial and human-generated data, e.g., as in a forged dataset that has been tailored to include deviations that qualitatively resemble (to humans) uniform random deviates. In the following section we demonstrate the advantage of the proposed max-gap test over $\chi^2$ for narrow-band and high-regularity deviations from uniformity.
Max-gap Test
The maximum gap, or max-gap, for a dataset of real values is defined as the maximum difference between elements of consecutive rank, which can be determined from a sorted ordering of the dataset. The distribution of spacings between consecutive-rank items in a dataset has been examined in the literature [16, 17, 18, 19], and we summarize here some of the results relevant to gap analysis. Assume we are given $N - 1$ observations on the open unit interval (0, 1) which divide the interval into N intervals whose lengths in ascending order are denoted by $S_{(1)} \le S_{(2)} \le \cdots \le S_{(N)}$. For uniformity testing, we are interested in $S_{(N)}$, as it is the max-gap of the observations. The exact distribution of $S_{(N)}$ is [19]:
$$P(S_{(N)} \le x) = \sum_{v=0}^{N} (-1)^v \binom{N}{v} (1 - vx)_+^{N-1}, \qquad (2)$$
where $a_+ = \max(a, 0)$.
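As a small check on the exact distribution, the alternating binomial sum with terms $(-1)^v \binom{N}{v} (1 - vx)_+^{N-1}$ can be evaluated directly. The sketch below is an illustration only: the function name is hypothetical, and exact rational arithmetic is used because the alternating terms cancel heavily in floating point, which also makes it practical only for small N.

```python
from fractions import Fraction
from math import comb

def max_gap_cdf_exact(x, n_intervals):
    """P(S_(N) <= x) for the largest of N = n_intervals spacings of the unit interval."""
    x = Fraction(x)
    total = Fraction(0)
    for v in range(n_intervals + 1):
        base = 1 - v * x
        if base <= 0:              # (1 - v x)_+ is zero for this v and every larger v
            break
        total += (-1) ** v * comb(n_intervals, v) * base ** (n_intervals - 1)
    return float(total)

# One point splits (0, 1) into two pieces; the larger piece is <= 0.75 half the time.
print(max_gap_cdf_exact(0.75, 2))
```

For two intervals the sum reduces to $2x - 1$ on $[1/2, 1]$, which is the distribution of the larger piece of a stick broken at one uniformly random point.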
From the p-value of the max-gap $S_{(N)}$, denoted by p, we can perform a max-gap test for uniformity by comparing p with $\alpha$ for the one-sided test, or with $\alpha/2$ and $1 - \alpha/2$ for the two-sided test, where $\alpha$ is the significance level.
When N is large, we may replace computation of the exact cumulative distribution of the max-gap in Eqn. 2 with the following asymptotic result [19]:
$$P(S_{(N)} \le x) \approx e^{-e^{\ln N - Nx}}, \qquad (3)$$
and the expected value of $S_{(N)}$ is
$$E[S_{(N)}] \approx \frac{\ln N + \gamma}{N}, \qquad (4)$$
where $\gamma$ is Euler's constant.
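As a rough numerical illustration of the asymptotic expressions, take N = 10,000 (the dataset size used later in the experiments) and evaluate the distribution at x = 0.001, a value chosen here only for the example, with $\gamma \approx 0.5772$:

```latex
\begin{align*}
E[S_{(N)}] &\approx \frac{\ln 10^4 + \gamma}{10^4}
            \approx \frac{9.2103 + 0.5772}{10^4}
            \approx 9.79 \times 10^{-4}, \\
P(S_{(N)} \le 0.001) &\approx \exp\!\left(-e^{\ln 10^4 - 10}\right)
            = \exp\!\left(-10^4 e^{-10}\right)
            \approx \exp(-0.4540)
            \approx 0.635 .
\end{align*}
```

In other words, for ten thousand intervals the largest gap is expected to be roughly one thousandth of the unit interval.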
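An efficient max-gap test for uniformity can then be formalized as follows: Given $N - 1$ observations $x_i$ and a significance level $\alpha$, determine the max-gap $S_{(N)}$ and evaluate the asymptotic distribution of Eqn. 3 at $x = S_{(N)}$,
$$P(S_{(N)} \le x)\big|_{x = S_{(N)}} \approx e^{-e^{\ln N - N S_{(N)}}}, \qquad (5)$$
from which the p-value p is obtained and compared with $\alpha$.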
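A minimal sketch of such a test is given below, assuming the samples lie in (0, 1), that the interval endpoints 0 and 1 bound the first and last gaps, and that the asymptotic form $e^{-e^{\ln N - Ns}}$ is adequate for the N in question. The function names are hypothetical, a sort is used for simplicity, and both tail probabilities are returned because the choice of tail depends on the deviation being tested for (a small lower tail flags an unusually small max-gap, a small upper tail an unusually large one).

```python
import math
import random

def max_gap(samples):
    """Largest of the N spacings that N - 1 samples in (0, 1) cut from the unit interval."""
    points = [0.0] + sorted(samples) + [1.0]
    return max(b - a for a, b in zip(points, points[1:]))

def max_gap_p_values(samples):
    """Asymptotic lower- and upper-tail probabilities of the observed max-gap."""
    n = len(samples) + 1                       # number of intervals N
    s = max_gap(samples)
    t = math.exp(math.log(n) - n * s)          # e^(ln N - N s); s >= 1/N, so no overflow
    lower = math.exp(-t)                       # P(S_(N) <= s)
    upper = -math.expm1(-t)                    # 1 - lower, computed without cancellation
    return lower, upper

rng = random.Random(1)                         # fixed seed, for repeatability only
uniform = [rng.random() for _ in range(9_999)]
print(max_gap_p_values(uniform))
```

Either tail probability can then be compared with the significance level, or both with half of it for a two-sided test.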
In the next section, we present results of experiments comparing the relative sensitivities of the $\chi^2$ test and the max-gap test for, e.g., identifying anomalous regularity in a presumed uniform distribution.
Experiments
In this section, we compare the max-gap test with the well-known and commonly used $\chi^2$ test. We conducted four experiments involving datasets of N = 10,000 samples, with the result for each experiment obtained as an average of one million independent tests. Sensitivity is assessed by comparing the respective p-values for the one-sided forms of the two tests, where smaller values indicate greater sensitivity. The first experiment was performed using a dataset of samples from a true uniform distribution. As expected, the dataset passed both tests for uniformity with p = 0.5.
The second experiment examined sensitivity to the difference between a uniform distribution and a normal distribution with standard deviation $\sigma$ sampled within a fixed interval (0, 1). The distinctive shape of the normal distribution is realized within the interval when $\sigma$ is small but flattens with increasing $\sigma$ values and approaches uniformity. Both tests are equally sensitive for small $\sigma$, and both approach p = 0.5 for large $\sigma$, but the $\chi^2$ test exhibits higher sensitivity for intermediate values (see Fig. 2). The latter is not surprising because the $\chi^2$ test is ideally sensitive to deviations from normality.
FIG. 2 THE P-VALUES OF THE $\chi^2$ TEST AND THE MAX-GAP TEST OF A NORMAL DISTRIBUTION SAMPLED WITHIN A FIXED INTERVAL. WHEN THE STANDARD DEVIATION ($\sigma$) IS SMALL, BOTH TESTS EASILY IDENTIFY THE DATA'S NON-UNIFORMITY. AS $\sigma$ INCREASES, THE DATA DISTRIBUTION APPROACHES UNIFORMITY WITHIN THE SAMPLE INTERVAL AND HENCE THE P-VALUES CONVERGE TO 0.5. THIS IS AN EXAMPLE IN WHICH THE $\chi^2$ TEST PROVIDES INHERENTLY GREATER SENSITIVITY THAN THE MAX-GAP TEST.
The third experiment examined sensitivity of the two tests to a uniform distribution with a narrow-band exclusion (Fig. 3). This of course is a problem for which the max-gap test is ideally suited, and Eqn. 4 indicates why it is more sensitive: for N = 10,000 the expected max-gap is roughly 0.001, so an empty band much wider than this is highly improbable under uniformity. What is possibly most interesting about the results is that the $\chi^2$ test provides only modest sensitivity even as the exclusion width approaches one percent of the distribution window.
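The narrow-band setting can be illustrated with a short sketch. How the exclusion data were generated for the experiment is not specified, so the rejection sampler, the band position (0.40), the band width (0.005), and the function names below are all assumptions made only for this example.

```python
import math
import random

def uniform_with_exclusion(n, band_start, band_width, rng):
    """Draw n uniform samples on [0, 1) by rejection, skipping one excluded band."""
    samples = []
    while len(samples) < n:
        x = rng.random()
        if not (band_start <= x < band_start + band_width):
            samples.append(x)
    return samples

def max_gap(samples):
    """Largest spacing between consecutive points, with 0 and 1 as interval endpoints."""
    points = [0.0] + sorted(samples) + [1.0]
    return max(b - a for a, b in zip(points, points[1:]))

rng = random.Random(2)                          # fixed seed, for repeatability only
n = 9_999                                       # N - 1 observations, so N = 10,000 intervals
data = uniform_with_exclusion(n, 0.40, 0.005, rng)
expected_gap = (math.log(n + 1) + 0.5772156649) / (n + 1)   # Eqn. 4
print(max_gap(data), expected_gap)
```

The observed max-gap is at least the width of the excluded band, which here is several times the value expected under uniformity.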
FIG. 3 P-VALUES OF THE $\chi^2$ TEST AND THE MAX-GAP TEST FOR NARROW-BAND MISSING DATA. IN THIS CASE THE MAX-GAP TEST PROVIDES INHERENTLY GREATER SENSITIVITY.
FIG. 4 P-VALUES OF THE $\chi^2$ TEST AND THE MAX-GAP TEST FOR HIGH-REGULARITY DATA. REGULARITY FOR A DATASET OF SIZE N IS PARAMETERIZED BY A NUMBER OF BINS K WITH N/K UNIFORM SAMPLES WITHIN EACH OF K EQUAL BINS. THUS K = 1 GENERATES A UNIFORM DISTRIBUTION AND INCREASING K APPROACHES REGULAR SPACING. THE MAX-GAP TEST AGAIN DEMONSTRATES GREATER SENSITIVITY.
The fourth experiment is the most relevant to data integrity applications. It examined sensitivity to regularity in sample spacing. Anomalous distribution regularity is a common characteristic of human-altered data because people typically underestimate the degree of natural "clustering" that is present in data sampled from a truly uniform distribution. As a consequence, human-created or human-altered data tend to have higher regularity, i.e., tend to be "more evenly distributed," than what is expected for uniformly distributed data. More generally, high-regularity deviations from uniformity can arise from the unanticipated influence of a structured or nonrandom process, e.g., frequency-combing effects from a physical sensor or simulation artifacts resulting from a low-quality pseudorandom number generator.
A regularity parameter $1 \le k \le N$ was used for this experiment by uniformly distributing N/k samples within each of k equal-width subdivisions of the distribution interval. Thus, k = 1 represents a uniform sampling over the entire interval and produces a uniform distribution, and as k increases to N, the spacing between samples becomes increasingly regular. Although uniform and high-regularity distributions are difficult for humans to distinguish visually, Fig. 4 shows that the max-gap test provides significantly higher sensitivity than $\chi^2$ to subtle regularity deviations from uniformity.
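The regularity parameterization can be sketched as follows. The generator is an illustration that assumes k divides N evenly; the function names, the seed, and the particular k values are hypothetical choices for the example.

```python
import random

def regular_samples(n, k, rng):
    """n // k uniform draws inside each of k equal-width subdivisions of (0, 1)."""
    per_bin = n // k                           # assumes k divides n
    width = 1.0 / k
    return [(j + rng.random()) * width for j in range(k) for _ in range(per_bin)]

def max_gap(samples):
    """Largest spacing between consecutive points, with 0 and 1 as interval endpoints."""
    points = [0.0] + sorted(samples) + [1.0]
    return max(b - a for a, b in zip(points, points[1:]))

rng = random.Random(4)                         # fixed seed, for repeatability only
n = 10_000
for k in (1, 100, 10_000):
    print(k, max_gap(regular_samples(n, k, rng)))
```

With k = N each subdivision holds exactly one sample, so no gap can exceed twice the subdivision width, which is far below the max-gap expected for uniformly distributed data of the same size.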
Min-gap Test
The one-sided variants of the max-gap and $\chi^2$ tests were used because they provide a practical balance between high sensitivity and low false-alarm rates, but the one-sided or two-sided form of either test may provide the optimal tradeoff for the needs of a particular application. In some applications, the optimal tradeoff might be obtained from a min-gap, $S_{(1)}$, test. The approximate distribution of the min-gap is given by [19]
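$$P(S_{(1)} \le x) \approx e^{-e^{\ln N - Nx}} \sum_{v=0}^{N-1} \frac{e^{(\ln N - Nx)v}}{v!}, \qquad (6)$$
and its expected value is [19]
$$E[S_{(1)}] \approx \frac{1}{N}\left(\ln N + \gamma - \sum_{i=1}^{N-1} \frac{1}{i}\right). \qquad (7)$$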
A min-gap test can be defined and performed analogously to the max-gap test and would be ideally suited for detecting spuriously replicated data items. However, several simpler nonstatistical methods can be applied to detect replicated data, so the potential applications of the min-gap test may be somewhat more limited than those of the max-gap test.
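A min-gap check can be sketched in the same style as the max-gap test. The tail probability used below is not the approximation cited above: it is the standard closed form for the smallest of N uniform spacings, $P(S_{(1)} \le s) = 1 - (1 - Ns)_+^{N-1}$, substituted here as an assumption because it is simple to evaluate. The function names are hypothetical.

```python
def min_gap(samples):
    """Smallest spacing between consecutive points, with 0 and 1 as interval endpoints."""
    points = [0.0] + sorted(samples) + [1.0]
    return min(b - a for a, b in zip(points, points[1:]))

def min_gap_lower_tail(s, n_intervals):
    """P(S_(1) <= s) = 1 - (1 - N s)_+^(N - 1) for N spacings of uniform points."""
    return 1.0 - max(1.0 - n_intervals * s, 0.0) ** (n_intervals - 1)

# A replicated value produces a zero gap, whose lower-tail probability is zero.
samples = [0.3, 0.3, 0.8]
s = min_gap(samples)
print(s, min_gap_lower_tail(s, len(samples) + 1))
```

A replicated data item drives the min-gap to zero, so under this form the lower-tail probability vanishes and the duplicate is flagged at any significance level.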
Computational Considerations
In terms of computational complexity, both the $\chi^2$ and max-gap tests can be evaluated in optimal O(N) time and O(N) space. This complexity is achieved for max-gap by the use of the Gonzalez algorithm [20, 21] to determine the max-gap in linear time without sorting. The Gonzalez algorithm performs a special binning which guarantees by the pigeonhole principle that the max-gap data items will be found as the maximum and minimum values, respectively, in consecutive nonempty bins. This algorithm allows the max-gap test to be evaluated in optimal O(N) time and space, i.e., the same as $\chi^2$, and it is as efficiently parallelizable as the $\chi^2$ test (the max-gap and $\chi^2$ tests are both highly amenable to parallelization, with O(N/P) time complexity on P processors).
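The binning idea can be sketched as follows. This is a common textbook formulation of the pigeonhole argument rather than a transcription of [20, 21], and the function name is hypothetical: with n values and bins of width (max - min)/(n - 1), the largest gap is at least one bin width, so it must span two consecutive nonempty bins, and only each bin's minimum and maximum need to be kept.

```python
import random

def max_gap_linear(values):
    """Largest gap between consecutive-rank values in O(n) time, without sorting."""
    n = len(values)
    if n < 2:
        return 0.0
    lo, hi = min(values), max(values)
    if lo == hi:
        return 0.0
    width = (hi - lo) / (n - 1)                # the max-gap is at least this large
    bin_min = [None] * n
    bin_max = [None] * n
    for x in values:
        i = min(int((x - lo) / width), n - 1)  # clamp the top value into the last bin
        if bin_min[i] is None or x < bin_min[i]:
            bin_min[i] = x
        if bin_max[i] is None or x > bin_max[i]:
            bin_max[i] = x
    best, prev_max = 0.0, None
    for i in range(n):
        if bin_min[i] is None:
            continue                           # empty bins are skipped
        if prev_max is not None:
            best = max(best, bin_min[i] - prev_max)
        prev_max = bin_max[i]
    return best

rng = random.Random(7)                         # fixed seed, for repeatability only
vals = [rng.random() for _ in range(1000)]
print(max_gap_linear([0.0] + vals + [1.0]))    # endpoints included, as in the max-gap test
```

Each pass over the data is independent per element apart from the per-bin minimum and maximum updates, which is what makes the computation straightforward to split across processors.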
The min-gap pair needed to implement a min-gap test can be identified in optimal expected O(N) time and space using Rabin's randomized closest-pair algorithm [22, 23]. Unlike the Gonzalez algorithm for max-gap, Rabin's algorithm generalizes efficiently to higher dimensions.
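A one-dimensional sketch in the spirit of the randomized closest-pair methods is shown below. It is not Rabin's original algorithm: it follows the simpler randomized incremental grid idea (insert points in random order, hash them into cells whose width equals the current closest distance, and rebuild the grid whenever a closer pair appears). The function name and structure are illustrative assumptions.

```python
import random

def min_gap_randomized(values, rng=None):
    """Smallest distance between any two values, by randomized incremental grid hashing."""
    rng = rng or random.Random()
    pts = list(values)
    if len(pts) < 2:
        raise ValueError("need at least two values")
    rng.shuffle(pts)
    delta = abs(pts[0] - pts[1])
    if delta == 0.0:
        return 0.0

    def build(prefix, cell):
        grid = {}
        for q in prefix:
            grid.setdefault(int(q // cell), []).append(q)
        return grid

    grid = build(pts[:2], delta)
    for i in range(2, len(pts)):
        p = pts[i]
        c = int(p // delta)
        # Any earlier point closer than delta must lie in cell c or a neighbouring cell.
        nearest = min(
            (abs(p - q) for j in (c - 1, c, c + 1) for q in grid.get(j, ())),
            default=delta,
        )
        if nearest == 0.0:
            return 0.0                          # duplicate value found
        if nearest < delta:
            delta = nearest
            grid = build(pts[: i + 1], delta)   # rebuild with the finer cell width
        else:
            grid.setdefault(c, []).append(p)
    return delta

rng = random.Random(3)                          # fixed seed, for repeatability only
vals = [rng.random() for _ in range(2000)]
print(min_gap_randomized(vals, random.Random(5)))
```

Because the insertion order is random, rebuilds are rare on average, which is the source of the expected linear running time; the same cell-hashing idea carries over to points in two or more dimensions by using a grid of cells instead of intervals.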
Discussion and Future Work
We have defined and developed a max-gap test for distinguishing deviations from uniformity in a 1D dataset of size N. By using Gonzalez's algorithm, we have shown that this test can be performed with efficiency commensurate with that of the conventional $\chi^2$ test, both serially and in parallel. Our experiments demonstrate that the max-gap test provides improved sensitivity in two particular applications of relevance to data integrity verification. More generally, the proposed max-gap and min-gap tests are of potential value as alternatives to, or complements of, $\chi^2$ for distribution testing and discrimination.
There are many statistical tests for equality of distributions beyond the $\chi^2$ test, such as the Kolmogorov-Smirnov test [24, 25, 26] and the Cramer-von Mises test [27]. Of course there can be no test that is uniformly superior to all others for all possible distributions, but it appears that most of the standard tests examined in the literature would be challenged similarly to the $\chi^2$ test to distinguish uniform from regularly distributed data.
Potential future work could consider tests which jointly combine gap and $\chi^2$ statistics into a more sophisticated single test [28] which allows greater flexibility to optimize the sensitivity and false-alarm tradeoff for problems of high practical interest, e.g., big data analytics and integrity verification. On the algorithmic side, we have pointed out that the Gonzalez algorithm does not generalize to higher dimensions; however, relatively efficient subquadratic algorithms do exist for solving the largest empty circle and largest empty rectangle problems in two dimensions, such as the algorithms in [29, 30]. Tests on 2D distributions could also potentially exploit information about the largest empty region of a Voronoi decomposition or the distribution of nearest-neighbor distances from a Delaunay triangulation. In d > 2 dimensions, it may be possible to devise gap-related statistical tests based on results from efficient algorithms for identifying approximations to the largest empty d-sphere or d-rectangle, but this is purely speculative. In higher dimensions, it may be better to abandon gap-type statistics and focus on statistics gleaned from efficiently computable k-d and orthant (quad, octant, etc.) tree decompositions of point sets.
If computational efficiency is less of a concern, a perhaps more fruitful direction for highly sensitive distribution testing in high dimensions is to examine the length of the Euclidean minimum spanning tree (EMST) for a dataset. The expected length of the EMST of uniformly distributed points can be determined using analysis similar to what has been described in this paper for estimating the expected values for the max and min gaps in 1D, and we conjecture that EMST length is likely to be more sensitive to many practically important types of deviations from uniformity than the conventional $\chi^2$ test. Such an EMST test would be computationally expensive (though subquadratic), but this cost could be justified in applications for which subtle deviations are critically important, e.g., high-fidelity physics simulations.
REFERENCES
[1] M. Nigrini and J. Wells, Benford's Law: Applications for Forensic Accounting, Auditing, and Fraud Detection, ser. Wiley Corporate F&A. Wiley, 2012. [Online]. Available: http://books.google.com/books?id=FdRPh787I7oC
[2] C. Winter, M. Schneider, and Y. Yannikos, "Model-based digit analysis for fraud detection overcomes limitations of Benford analysis," Seventh International Conference on Availability, Reliability and Security, vol. 0, pp. 255-261, 2012.
[3] S. Dlugosz and U. Müller-Funk, "The value of the last digit: Statistical fraud detection with digit analysis," Advances in Data Analysis and Classification, vol. 3, no. 3, pp. 281-290, 2009. [Online]. Available: http://dx.doi.org/10.1007/s1163400900485
[4] R. Pirracchio, M. Resche-Rigon, S. Chevret, and D. Journois, "Do simple screening statistical tools help to detect reporting bias?" Annals of Intensive Care, vol. 3, no. 1, 2013. [Online]. Available: http://dx.doi.org/10.1186/21105820329
[5] R. J. Bolton and D. J. Hand, "Statistical fraud detection: A review," Statistical Science, vol. 17, no. 3, pp. 235-255, August 2002. [Online]. Available: http://dx.doi.org/10.1214/ss/1042727940
[6] N. Kingston and A. Clark, Test Fraud: Statistical Detection and Methodology, ser. Routledge Research in Education. Taylor & Francis, 2014. [Online]. Available: http://books.google.com/books?id=3fzpAwAAQBAJ
[7] M. Nigrini, Forensic Analytics: Methods and Techniques for Forensic Accounting Investigations, ser. Wiley Corporate F&A. Wiley, 2011. [Online]. Available: http://books.google.com/books?id=ct9CB4eJCXYC
[8] A. Diekmann, Methodological Artefacts, Data Manipulation and Fraud in Economics and Social Science, ser. Jahrbucher Nationalokonomie und Statistik. Lucius & Lucius, 2011. [Online]. Available: http://books.google.com/books?id=vzJlczjAz4sC
[9] L. Leemann and D. Bochsler, "A systematic approach to study electoral fraud," Electoral Studies, vol. 35, no. 0, pp. 33-47, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0261379414000390
[10] Y. Fujii, "The analysis of 168 randomised controlled trials to test data integrity," Anaesthesia, vol. 67, no. 6, pp. 669-670, 2012. [Online]. Available: http://dx.doi.org/10.1111/j.13652044.2012.07189.x
[11] S. Al-Marzouki, S. Evans, T. Marshall, and I. Roberts, "Are these data real? Statistical methods for the detection of data fabrication in clinical trials," BMJ, vol. 331, no. 7511, pp. 267-270, 2005.
[12] U. Simonsohn, "Just post it: The lesson from two cases of fabricated data detected by statistics alone," Psychological Science, vol. 24, no. 10, pp. 1875-1888, 2013.
[13] M. Haggstrom. (2010) Complement of chi-square cumulative distribution. [Online]. Available: http://en.wikipedia.org/wiki/File:Chisquare\distributionCDFEnglish.png
[14] J. H. Pitt and H. Z. Hill, "Statistical detection of potentially fabricated data: A case study," ArXiv e-prints, November 2013.
[15] H. Z. Hill and J. H. Pitt, "Failure to replicate: A sign of scientific misconduct?" Publications, vol. 2, no. 3, pp. 71-82, 2014. [Online]. Available: http://www.mdpi.com/23046775/2/3/71
[16] D. A. Darling, "On a class of problems related to the random division of an interval," The Annals of Mathematical Statistics, vol. 24, no. 2, pp. 239-253, June 1953.
[17] R. Pyke, "Spacings," Journal of the Royal Statistical Society, pp. 395-449, 1965.
[18] R. Pyke, "Spacings revisited," pp. 417-427, 1972.
[19] L. Holst, "On the lengths of the pieces of a stick broken at random," Journal of Applied Probability, pp. 623-634, September 1980.
[20] T. Gonzalez, "Algorithms on sets and related problems," Department of Computer Science, University of Oklahoma, Norman, OK, Tech. Rep., 1975.
[21] T. Gonzalez, "Clustering to minimize the maximum intercluster distance," Theoretical Computer Science, vol. 38, pp. 293-306, 1985.
[22] M. Golin, R. Raman, C. Schwarz, and M. Smid, "Simple randomized algorithms for closest pair problems," Nordic Journal of Computing, vol. 2, no. 1, pp. 3-27, March 1995.
[23] R. Lipton, "Rabin flips a coin," in The P=NP Question and Gödel's Lost Letter. Springer US, 2010, pp. 77-80.
[24] Z. W. Birnbaum and F. H. Tingey, "One-sided confidence contours for probability distribution functions," The Annals of Mathematical Statistics, vol. 22, no. 4, pp. 592-596, December 1951. [Online]. Available: http://dx.doi.org/10.1214/aoms/1177729550
[25] W. Conover, Practical Nonparametric Statistics, ser. Wiley Series in Probability and Statistics: Applied Probability and Statistics. Wiley, 1999. [Online]. Available: https://books.google.com/books?id=dYEpAQAAMAAJ
[26] G. Marsaglia, W. W. Tsang, and J. Wang, "Evaluating Kolmogorov's distribution," Journal of Statistical Software, vol. 8, no. 18, pp. 1-4, 2003. [Online]. Available: http://www.jstatsoft.org/index.php/jss/article/view/v008i18
[27] H. Cramer, "On the composition of elementary errors," Scandinavian Actuarial Journal, vol. 1928, no. 1, pp. 13-74, 1928. [Online]. Available: http://dx.doi.org/10.1080/03461238.1928.10416862
[28] D. Maynes, "Combining statistical evidence for increased power in detecting cheating," presentation given at the annual meeting of the National Council of Measurement in Education, San Diego, CA, April 2009.
[29] B. Chazelle, R. L. Drysdale, and D. T. Lee, "Computing the largest empty rectangle," SIAM Journal of Computing, vol. 15, no. 1, pp. 300-315, February 1986.
[30] A. Naamad, D. Lee, and W. Hsu, "On the maximum empty rectangle problem," Discrete Applied Mathematics, vol. 8, no. 3, pp. 267-277, 1984.