Categorical Data Analysis
Radu Trîmbițaș
5/6/2020
The χ² test for proportions

Example 1. Suppose we want to test whether a die is fair or loaded. The die is thrown
several times, and if each face comes up in about 1/6 of the throws, the die can be assumed to be fair.
Throwing the die 60 times gives the frequencies
Number       1   2   3   4   5   6
Occurrences  7  12  10  12   8  11
Test whether the die is fair at α = 5%.

Number<-1:6;
Occurences<-c(7, 12, 10, 12, 8, 11);
chisq.test(Occurences)
##
## Chi-squared test for given probabilities
##
## data: Occurences
## X-squared = 2.2, df = 5, p-value = 0.8208
qu<-qchisq(0.95,5); qu
## [1] 11.0705
The value of the statistic (2.2) is smaller than the quantile χ²_{1−α} = 11.0705 and p = 0.8208 > α, so we accept the null hypothesis.
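The decision rule can be reproduced by hand, which shows where the numbers printed by chisq.test come from. This is a minimal sketch assuming equal cell probabilities of 1/6; the names observed, expected, x2, p_value, critical and reject are illustrative and do not appear in the original code.

```r
# Observed counts for faces 1..6 (60 throws), as in Example 1
observed <- c(7, 12, 10, 12, 8, 11)

# Under H0 (fair die) every face is expected n/6 = 10 times
expected <- rep(sum(observed) / 6, 6)

# Pearson statistic: sum over cells of (O - E)^2 / E
x2 <- sum((observed - expected)^2 / expected)

# Upper-tail p-value and 5% critical value, with df = 6 - 1 = 5
p_value  <- pchisq(x2, df = 5, lower.tail = FALSE)
critical <- qchisq(0.95, df = 5)

# TRUE would mean "reject H0 at the 5% level"
reject <- x2 > critical
print(c(statistic = x2, p.value = p_value, critical = critical))
```

Comparing x2 with critical and comparing p_value with α are two views of the same decision, so they always agree.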
We now give an intuitive graphical representation of the test.
x<-seq(from=0,to=20,by=0.25);
y<-dchisq(x,5)
plot(x,y,type="l")
abline(h=0)
segments(x0=2.2,y0=dchisq(2.2,5),x1=2.2,y1=0)
text(2.8, dchisq(2.2,5)/2, expression(chi^2))
xc<-c(qu,x[x>qu]);
yc=dchisq(xc,5)
xcg<-c(xc,20,qu)
ycg<-c(yc,0,0)
polygon(xcg,ycg,col="red")
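[Figure: density of the χ² distribution with 5 degrees of freedom for x from 0 to 20; the statistic χ² = 2.2 is marked by a vertical segment and the rejection region to the right of 11.07 is shaded red.]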
The value of the test statistic lies outside the rejection region, so the null hypothesis is accepted. We
cannot claim that the die is loaded.
Example 2. The Mendelian theory of heredity states that when two varieties of pea are crossed,
the frequencies of round and yellow, wrinkled and yellow, round and green, and wrinkled and green seeds occur in the ratio 9:3:3:1.
When he tested this theory, Mendel obtained the frequencies 315, 101, 108 and 32, respectively. Do these sample
data allow us to reject the theory at the 5% significance level?
c(315, 101, 108, 32)->freq
chisq.test(freq,p=c(9,3,3,1)/16)
##
## Chi-squared test for given probabilities
##
## data: freq
qchisq(0.95,length(freq)-1)
## [1] 7.814728
Since p > α (the statistic, about 0.47, is far below the quantile 7.814728), we cannot reject the Mendelian theory.
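A by-hand version of the same computation makes the statistic and the p-value explicit. It is only a sketch under the 9:3:3:1 hypothesis; the names probs, expected, x2, df and p_value are illustrative additions.

```r
# Mendel's observed frequencies and the hypothesised 9:3:3:1 proportions
freq  <- c(315, 101, 108, 32)
probs <- c(9, 3, 3, 1) / 16

# Expected counts under H0: n * p_i
expected <- sum(freq) * probs

# Pearson statistic and its upper-tail p-value, df = 4 - 1 = 3
x2      <- sum((freq - expected)^2 / expected)
df      <- length(freq) - 1
p_value <- pchisq(x2, df, lower.tail = FALSE)

print(rbind(observed = freq, expected = expected))
print(c(statistic = x2, df = df, p.value = p_value))
```

The observed and expected rows are very close to each other, which is why the statistic is so small.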
Contingency tables
Testing independence
Example 3. Suppose we want to classify the defects found in furniture produced by a manufacturer
according to (1) the type of defect and (2) the production shift. A total of n = 309 defects were
recorded and classified into one of the 4 types A, B, C or D. At the same time, each piece of furniture was
identified by the shift in which it was produced. The frequencies appear in the table M. Our goal
is to test the null hypothesis that the type of defect is independent of the shift against the alternative that the
two classification criteria are dependent.
                      Defect type
Shift   A           B           C           D           Total
1       15 (22.51)  21 (20.99)  45 (38.94)  13 (11.56)   94
2       26 (22.99)  31 (21.44)  34 (39.77)   5 (11.81)   96
3       33 (28.50)  17 (26.57)  49 (49.29)  20 (14.63)  119
Total   74          69          128         38          309
(expected counts under independence in parentheses)
M<-as.table(rbind(c(15, 21, 45, 13), c(26, 31, 34, 5), c(33, 17, 49, 20)));
dimnames(M) <- list(shift = 1:3,
defect = c("A","B", "C", "D"))
M
## defect
## shift A B C D
## 1 15 21 45 13
## 2 26 31 34 5
## 3 33 17 49 20
margin.table(M,1)
## shift
## 1 2 3
## 94 96 119
margin.table(M,2)
## defect
## A B C D
## 74 69 128 38
(Xsq <- chisq.test(M)) # Prints test summary
##
## Pearson's Chi-squared test
##
## data: M
Xsq$observed # observed counts (same as M)
## defect
## shift A B C D
## 1 15 21 45 13
## 2 26 31 34 5
## 3 33 17 49 20
Xsq$expected # expected counts under the null
## defect
## shift A B C D
## 1 22.51133 20.99029 38.93851 11.55987
## 2 22.99029 21.43689 39.76699 11.80583
## 3 28.49838 26.57282 49.29450 14.63430
Xsq$residuals # Pearson residuals
## defect
## shift A B C D
## 1 -1.58312831 0.00211911 0.97138105 0.42356986
## 2 0.62770015 2.06546616 -0.91450874 -1.98076347
## 3 0.84325427 -1.85703848 -0.04194534 1.40261994
Xsq$stdres # standardized residuals
## defect
## shift A B C D
## 1 -2.176313381 0.002882619 1.521561771 0.542225141
## 2 0.866935883 2.822806706 -1.439187071 -2.547514417
## 3 1.233122714 -2.687181259 -0.069891776 1.910016157
Since p = 0.003873 < α, the null hypothesis is rejected. Shift and defect type are not independent.
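The expected counts and residuals reported by chisq.test follow from the row and column totals alone. The sketch below rebuilds them for the defect table; the names expected, x2, df, p_value and pearson are illustrative and not part of the original code.

```r
# Defects by shift (rows) and type (columns), as in Example 3
M <- rbind(c(15, 21, 45, 13),
           c(26, 31, 34,  5),
           c(33, 17, 49, 20))
n <- sum(M)

# Under independence: E_ij = (row total i) * (column total j) / n
expected <- outer(rowSums(M), colSums(M)) / n

# Pearson statistic with (r - 1) * (c - 1) degrees of freedom
x2      <- sum((M - expected)^2 / expected)
df      <- (nrow(M) - 1) * (ncol(M) - 1)
p_value <- pchisq(x2, df, lower.tail = FALSE)

# Pearson residuals: signed contribution of each cell to the statistic
pearson <- (M - expected) / sqrt(expected)

print(round(expected, 2))
print(c(statistic = x2, df = df, p.value = p_value))
```

The squared Pearson residuals add up to the statistic, so the cells with the largest absolute residuals (shift 2 with types B and D, shift 3 with type B) are the ones that drive the rejection.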
Testing homogeneity
A survey was carried out on the attitude of citizens in four wards toward candidate A. Random samples
of 200 voters were drawn from each ward, and the results appear in Table 2. Do the data allow us
to claim that the proportions of citizens who favor A differ between wards?
                          Ward
Opinion   1          2          3          4          Total
For       76 (59)    53 (59)    59 (59)    48 (59)    236
Against   124 (141)  147 (141)  141 (141)  152 (141)  564
Total     200        200        200        200        800
(expected counts under homogeneity in parentheses)
M<-as.table(rbind(c(76, 53, 59, 48), c(124, 147, 141, 152)));
dimnames(M) <- list(opinion=c("favor A","do not favor A"), Ward=1:4)
M
## Ward
## opinion 1 2 3 4
## favor A 76 53 59 48
## do not favor A 124 147 141 152
margin.table(M,1)
## opinion
## favor A do not favor A
## 236 564
margin.table(M,2)
## Ward
## 1 2 3 4
## 200 200 200 200
(Xsq <- chisq.test(M)) # Prints test summary
##
## Pearson's Chi-squared test
##
## data: M
Xsq$observed # observed counts (same as M)
## Ward
## opinion 1 2 3 4
## favor A 76 53 59 48
## do not favor A 124 147 141 152
Xsq$expected # expected counts under the null
## Ward
## opinion 1 2 3 4
## favor A 59 59 59 59
## do not favor A 141 141 141 141
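As a cross-check, the statistic for the homogeneity test can be recomputed directly from the observed and expected counts. This is a sketch under the usual assumption that the test has (2 − 1)(4 − 1) = 3 degrees of freedom; the names expected, x2, df, p_value and critical are illustrative.

```r
# Opinions (rows: for, against) by ward (columns 1..4)
M <- rbind(c( 76,  53,  59,  48),
           c(124, 147, 141, 152))

# Under homogeneity every ward shares the pooled proportions 236/800 and 564/800
expected <- outer(rowSums(M), colSums(M)) / sum(M)

# Pearson statistic, degrees of freedom, p-value and 5% critical value
x2       <- sum((M - expected)^2 / expected)
df       <- (nrow(M) - 1) * (ncol(M) - 1)
p_value  <- pchisq(x2, df, lower.tail = FALSE)
critical <- qchisq(0.95, df)

print(c(statistic = x2, df = df, p.value = p_value, critical = critical))
```

With these counts the statistic is about 10.72, above the 5% critical value of about 7.81, which would lead to rejecting the hypothesis that the proportion favoring A is the same in all four wards.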
Goodness-of-fit tests
Decay of the isotope Americium-241 with the χ² goodness-of-fit test. The decay is assumed to follow
a Poisson distribution. The data are given below: for intervals of 10 seconds each, we have the number of emissions
per interval (0 to 19) and the number of intervals in which each count was observed.
emissions <- 0:19
observed <- c(1, 4, 13, 28, 56, 105, 126, 146, 164, 161, 123, 101, 74, 53, 23, 15, 9, 3, 1, 1)
total <- sum(observed);
Estimated λ
lambda <- sum(emissions*observed)/total; lambda
## [1] 8.367026
expected <- dpois(emissions,lambda)  # Poisson probabilities for 0..19 emissions
n<-length(observed)
# merge the first 3 and the last 3 cells (0-2 and 17-19 emissions)
obsnew<-c(sum(observed[1:3]),observed[4:(n-3)],sum(observed[(n-2):n])); obsnew
## [1] 18 28 56 105 126 146 164 161 123 101 74 53 23 15 9 5

expnew<-c(ppois(2,lambda),expected[4:(n-3)],ppois(16,lambda,lower.tail=FALSE))*total; expnew
## [1] 12.446560 27.385220 57.283211 95.858019 133.674418 159.779613
## [7] 167.110015 155.357088 129.987674 98.873655 68.939868 44.370896
## [13] 26.518030 14.791803 7.735212 6.888717
Since the χ² test requires the expected frequency in every cell to be > 5, we combine the first 3 and the last
3 cells into two cells. The regrouped table is given below.
data.frame(ems=c("0-2",as.character(3:16),">=17"),observed=obsnew,expected=expnew)
## ems observed expected
## 1 0-2 18 12.446560
## 2 3 28 27.385220
## 3 4 56 57.283211
## 4 5 105 95.858019
## 5 6 126 133.674418
## 6 7 146 159.779613
## 7 8 164 167.110015
## 8 9 161 155.357088
## 9 10 123 129.987674
## 10 11 101 98.873655
## 11 12 74 68.939868
## 12 13 53 44.370896
## 13 14 23 26.518030
## 14 15 15 14.791803
## 15 16 9 7.735212
## 16 >=17 5 6.888717
X2<-sum((obsnew-expnew)^2/expnew); X2
## [1] 8.949307
pval<-pchisq(X2,length(expnew)-2,lower.tail = FALSE); pval  # df = 16 cells - 1 - 1 estimated parameter (lambda)
## [1] 0.8342842
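The same test can be phrased as a comparison with a critical value, which also makes the reason for regrouping visible. The sketch below repeats the computation in a self-contained form; the 5% level and the names raw_expected, obs, prob, exp_counts, x2, df and critical are assumptions made for illustration.

```r
emissions <- 0:19
observed  <- c(1, 4, 13, 28, 56, 105, 126, 146, 164, 161,
               123, 101, 74, 53, 23, 15, 9, 3, 1, 1)
total  <- sum(observed)
lambda <- sum(emissions * observed) / total   # estimated Poisson mean

# Before regrouping, some cells have expected counts below 5
raw_expected <- total * dpois(emissions, lambda)
print(emissions[raw_expected < 5])

# Regrouped cells: 0-2, 3, ..., 16, >= 17 (probabilities sum to 1)
n    <- length(observed)
obs  <- c(sum(observed[1:3]), observed[4:(n - 3)], sum(observed[(n - 2):n]))
prob <- c(ppois(2, lambda), dpois(3:16, lambda),
          ppois(16, lambda, lower.tail = FALSE))
exp_counts <- total * prob

# One degree of freedom is lost for the total and one for estimating lambda
x2       <- sum((obs - exp_counts)^2 / exp_counts)
df       <- length(obs) - 1 - 1
critical <- qchisq(0.95, df)
print(c(statistic = x2, df = df, critical = critical))
```

Since the statistic is well below the critical value (and the p-value 0.8342842 is well above 0.05), the data give no reason to reject the Poisson model.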
plot(emissions,observed/total)
lines(emissions,dpois(emissions,lambda))
[Figure: observed relative frequencies (observed/total) plotted against emissions, with the fitted Poisson probabilities drawn as a line.]
Normality test. The data come from a tin can factory. We want to check whether they follow the
normal distribution with mean m = 170 and variance σ² = 10. We use the Kolmogorov test.
weights <- c(165.1,171.5,168.1,165.6,166.8,170.0,168.8,
171.1,168.8,173.6,163.5,169.9,165.4,174.4,
171.8,166.0,174.6,174.5,166.4,173.8)
x<-seq(from=170-3.5*sqrt(10),to=170+3.5*sqrt(10),by=7*sqrt(10)/50);
y<-dnorm(x,mean=170, sd=sqrt(10))
plot(x,y,type="l",main="density estimation for can factory",col="blue", xlab="weights")
lines(density(weights))
[Figure: "density estimation for can factory": the N(170, 10) density (blue) together with the kernel density estimate of the sample; x-axis: weights.]
#lines(x,y,col="blue")
ks.test(weights,"pnorm",170,sqrt(10))
## Warning in ks.test(weights, "pnorm", 170, sqrt(10)): ties should not be present
## for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: weights
## D = 0.19421, p-value = 0.4376
## alternative hypothesis: two-sided
The hypothesis of normality is accepted (p = 0.4376).
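The statistic D reported by ks.test is the largest vertical gap between the empirical distribution function of the sample and the hypothesised normal distribution function. A minimal sketch of that computation follows; the names Fx, d_plus, d_minus and D are illustrative.

```r
weights <- c(165.1, 171.5, 168.1, 165.6, 166.8, 170.0, 168.8,
             171.1, 168.8, 173.6, 163.5, 169.9, 165.4, 174.4,
             171.8, 166.0, 174.6, 174.5, 166.4, 173.8)
n <- length(weights)

# Hypothesised CDF evaluated at the ordered sample
Fx <- pnorm(sort(weights), mean = 170, sd = sqrt(10))

# The empirical CDF jumps from (i - 1)/n to i/n at the i-th ordered value,
# so the gap has to be checked on both sides of each jump
d_plus  <- max((1:n) / n - Fx)
d_minus <- max(Fx - (0:(n - 1)) / n)
D <- max(d_plus, d_minus)
print(D)
```

The maximum is attained at the ordered value 166.8, where the empirical distribution function reaches 7/20 = 0.35 while the normal one is only about 0.156.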
Two-sample Kolmogorov test. Two apple trees in bloom stand close to each other, one of the Redwell variety and
the other Whitney. Do the bees prefer one over the other? We followed the bees and timed how long
each bee spends in a tree. We stopped the stopwatch when the bee was more than one meter away from the
tree. We use the Kolmogorov-Smirnov test for two samples.
#apple trees
redwell=c(23.4, 30.9, 18.8, 23.0, 21.4, 1, 24.6, 23.8, 24.1, 18.7, 16.3, 20.3, 14.9, 35.4,
21.6, 21.2, 21.0, 15.0, 15.6, 24.0, 34.6, 40.9, 30.7, 24.5, 16.6, 1, 21.7, 1, 23.6,
1, 25.7, 19.3, 46.9, 23.3, 21.8, 33.3, 24.9, 24.4, 1, 19.8, 17.2, 21.5, 25.5, 23.3,
18.6, 22.0, 29.8, 33.3, 1, 21.3, 18.6, 26.8, 19.4, 21.1, 21.2, 20.5, 19.8, 26.3,
39.3, 21.4, 22.6, 1, 35.3, 7.0, 19.3, 21.3, 10.1, 20.2, 1, 36.2, 16.7, 21.1, 39.1,
19.9, 32.1, 23.1, 21.8, 30.4, 19.62, 15.5);
whitney=c(16.5, 1, 22.6, 25.3, 23.7, 1, 23.3, 23.9, 16.2, 23.0, 21.6, 10.8, 12.2, 23.6, 10.1,
24.4, 16.4, 11.7, 17.7, 34.3, 24.3, 18.7, 27.5, 25.8, 22.5, 14.2, 21.7, 1, 31.2, 13.8,
29.7, 23.1, 26.1, 25.1, 23.4, 21.7, 24.4, 13.2, 22.1, 26.7, 22.7, 1, 18.2, 28.7, 29.1,
27.4, 22.3, 13.2, 22.5, 25.0, 1, 6.6, 23.7, 23.5, 17.3, 24.6, 27.8, 29.7, 25.3, 19.9,
18.2, 26.2, 20.4, 23.3, 26.7, 26.0, 1, 25.1, 33.1, 35.0, 25.3, 23.6, 23.2, 20.2, 24.7,
22.6, 39.1, 26.5, 22.7);
ks.test(redwell,whitney,alternative=c("greater"))
## Warning in ks.test(redwell, whitney, alternative = c("greater")): cannot compute
## exact p-value with ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: redwell and whitney
## D^+ = 0.22041, p-value = 0.02102
## alternative hypothesis: the CDF of x lies above that of y
ks.test(redwell,whitney,alternative=c("less"))
## Warning in ks.test(redwell, whitney, alternative = c("less")): cannot compute
## exact p-value with ties
##
## Two-sample Kolmogorov-Smirnov test
##
## data: redwell and whitney
## D^- = 0.12421, p-value = 0.2933
## alternative hypothesis: the CDF of x lies below that of y
Fn1<-ecdf(redwell);
Fn2<-ecdf(whitney);
x=seq(from=0, to=50, by=0.5);
p<-which.max(abs(Fn1(x)-Fn2(x))); p
## [1] 45
plot(x,Fn1(x),type="s",col="red")
lines(x,Fn2(x),type="s",col="blue")
arrows(x0=x[p],y0=Fn1(x[p]),x1=x[p],y1=Fn2(x[p]),col="green",code=3,
length=0.1)
xleg=35;yleg=0.30;
legend(xleg,yleg,c("redwell", "whitney", "KS statistics"),
col=c("red","blue","green"),lty=c(1,1,1))
[Figure: empirical distribution functions of redwell (red) and whitney (blue) for x from 0 to 50; a green double-headed arrow marks the KS statistic.]
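The null hypothesis is rejected for the alternative "greater" (p = 0.02102): we have F1(x) > F2(x), where F1 and F2 are
the distribution functions of the Redwell and Whitney times. A distribution function that lies above corresponds to
stochastically shorter times, so visits to Redwell tend to be shorter; judged by time spent in the tree, the bees seem
to prefer Whitney rather than Redwell. Comment on the results of the left one-sided test.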
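The one-sided statistics D^+ and D^- are the largest positive gaps between the two empirical distribution functions, taken in each direction. The sketch below computes them on two small hypothetical samples (the vectors x and y are invented for illustration and are not the bee data); evaluating at the pooled sample points gives the exact maximum, whereas a fixed grid such as seq(0, 50, by = 0.5) only approximates it.

```r
# Hypothetical samples, chosen so that the two ECDFs cross
x <- c(1.2, 2.5, 3.1, 6.8, 9.0)
y <- c(2.0, 3.9, 5.5, 6.1, 7.3, 8.4)

# Both ECDFs are step functions that change only at sample points,
# so the extreme gaps are attained on the pooled sample
pooled <- sort(c(x, y))
Fx <- ecdf(x)(pooled)
Fy <- ecdf(y)(pooled)

d_plus  <- max(Fx - Fy)   # alternative "greater": CDF of x lies above that of y
d_minus <- max(Fy - Fx)   # alternative "less":    CDF of x lies below that of y
D       <- max(d_plus, d_minus)  # two-sided statistic
print(c(D.plus = d_plus, D.minus = d_minus, D = D))
```

A large D^+ says that x tends to take smaller values than y, because its distribution function climbs earlier.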
