
Master MSS 2012 2013

Methods of Analysis and Forecasting for Health Management

Regression1 and correlation analysis

Making economic forecasts of the values of the endogenous variable Y as a function of different values of the exogenous variable X presupposes verifying and, where appropriate, accepting the hypothesis that the law of dependence between Y and X is correctly specified and identified, and that it is relatively stable and repeatable.

The primary aim of regression analysis is to show how a variable Y is related to one or more variables X by means of an equation that makes it possible to predict the dependent variable Y from known values of the independent variables X (x1, x2, ..., xn). In general, regression analysis makes a statistical comparison of past relationships between different factors.

Statistical dependence is a dependence that manifests itself not between individual elements and phenomena, but between collectivities (populations) of phenomena. The measures of association developed by mathematical statistics make it possible to detect and rank the statistical dependences that occur between historical phenomena and processes. Measures of statistical association thus open up the possibility of discovering the statistical regularities specific to those conditioning relationships between historical phenomena and processes that have quantifiable statistical characteristics.

Table 1. Regression analysis and correlation analysis
Regression analysis: a class of methods by which, using a regression equation determined from experimental data, the values of given variables can be estimated (predicted), assuming that the values of other variables are known or forecast.
Correlation analysis: aims to assess the degree of interdependence (association) between the variables considered in a regression model, in particular between the dependent variable and the independent ones (an objective achieved by estimating the correlation coefficients and the coefficient of determination).
The stochastic nature of the regression model means that the value of Y cannot be predicted exactly; the uncertainty arises from the random term e (the error). The probability distribution of Y and its characteristics are determined by the values of e and by its probability distribution.

The assumptions under which regression methods are applied are:
- the variables Y and X are not affected by measurement errors. The law of dependence of yi is conditional on the realized values x1, x2, ..., xn of the exogenous variable X;
- the random (residual) variable has mean 0, and its variance is independent of X (the homoscedasticity2 assumption; the relationship between Y and X is taken to be relatively stable);
- the values of the residual variable are not autocorrelated (they do not depend on one another);
- the probability law of the residual variable is the normal law with mean 0 and standard deviation Sy/x.

If these assumptions hold, the least squares method yields maximum likelihood estimators. Meeting these assumptions allows a number of statistical tests to be applied:
a. testing the significance of the estimators of the regression function (by applying statistical tests3);
1 The equation used to draw the best-fit straight line is called a regression equation. The term goes back to Sir Francis Galton (1822-1911), who showed that the heights of the children of tall or short couples tend to regress, or revert, toward the mean height of the population.
2 Homoscedasticity is a property of the variance of the disturbance term in a regression equation whereby this variance remains constant across all observed cases (a condition required for the least squares estimator to be the best linear unbiased estimator).


b. checking the plausibility of the fitted model;
c. producing forecasts based on a confidence interval.

In general, predictions based on regression analysis refer to:
- conditional mean values of the dependent variable (conditional on given or forecast values of the independent variables);
- individual values of the dependent variable Y.

Both types of prediction are obtained from the regression equation determined from the experimental data. They yield the same numerical values; the difference lies in the meaning of these values and in the precision of the resulting estimates. When estimating an individual value of the dependent variable, the precision is lower than when estimating a conditional mean of that variable.

Prediction errors are assessed using confidence-interval estimates; such an estimate is better the shorter the interval and the closer the confidence level is to 1.

Statistical interpretation of the regression results

Data for the linear model $Y = a + b \cdot x$

Data series:
- for the explanatory / independent / exogenous variable: x1, x2, ..., xn
- for the explained / dependent / endogenous variable: y1, y2, ..., yn

Computing the coefficient a:
$$a = \bar{y} - b\,\bar{x}$$

Computing the coefficient b:
$$b = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \quad\text{or}\quad b = \frac{\sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n} (x_i-\bar{x})^2}$$

Computing the fitted values:
$$\hat{Y}_i = a + b\,x_i$$
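The formulas for a, b and the fitted values can be sketched in a few lines of Python. This is only an illustration: the function names `fit_line` and `slope_from_raw_sums` and the data values are assumptions made for the example, not part of the course material.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares estimates (a, b) for y = a + b*x, deviation form."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

def slope_from_raw_sums(x, y):
    """Same slope b, computed from the raw-sum form of the formula."""
    n = len(x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    return (n * sum_xy - sum(x) * sum(y)) / (n * sum_x2 - sum(x) ** 2)

# Illustrative data, not taken from the course notes
x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.2, 7.9, 10.1]

a, b = fit_line(x, y)
fitted = [a + b * xi for xi in x]  # fitted values Y_i = a + b*x_i
print(f"a = {a:.4f}, b = {b:.4f}")
```

Both forms of the slope formula should agree up to rounding, which is a convenient check when the computation is done by hand or in a spreadsheet.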

In general, a confidence interval with confidence level $\gamma$, $\gamma \in (0,1)$, for a numerical characteristic $\theta$ of a random variable is an interval of real numbers of the form
$$(\hat{\theta} - t\,\sigma,\ \hat{\theta} + t\,\sigma)$$
where: $\hat{\theta}$ is an estimate of the characteristic of interest, $\sigma$ is a measure of the spread of the possible estimates, and $t$ is taken from the tables of the usual probability distributions.

The endpoints $\hat{\theta} \pm t\sigma$ of a confidence interval with confidence level $\gamma$ are set so that one can say there is a $100\gamma\%$ chance that the estimate deviates by at most $t\sigma$ from the true value of the characteristic (equivalently, there is a $100(1-\gamma)\%$ chance of committing an error larger than $t\sigma$). For this reason the confidence level is chosen close to 1 (as a rule, 0.95 or 0.99), which is equivalent to the difference $\alpha = 1-\gamma$ (also called the significance threshold / level) being close to zero.

Correlation analysis aims at:
3 A test statistic is a quantity computed in order to test hypotheses. Under the null hypothesis H0, this statistic follows a probability distribution that it would not follow under the alternative hypothesis. The further the value of the test statistic lies beyond the critical values of that distribution, the less plausible it is that the null hypothesis is true.


- measuring the degree of interdependence between the dependent variable Y and the independent variables Xi, an interdependence explained by the regression equation used;
- assessing the degree of association between the independent variables, when the regression equation contains at least two independent variables Xi.

Correlation shows the extent to which two variables are related to each other. The strength of the relationship is expressed by two indicators:

- the correlation coefficient (R), which measures the strength of the linear dependence by a numerical value between -1 and 1:
$$R = \frac{\sum x_k y_k - n\,\bar{x}\,\bar{y}}{\sqrt{\left(\sum x_k^2 - n\,\bar{x}^2\right)\left(\sum y_k^2 - n\,\bar{y}^2\right)}}$$
  o If R = 0, there is no linear correlation between Y and X (but other types of dependence may exist, for example nonlinear ones).
  o If R > 0 and close to 1, increases in the factor X are accompanied by increases in Y.
  o If R < 0 and close to -1, increases in the factor X are accompanied by decreases in Y.

- the coefficient of determination (R2), which measures the relative reduction in the variation of Y that can be attributed to knowing the factors Xi and the relationship Y = f(X):
$$R^2 = \frac{\Delta^2_{exp}}{\Delta^2_{tot}} = 1 - \frac{n-2}{n-1}\cdot\frac{S^2_{y/x}}{S^2_{y}}$$
where $\Delta^2_{exp}$ and $\Delta^2_{tot}$ are the explained and total sums of squared deviations, $S^2_{y/x}$ is the residual variance (with n-2 degrees of freedom) and $S^2_{y}$ is the sample variance of Y (with n-1 degrees of freedom).
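As a rough illustration of the correlation formula above, the sketch below computes R for an increasing and a decreasing series. The function name `pearson_r` and the data are assumptions for the example; for simple linear regression the coefficient of determination equals the square of R.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Correlation coefficient R from the raw-sum formula."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    numerator = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    denominator = sqrt(
        (sum(xi ** 2 for xi in x) - n * x_bar ** 2)
        * (sum(yi ** 2 for yi in y) - n * y_bar ** 2)
    )
    return numerator / denominator

# Illustrative data (assumed): Y rises with X, then the same values reversed
x = [1, 2, 3, 4, 5]
y_up = [2.2, 4.1, 6.2, 7.9, 10.1]
y_down = list(reversed(y_up))

r_up = pearson_r(x, y_up)      # close to +1: X and Y increase together
r_down = pearson_r(x, y_down)  # close to -1: Y decreases as X increases
r2_up = r_up ** 2              # coefficient of determination (simple regression)
print(f"R = {r_up:.4f}, R (reversed) = {r_down:.4f}, R2 = {r2_up:.4f}")
```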

For example, a value R2 = 0.76 indicates that approximately 76% of the total variation of Y can be explained by the independent variables X included in the model (a value of at least 0.8 is considered acceptable).

Example: =CORREL(values for x, values for y)

The adjusted coefficient of determination $\bar{R}^2$ is used when the number of observations is small relative to the number of estimated coefficients. (If the number of observations equals the number of coefficients, every observation lies on the regression function, so the sample must be large enough for the regression coefficients to be estimated meaningfully.)
$$\bar{R}^2 = R^2 - \frac{k-1}{n-k}\,(1 - R^2)$$

where: n is the number of actual observations and k is the number of regression coefficients.

In multiple regression, R2 or R is a measure of the combined effect of the whole set of independent variables on the dependent variable.

Example: =RSQ(values for y, values for x)

Statistical significance of the model parameters

The t (Student) distribution4 is used in hypothesis tests on small samples in which the variance of the variable in question has to be estimated from the data. It is a bell-shaped probability distribution with mean zero, in which the spread of the values around the mean depends on the degrees of freedom5 dictated by the sample size.

4 The t test is the test most often used in quantitative economic analysis. The t statistic is defined as the ratio of a standard normal variable to the square root of a $\chi^2$ variable divided by its number of degrees of freedom.

Degrees of freedom indicate the number of pieces of information that can vary independently of one another; a sample of n observations is said to have n degrees of freedom. For example, computing a simple sample mean entails the loss of one degree of freedom, because independent variations in n-1 of the sample observations require a compensating change in the n-th observation if the sample mean is to be preserved. Likewise, computing the values of k parameters in an econometric model entails the loss of k degrees of freedom, leaving (n-k).

If the errors are normally distributed, about 68% of the values of y are expected to lie within $\sigma_e$ (the standard error of prediction) of the mean, about 95% within $2\sigma_e$, and about 99.7% within $3\sigma_e$.

Each estimated parameter has a standard error, because it is determined from a sample of data; another sample would probably lead to different values of the model parameters. The approximate value of the t statistic used to test the significance of the model coefficients is computed as:
$$t = \frac{\text{estimated coefficient} - \text{hypothesized value of the coefficient}}{\text{estimated standard error of the coefficient}}$$

Any coefficient for which $|t_{calc}| < 2.0$ is excluded from the model. Any coefficient for which $|t| > 2.0$ is different from zero at a significance level of approximately 5%. Including in the model coefficients whose t statistics are substantially smaller than 2.0 in absolute value increases the number of model parameters and reduces the precision of the prediction.

Table 1. Interpretation of p-values
p-value | Interpretation
p < 0.01 | Strong evidence against H0
0.01 < p < 0.05 | Moderate evidence against H0
0.05 < p < 0.1 | Weak evidence against H0 (the evidence suggests H0 is false)
0.1 < p | No evidence against H0
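The rule of thumb and the p-value table above can be written as small helper functions. This is a minimal sketch: the function names, the threshold argument and the numbers used in the example are assumptions chosen for illustration.

```python
def t_statistic(estimate, std_error, hypothesized=0.0):
    """t = (estimated coefficient - hypothesized value) / standard error."""
    return (estimate - hypothesized) / std_error

def keep_coefficient(t_value, threshold=2.0):
    """Rule of thumb from the text: keep a coefficient only if |t| >= 2."""
    return abs(t_value) >= threshold

def interpret_p(p):
    """Verbal interpretation of a p-value, following the table above."""
    if p < 0.01:
        return "strong evidence against H0"
    if p < 0.05:
        return "moderate evidence against H0"
    if p < 0.1:
        return "weak evidence against H0"
    return "no evidence against H0"

# Hypothetical coefficient estimates and standard errors
t_strong = t_statistic(1.96, 0.04)  # large |t|: coefficient is kept
t_weak = t_statistic(0.5, 0.4)      # |t| = 1.25: candidate for exclusion
print(keep_coefficient(t_strong), keep_coefficient(t_weak))
print(interpret_p(0.03))
```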

Simple regression method

For simple linear regression:
X is the explanatory variable (predictor);
Y is the explained variable (response).

$$Y_i = a + b\,x_i, \qquad b = r\,\frac{SD_Y}{SD_X}, \qquad a = \bar{y} - b\,\bar{x}$$

where:
5 Degrees of freedom often enter as parameters of probability distributions (the t or $\chi^2$ distribution), whose shape they can fundamentally affect.

r is the correlation coefficient between X and Y; SDY and SDX are the standard deviations of the variables Y and X.

To assess the significance of the estimators:
- for a data set of size n ≤ 30, the t (Student) test with n-2 degrees of freedom6 is applied;
- for n > 30, the z test7 of the normal distribution8 is applied,
formulating the hypotheses:
H0: a = 0 and b = 0
Ha: a ≠ 0 and b ≠ 0

If
$$t_{calc}(a) = \frac{|a - 0|}{\sigma_a} > t_{\alpha} \quad\text{and}\quad t_{calc}(b) = \frac{|b - 0|}{\sigma_b} > t_{\alpha}$$
then hypothesis H0 is rejected and a and b are judged to be statistically significant.

Rule: if abs(t_calculated) > t_tabulated, then H0 is rejected.

Note: the tabulated values of t are obtained with the function =TINV(probability, degrees of freedom).
Example: =TINV(0.05;15)
t_tab (95%) 2.131449536
or =TINV(0.10;15)
t_tab (90%) approximately 1.7531

The t distribution is a family of distributions that depend on the sample size. For a sample containing more than 30 measurements, the t distribution becomes nearly identical to the normal distribution, so for large samples either distribution (z or t) can be used to compute the confidence interval.

The confidence interval for the mean of samples with a small number of measurements (normally or approximately normally distributed):
Upper confidence limit = mean + t value * standard deviation / square root of n.
Lower confidence limit = mean - t value * standard deviation / square root of n.
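The two confidence limits above can be sketched as follows. The sample values are hypothetical; the t value is the one quoted in the text for =TINV(0.05;15), which matches a sample of 16 measurements (15 degrees of freedom). The function name `mean_confidence_interval` is an assumption for the example.

```python
from math import sqrt
from statistics import mean, stdev

def mean_confidence_interval(sample, t_value):
    """Return (lower, upper) = mean -/+ t * s / sqrt(n)."""
    n = len(sample)
    centre = mean(sample)
    half_width = t_value * stdev(sample) / sqrt(n)
    return centre - half_width, centre + half_width

# 16 hypothetical measurements, so df = n - 1 = 15
sample = [5.1, 4.9, 5.0, 5.2, 4.8, 5.3, 5.0, 4.7,
          5.1, 5.0, 4.9, 5.2, 5.1, 4.8, 5.0, 4.9]
T_TAB_95_DF15 = 2.131449536  # from =TINV(0.05;15), as quoted in the text

lower, upper = mean_confidence_interval(sample, T_TAB_95_DF15)
print(f"95% confidence interval for the mean: ({lower:.3f}, {upper:.3f})")
```

The t value is passed in as an argument on purpose, so that the tabulated (or Excel) value stays visible rather than being hidden inside the function.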

6 In the table of the t distribution, the values are grouped by significance level alpha and by degrees of freedom (df). To find the value of t we need the t-distribution table, the significance level and the degrees of freedom. For the mean of a single data set, the number of degrees of freedom is df = n - 1, where n is the number of values in the data set.
7 Use the function =NORMSINV in EXCEL.
8 The central limit theorem states that the sum (and the mean) of a set of independent random variables approximately follows a normal distribution if the sample is large enough, regardless of the shape of the distribution from which the individual variables come. The theorem is often used to justify the assumption of normality of the error term in econometric studies, which allows the t statistic to be used for hypothesis testing, since the error term is assumed to aggregate the sum of a set of unknown (omitted) random factors.


The value t = 1.96 is associated with a probability of 0.025 in the right tail alone, or with a probability of 0.05 in both tails together.
Example: =TINV(0.05;10000)
t_tab (95%) 1.960201185

Note: EXCEL can also be used to determine the probability p associated with the computed value of t. In this case, p = TDIST(ABS(t_calculated), degrees of freedom, 2).
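For readers without Excel, the two-tailed probability that TDIST returns can be approximated by integrating the density of the t distribution numerically. The sketch below uses Simpson's rule with the Python standard library only; the function names and the number of integration steps are assumptions, and the result is an approximation rather than a replacement for a statistics library.

```python
from math import exp, lgamma, log, pi

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + t * t / df))

def two_tailed_p(t_calc, df, steps=2000):
    """Approximate two-tailed p-value, analogous to TDIST(ABS(t), df, 2)."""
    upper = abs(t_calc)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 < T < |t|)
    return 1 - 2 * central

# The tabulated values quoted in the text should give p close to 0.05
p_df15 = two_tailed_p(2.131449536, 15)
p_large = two_tailed_p(1.960201185, 10000)
print(f"p (df=15) = {p_df15:.5f}, p (df=10000) = {p_large:.5f}")
```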


Rule: if p is greater than $\alpha$ (the significance level)9, hypothesis H0 is accepted (not rejected).

Table 2. Interpreting the risk of accepting / rejecting H0
Actual situation | Conclusion: do not reject H0 | Conclusion: reject H0
H0 is true | Correct decision | Type I error (type I risk $\alpha$)
H0 is false | Type II error (type II risk $\beta$) | Correct decision

A type I error is the rejection of the null hypothesis when it should in fact have been accepted:
- a hypothesis that is not true is confirmed / validated;
- impact: wrong conclusions that can lead to inadequate solutions / decisions.

A type II error results from accepting the null hypothesis when it should in fact be rejected:
- in effect, an important effect is ignored / missed;
- consequently, two alternatives / options may be treated as identical although in reality they are different.

Checking the validity of the model is based on the principle of analysis of variance.

Table 3.
Source of variation | Measure of variation | Degree of influence | Degrees of freedom | Corrected variances
Explained by the model | $\Delta^2_{exp} = \sum(\hat{y}_i - \bar{y})^2$ | $\Delta^2_{exp}/\Delta^2_{tot}$ | 1 | $\sigma^2_{exp} = \Delta^2_{exp}/1$
Residual | $\Delta^2_{rez} = \sum(y_i - \hat{y}_i)^2$ | $\Delta^2_{rez}/\Delta^2_{tot}$ | n-2 | $\sigma^2_{rez} = \Delta^2_{rez}/(n-2)$
Total | $\Delta^2_{tot} = \sum(y_i - \bar{y})^2$ | | n-1 | $\sigma^2_{tot} = \Delta^2_{tot}/(n-1)$

It can be shown that the ratio
$$F = \frac{\sigma^2_{exp}}{\sigma^2_{rez}} = \frac{\sum(\hat{y}_i - \bar{y})^2}{\sum(y_i - \hat{y}_i)^2/(n-2)}$$
is a random variable with a Fisher-Snedecor distribution. If $F > F_{\alpha}$ for k-1 and n-k degrees of freedom (1 and n-2 for simple linear regression, where k = 2), then the variation of y is explained by the variation of x.

The ratio
$$R = \sqrt{\frac{\sum(\hat{y}_i - \bar{y})^2}{\sum(y_i - \bar{y})^2}}$$
is called the correlation ratio and expresses the degree to which the model is faithful to the statistical dependence between Y and X. The statistical significance of R can be tested with the F (Fisher-Snedecor) test;

9 As a rule, $\alpha$ = 0.05.


if
$$F_{calc} = (n-2)\,\frac{R^2}{1-R^2} > F_{tab}$$
for k-1 and n-k degrees of freedom, then R is significant (in the case of simple linear regression). In general:
$$F_{calc} = \frac{\text{explained variation}/(k-1)}{\text{unexplained variation}/(n-k)} = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}$$

The value of the F test is used to test the significance of the regression coefficients; the hypothesis tested is that the dependent variable is statistically uncorrelated with the independent variables included in the model.

To determine F_tab, use the function =FINV(probability; degrees of freedom 1; degrees of freedom 2).
Example: =FINV(0.05;1;15)
F_tab 4.543077123

The null hypothesis H0 is formulated as follows: the explained variance is equal to the residual variance. The F test computes the ratio of the two variances and compares the result with a tabulated critical value Fcrit.
- If hypothesis H0 cannot be rejected, the explained variation accounts for only a small share of the total variation in the regression model. In the limit, if R2 = 0, then F = 0. As the value of F increases, the hypothesis that Y is not statistically dependent on the X variables considered becomes easier to reject.
- If Fcalc > Ftab, the null hypothesis can be rejected (the regression coefficients are statistically significant).
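The decomposition of the variation and the F ratio can be illustrated for simple linear regression as below. The data and the function name `anova_simple` are assumptions; the critical value 10.13 is the tabulated F for 1 and 3 degrees of freedom at the 0.05 level (what =FINV(0.05;1;3) would be expected to return), taken here from standard F tables.

```python
from statistics import mean

def anova_simple(x, y):
    """Sums of squares, R2 and F for the line y = a + b*x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
        sum((xi - x_bar) ** 2 for xi in x)
    a = y_bar - b * x_bar
    fitted = [a + b * xi for xi in x]
    ss_exp = sum((f - y_bar) ** 2 for f in fitted)           # explained
    ss_res = sum((yi - f) ** 2 for yi, f in zip(y, fitted))  # residual
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)              # total
    r2 = ss_exp / ss_tot
    f_calc = (ss_exp / 1) / (ss_res / (n - 2))
    return ss_exp, ss_res, ss_tot, r2, f_calc

# Illustrative data (assumed), n = 5, so df = (1, 3)
x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
F_TAB = 10.13  # approximate tabulated value for alpha = 0.05, df = (1, 3)

ss_exp, ss_res, ss_tot, r2, f_calc = anova_simple(x, y)
reject_h0 = f_calc > F_TAB
print(f"R2 = {r2:.4f}, F = {f_calc:.1f}, reject H0: {reject_h0}")
```

The same F value is obtained from the shortcut (n-2) * R2 / (1 - R2), which is a useful cross-check.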
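Statistical annex: The regression line

$$\tilde{y} = a + b x$$
where a and b are the estimates of the theoretical parameters A and B, and $\tilde{y}$ estimates the mean $M[y]$.

To the experimental value $y_i$ there corresponds, on the regression line, the estimated value
$$\tilde{y}_i = a + b x_i$$

The deviations of the values $y_i$ from the estimated values $\tilde{y}_i$ (on the regression line) are:
$$y_i - \tilde{y}_i = y_i - (a + b x_i) = y_i - a - b x_i$$

The parameters a and b are determined from the condition that the sum of squared deviations be a minimum (in this annex R denotes that sum, not the correlation coefficient):
$$R = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \to \min$$

To this end, the expression for R is differentiated with respect to a and b and the derivatives are set equal to zero:
$$\frac{\partial R}{\partial a} = \frac{\partial}{\partial a}\sum_{i=1}^{n} (y_i - a - b x_i)^2 = 0, \qquad \frac{\partial R}{\partial b} = \frac{\partial}{\partial b}\sum_{i=1}^{n} (y_i - a - b x_i)^2 = 0$$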

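This leads to the system of equations:
$$\sum_{i=1}^{n} (y_i - a - b x_i) = 0, \qquad \sum_{i=1}^{n} (y_i - a - b x_i)\,x_i = 0$$

From the first equation:
$$a = \frac{1}{n}\sum_{i=1}^{n} (y_i - b x_i) = \frac{1}{n}\sum_{i=1}^{n} y_i - \frac{b}{n}\sum_{i=1}^{n} x_i = \bar{y} - b\,\bar{x}$$
and, by substitution into the second:
$$\sum_{i=1}^{n} (y_i - \bar{y} + b\,\bar{x} - b x_i)\,x_i = \sum_{i=1}^{n} \left[(y_i - \bar{y}) - b\,(x_i - \bar{x})\right] x_i = 0$$

Hence:
$$b = \frac{\sum_{i=1}^{n} (y_i - \bar{y})\,x_i}{\sum_{i=1}^{n} (x_i - \bar{x})\,x_i} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}$$

This gives the following expressions for the parameters:
$$a = \bar{y} - \frac{\sum x_i y_i - n\,\bar{x}\,\bar{y}}{\sum x_i^2 - n\,\bar{x}^2}\,\bar{x}, \qquad b = \frac{\sum x_i y_i - n\,\bar{x}\,\bar{y}}{\sum x_i^2 - n\,\bar{x}^2}$$
$$\tilde{y}_i = a + b x_i = \bar{y} + \frac{\sum x_i y_i - n\,\bar{x}\,\bar{y}}{\sum x_i^2 - n\,\bar{x}^2}\,(x_i - \bar{x})$$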
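A quick numerical check of the derivation: at the least-squares solution both normal equations should hold (the residuals sum to zero, and so do the residuals weighted by x), and moving a or b away from the solution should increase the sum of squares. The data and helper names below are assumptions made for this sketch.

```python
from statistics import mean

def fit_line(x, y):
    """(a, b) from the closed-form expressions derived above."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    b = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) / \
        (sum(xi ** 2 for xi in x) - n * x_bar ** 2)
    a = y_bar - b * x_bar
    return a, b

def sum_of_squares(x, y, a, b):
    """Sum of squared deviations for a candidate line y = a + b*x."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

# Illustrative data (assumed)
x = [2, 4, 5, 7, 9]
y = [3.1, 4.9, 6.2, 7.8, 10.3]

a, b = fit_line(x, y)
residuals = [yi - a - b * xi for xi, yi in zip(x, y)]
eq1 = sum(residuals)                                 # first normal equation
eq2 = sum(r * xi for r, xi in zip(residuals, x))     # second normal equation
best = sum_of_squares(x, y, a, b)
print(f"eq1 = {eq1:.2e}, eq2 = {eq2:.2e}, minimum = {best:.4f}")
```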

Confidence intervals for the estimated parameters

The regression method itself requires no assumption about the distribution law of the random variable y. This random variable has theoretical mean M[y] = A + Bx and a variance that is constant for all values of x and equal to $\sigma^2$ (a value that is in general unknown). If the distribution of y is normal and the observations are made at random, a confidence interval can be constructed for the parameters of the regression line. The variances of the parameters a and b are given by:
$$\sigma_b^2 = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}, \qquad \sigma_a^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right]$$

With the help of the point estimators $\sigma_a^2$ and $\sigma_b^2$, confidence intervals can be constructed for the parameters, as described earlier. Since $\sigma^2$ is in general unknown, the interval is determined using the residual variance, computed from the deviations of the variable y from the values on the regression line (the estimated values $\tilde{y}_i$):
$$s^2_{y/x} = \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \tilde{y}_i)^2$$
The statistic $(n-2)\,s^2_{y/x}/\sigma^2$ has a $\chi^2$ distribution with n-2 degrees of freedom, which leads to a Student random variable with n-2 degrees of freedom.

For a significance level $\alpha$, the following two-sided confidence intervals are obtained for the true values A and B:
$$B \in \frac{\sum x_i y_i - n\,\bar{x}\,\bar{y}}{\sum x_i^2 - n\,\bar{x}^2} \pm t_{n-2,\,\alpha/2}\,\frac{s_{y/x}}{\sqrt{\sum (x_i - \bar{x})^2}}$$
$$A \in (\bar{y} - b\,\bar{x}) \pm t_{n-2,\,\alpha/2}\; s_{y/x}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}$$
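The two intervals can be sketched numerically as follows. The data are illustrative, the function name `parameter_intervals` is an assumption, and the t value 3.182 is the tabulated two-sided value for 3 degrees of freedom at the 0.05 level (what =TINV(0.05;3) would be expected to return).

```python
from math import sqrt
from statistics import mean

def parameter_intervals(x, y, t_value):
    """Two-sided confidence intervals for A and B of the regression line."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a = y_bar - b * x_bar
    # Residual standard deviation s_{y/x} with n - 2 degrees of freedom
    s_yx = sqrt(sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    half_b = t_value * s_yx / sqrt(s_xx)
    half_a = t_value * s_yx * sqrt(1 / n + x_bar ** 2 / s_xx)
    return {"a": (a - half_a, a + half_a),
            "b": (b - half_b, b + half_b),
            "s_yx": s_yx}

# Illustrative data (assumed), n = 5, so n - 2 = 3 degrees of freedom
x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
T_TAB = 3.182  # tabulated t for alpha = 0.05 (two-sided), df = 3

result = parameter_intervals(x, y, T_TAB)
print(result)
```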
Confidence interval for the mean value of y estimated by regression for a known x*

With the parameters a and b of the regression line determined, for a known (given) value x*, y will have on average the value
$$\tilde{y}^* = a + b x^*$$

The standard normal random variable $(\tilde{y}^* - M[\tilde{y}^*])/\sigma_{\tilde{y}}$ and the variable $(n-2)\,s^2_{y/x}/\sigma^2$, which has a $\chi^2$ distribution with n-2 degrees of freedom, together define a Student variable with n-2 degrees of freedom. In this case, for a significance level $\alpha$, the two-sided confidence interval is given by:
$$M[\tilde{y}^*] = A + B x^* \in \tilde{y}^* \pm t_{n-2,\,\alpha/2}\; s_{y/x}\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
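The interval for the mean response can be sketched in the same way. As before, the data, the function name `mean_response_interval` and the tabulated t value 3.182 (two-sided, 0.05 level, 3 degrees of freedom) are assumptions made for the illustration.

```python
from math import sqrt
from statistics import mean

def mean_response_interval(x, y, x_star, t_value):
    """Confidence interval for M[y*] = A + B*x_star."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a = y_bar - b * x_bar
    s_yx = sqrt(sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    centre = a + b * x_star
    half = t_value * s_yx * sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
    return centre - half, centre + half

# Illustrative data (assumed)
x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.2, 7.9, 10.1]
T_TAB = 3.182  # tabulated t for alpha = 0.05 (two-sided), df = 3

at_mean = mean_response_interval(x, y, 3, T_TAB)   # x* equal to the mean of x
far_out = mean_response_interval(x, y, 10, T_TAB)  # x* far from the data
print(at_mean, far_out)
```

The interval is narrowest when x* equals the mean of the observed x values and widens as x* moves away from it, which is why extrapolation far beyond the data gives imprecise forecasts.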

Multiple regression method

The dependent variable Y is expressed as depending on the variables Xk, considered explanatory factors, for level i of the characteristic:
$$Y_i = a_0 + a_1 x_{i1} + a_2 x_{i2} + \dots + a_n x_{in}$$
(regression equation in additive form) or
$$Y_i = a_0 \cdot x_{i1}^{a_1} \cdot x_{i2}^{a_2} \cdots x_{in}^{a_n}$$
(in multiplicative form).

The distinction between the two forms is fundamental for the economic interpretation of the regression coefficients:
- in the linear case, a coefficient ak, k = 1, ..., n, is the slope of Y with respect to the explanatory variable Xk, that is, the change in Y resulting from a one-unit change in the level of Xk (assuming that all other factors remain constant);
- in the nonlinear case, a coefficient ak is the elasticity coefficient of the explained variable Y with respect to the explanatory variable Xk (it shows the percentage change in Y when the factor Xk changes by one percent and all other factors are constant).
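The elasticity interpretation can be illustrated with a single factor: taking logarithms turns the multiplicative form Y = a0 * X^a1 into the linear form ln Y = ln a0 + a1 * ln X, which ordinary least squares can fit. The sketch below uses synthetic data generated from Y = 3 * X^2 and handles one explanatory variable only; a model with several factors would need a multiple-regression solver, which is not shown here.

```python
from math import exp, log
from statistics import mean

def fit_line(x, y):
    """Least-squares (intercept, slope) for y = intercept + slope*x."""
    x_bar, y_bar = mean(x), mean(y)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
        sum((xi - x_bar) ** 2 for xi in x)
    return y_bar - slope * x_bar, slope

# Synthetic data (assumed): exactly Y = 3 * X^2
x = [1, 2, 4, 8, 16]
y = [3 * xi ** 2 for xi in x]

# Fit the log-linear form ln Y = ln a0 + a1 * ln X
log_a0, a1 = fit_line([log(xi) for xi in x], [log(yi) for yi in y])
a0 = exp(log_a0)
print(f"a0 = {a0:.4f}, elasticity a1 = {a1:.4f}")
```

Here a1 is read as an elasticity: a 1% increase in X goes with roughly an a1% increase in Y.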

Logistic regression method

Logistic regression models the relationship between a set of independent variables xi (categorical, continuous) and a dichotomous (nominal, binary) dependent variable Y. Such a dependent variable typically arises when it represents membership in one of two classes or categories: presence/absence, yes/no, etc. The resulting regression equation, of a different type from the other regressions discussed, provides information about:
- the importance of the variables in differentiating the classes;
- the classification of an observation into a class.

Note that the scatter plot of the values gives no indication of the dependences. In such cases, classical linear regression does not provide an adequate model.


We assume that the values of y (a binary variable) are coded 0/1, the value 1 generally expressing the occurrence of a certain event, so that what is sought is an estimate of the probability of that event occurring as a function of the values of the independent variables.

The case of a single independent variable

The model is:
$$P(y=1 \mid x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$
or
$$\ln\left(\frac{P(y=1 \mid x)}{1 - P(y=1 \mid x)}\right) = \beta_0 + \beta_1 x$$

The quantity on the left-hand side is called the logit (transformation) of the probability P(y=1|x). The meaning of the expression P(y=1|x) is straightforward: the probability of the value y=1 conditional on the value x. In other words, it is the probability of classifying observation x into class y=1, or the probability that the value x is associated with the occurrence of the event y=1. In what follows P(y=1|x) is denoted by p, following the notation of the binomial probability model (probability of success).

The logit transformation is needed to map the probability p from the interval (0,1) onto the interval $(-\infty, +\infty)$, which is necessary in the parameter estimation process. The model is directly linked to the notion of odds (ratio of chances), denoted OR:
$$OR = \frac{p}{1-p}$$
which is the ratio of the probability of success to the probability of failure. The model can also be written:
$$\frac{p}{1-p} = e^{\beta_0 + \beta_1 x}$$

To determine the regression coefficients, SOLVER in EXCEL is used, with the following computation:
$$p(x) = \frac{e^{L}}{1 + e^{L}}, \qquad L = \beta_0 + \beta_1 x$$
$$\ell(x_i) = p(x_i)^{y_i}\,\bigl(1 - p(x_i)\bigr)^{1 - y_i}$$
Maximizing the logarithm of the likelihood function:
$$\max \sum_{i=1}^{n} \ln \ell(x_i)$$
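The maximization that the text delegates to SOLVER can be imitated with a plain gradient-ascent loop on the log-likelihood. This is a minimal sketch under stated assumptions: the data are invented, the learning rate and iteration count were chosen by hand for this small data set, and gradient ascent stands in for whatever algorithm SOLVER actually applies.

```python
from math import exp, log

def p_of(x, b0, b1):
    """P(y=1|x) = e^L / (1 + e^L) with L = b0 + b1*x."""
    return 1.0 / (1.0 + exp(-(b0 + b1 * x)))

def log_likelihood(xs, ys, b0, b1):
    """Sum over observations of ln[p^y * (1-p)^(1-y)]."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = p_of(x, b0, b1)
        total += y * log(p) + (1 - y) * log(1 - p)
    return total

def fit_logistic(xs, ys, rate=0.01, iterations=20000):
    """Gradient ascent on the log-likelihood, starting from b0 = b1 = 0."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        errors = [y - p_of(x, b0, b1) for x, y in zip(xs, ys)]
        b0 += rate * sum(errors)                              # d lnL / d b0
        b1 += rate * sum(e * x for e, x in zip(errors, xs))   # d lnL / d b1
    return b0, b1

# Hypothetical observations: the event (y = 1) becomes more frequent as x grows
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, P(y=1|x=8) = {p_of(8, b0, b1):.3f}")
```

Note that the two classes in the example overlap; with perfectly separated classes the likelihood has no finite maximum and the loop would not settle.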


Annex: Methods based on hypothesis testing

At various stages of analysing the numerical characteristics of a statistical population, the need often arises to formulate and test hypotheses about the nature or the values of parameters of the theoretical random variables associated with the characteristics studied. Any assumption about the distribution or the characteristics of the random variable X, formulated on the basis of prior information about X, is called a statistical hypothesis.

On the basis of the available information, the analyst / researcher makes a hypothesis about the characteristic, called the basic hypothesis and denoted H0, against which there may be one or more alternative hypotheses Ha. For simplicity, we may consider that there is a single alternative hypothesis Ha to the basic hypothesis H0 (if H0 is false, then its alternative Ha is true).

If a statistical hypothesis is to be accepted or rejected depending on the data of one or more samples, the hypothesis is said to be tested; the hypothesis tested is called the basic hypothesis or null hypothesis. An alternative hypothesis is one that may be true when H0 is false and that could be accepted when the basic hypothesis is rejected.

Statistical hypotheses are checked with specific methods called statistical tests. A statistical test is a method by which, on the basis of sample data, a basic hypothesis is either accepted or rejected. If the null hypothesis H0 has a single alternative Ha and a statistical test leads to the rejection of H0, then Ha is accepted. If the null hypothesis has several alternatives, rejecting it implies accepting one of its alternatives, without specifying which of them is true. The decision rule by which the null hypothesis is accepted or rejected is based on a test criterion (in general, a conveniently chosen sample function is used).

Let H0 be a basic statistical hypothesis; a sample function C(x,n) is called a test criterion for H0 if the following conditions are met:
a. the distribution of the random variable C(X,n) depends on whether H0 is true or false;
b. if H0 is true, then C(X,n) has a completely specified distribution.

In general, testing the hypothesis H0 proceeds as follows:
- a set I of real numbers, usually an interval, is fixed. The set I is called the rejection region or critical region;
- a sample of size n is drawn from the population studied, giving successively the values x1, x2, ..., xn for the numerical characteristic analysed. If C(x1, x2, ..., xn) ∉ I, the null hypothesis H0 is accepted; otherwise, H0 is rejected.

When a statistical hypothesis is tested, errors can occur:
- although the basic hypothesis H0 is true, it is rejected as a result of the test; this is called a type I error;
- although H0 is false, it is accepted as true; this is called a type II error.

Clearly, when testing a statistical hypothesis it is desirable that the danger of committing an error be as small as possible. The significance level $\alpha$ of a statistical test is the maximum accepted probability of committing a type I error. The probability of committing a type II error is called the type II risk and is denoted $\beta$. The way the test criterion is defined makes it possible to control type I and type II errors. To control type II errors, instead of the type II risk $\beta$ one also uses the power of the test $\pi = 1 - \beta$, defined as the probability of rejecting the null hypothesis when it is false.
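The testing procedure described above can be illustrated with a two-sided z test for a mean with known standard deviation. This particular test, the data and the function name `z_test_mean` are assumptions chosen for the sketch: the criterion C is the z statistic, and the rejection region I consists of the values with |z| above the critical value. `NormalDist().inv_cdf` from the Python standard library plays the role of =NORMSINV.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_test_mean(sample, mu0, sigma, alpha=0.05):
    """Two-sided z test of H0: mean = mu0, with known sigma."""
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))  # test criterion
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)            # like =NORMSINV(1 - alpha/2)
    reject = abs(z) > z_crit       # True when z falls in the rejection region
    return z, z_crit, reject

# 16 hypothetical measurements with sample mean 5.0
sample = [5.1, 4.9, 5.0, 5.2, 4.8, 5.3, 5.0, 4.7,
          5.1, 5.0, 4.9, 5.2, 5.1, 4.8, 5.0, 4.9]

z_same, z_crit, reject_same = z_test_mean(sample, mu0=5.0, sigma=0.2)
z_diff, _, reject_diff = z_test_mean(sample, mu0=4.8, sigma=0.2)
print(f"z = {z_same:.2f}, reject H0: {reject_same}")
print(f"z = {z_diff:.2f}, reject H0: {reject_diff}")
```

Lowering alpha widens the acceptance region and so reduces the type I risk, at the price of a higher type II risk for a fixed sample size.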


Annex 2: Functions for applying the regression method in EXCEL


Excel includes several array functions for performing linear regression (LINEST, TREND, FORECAST, SLOPE, and STEYX) and exponential regression (LOGEST and GROWTH). These functions are entered as array formulas and they produce array results. You can use each of these functions with one or several independent variables.

The following list provides a definition of the different types of regression:
- Linear regression produces the slope of a line that best fits a single set of data. Based on a year's worth of sales figures, for example, linear regression can tell you the projected sales for March of the following year by giving you the slope and y-intercept (that is, the point where the line crosses the y-axis) of the line that best fits the sales data. By following the line forward in time, you can estimate future sales, if you can safely assume that growth will remain linear.
- Exponential regression produces an exponential curve that best fits a set of data that you suspect does not change linearly with time. For example, a series of measurements of population growth will nearly always be better represented by an exponential curve than by a line.
- Multiple regression is the analysis of more than one set of data, which often produces a more realistic projection. You can perform both linear and exponential multiple regression analyses. For example, suppose you want to project the appropriate price for a house in your area based on square footage, number of bathrooms, lot size, and age. Using a multiple regression formula, you can estimate a price, based on a database of information gathered from existing houses.

INTERCEPT
=INTERCEPT(known_y's,known_x's)
Known_y's is the dependent set of observations or data.
Known_x's is the independent set of observations or data.
Remarks
- The arguments should be either numbers or names, arrays, or references that contain numbers.
- If an array or reference argument contains text, logical values, or empty cells, those values are ignored; however, cells with the value zero are included.
- If known_y's and known_x's contain a different number of data points or contain no data points, INTERCEPT returns the #N/A error value.

SLOPE
The SLOPE function returns the slope of the linear regression line. The slope is defined as the vertical distance divided by the horizontal distance between any two points on the regression line. Its value is the same as the first number in the array returned by the LINEST function. In other words, SLOPE calculates the trajectory of the line used by the FORECAST and TREND functions to calculate the values of data points.
=SLOPE(known_y's,known_x's)
where:
Known_y's is an array or cell range of numeric dependent data points.
Known_x's is the set of independent data points.
Remarks
- The arguments must be either numbers or names, arrays, or references that contain numbers.
- If an array or reference argument contains text, logical values, or empty cells, those values are ignored; however, cells with the value zero are included.
- If known_y's and known_x's are empty or have a different number of data points, SLOPE returns the #N/A error value.

LINEST
Calculates the statistics for a line by using the "least squares" method to calculate a straight line that best fits your data, and then returns an array that describes the line. You can also combine LINEST with other functions to calculate the statistics for other types of models that are linear in the unknown parameters, including polynomial, logarithmic, exponential, and power series. Because this function returns an array of values, it must be entered as an array formula.

The equation for the line is:
y = mx + b
or
y = m1x1 + m2x2 + ... + b (if there are multiple ranges of x-values)
where the dependent y-value is a function of the independent x-values. The m-values are coefficients corresponding to each x-value, and b is a constant value. Note that y, x, and m can be vectors. The array that LINEST returns is {mn,mn-1,...,m1,b}. LINEST can also return additional regression statistics.

The LINEST and LOGEST functions return only the y-axis coordinates used for calculating lines and curves. The difference between them is that LINEST projects a straight line and LOGEST projects an exponential curve.

LINEST(known_y's,known_x's,const,stats)
Known_y's is the set of y-values you already know in the relationship y = mx + b.


If the array known_y's is in a single column, then each column of known_x's is interpreted as a separate variable. If the array known_y's is in a single row, then each row of known_x's is interpreted as a separate variable.

Known_x's is an optional set of x-values that you may already know in the relationship y = mx + b. The array known_x's can include one or more sets of variables. If only one variable is used, known_y's and known_x's can be ranges of any shape, as long as they have equal dimensions. If more than one variable is used, known_y's must be a vector (that is, a range with a height of one row or a width of one column). If known_x's is omitted, it is assumed to be the array {1,2,3,...} that is the same size as known_y's.

Const is a logical value specifying whether to force the constant b to equal 0. If const is TRUE or omitted, b is calculated normally. If const is FALSE, b is set equal to 0 and the m-values are adjusted to fit y = mx.

Stats is a logical value specifying whether to return additional regression statistics. If stats is TRUE, LINEST returns the additional regression statistics, so the returned array is {mn,mn-1,...,m1,b;sen,sen-1,...,se1,seb;r2,sey;F,df;ssreg,ssresid}. If stats is FALSE or omitted, LINEST returns only the m-coefficients and the constant b.

Statistic          Description

se1,se2,...,sen    The standard error values for the coefficients m1,m2,...,mn.

seb                The standard error value for the constant b (seb = #N/A when const is FALSE).

r2                 The coefficient of determination. Compares estimated and actual y-values, and ranges in value from 0 to 1. If it is 1, there is a perfect correlation in the sample: there is no difference between the estimated y-value and the actual y-value. At the other extreme, if the coefficient of determination is 0, the regression equation is not helpful in predicting a y-value. For information about how r2 is calculated, see "Remarks" later in this topic.

sey                The standard error for the y estimate.

F                  The F statistic, or the F-observed value. Use the F statistic to determine whether the observed relationship between the dependent and independent variables occurs by chance.

df                 The degrees of freedom. Use the degrees of freedom to help you find F-critical values in a statistical table. Compare the values you find in the table to the F statistic returned by LINEST to determine a confidence level for the model. For information about how df is calculated, see "Remarks" later in this topic. Example 4 below shows use of F and df.

ssreg              The regression sum of squares.

ssresid            The residual sum of squares. For information about how ssreg and ssresid are calculated, see "Remarks" later in this topic.
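To make the statistics listed above concrete, the following Python sketch computes them for the simplest case: one independent variable with const = TRUE. It is an illustration under stated assumptions, not Excel's implementation; the function name linest_stats and the sample data are hypothetical.

```python
def linest_stats(xs, ys):
    """Fit y = m*x + b by least squares and return LINEST-style statistics.

    Sketch for ONE x-variable with an intercept (const = TRUE) only.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

    m = sxy / sxx                # slope
    b = y_mean - m * x_mean      # constant

    ssresid = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    sstotal = sum((y - y_mean) ** 2 for y in ys)
    ssreg = sstotal - ssresid

    df = n - 2                   # n points minus two estimated parameters
    sey = (ssresid / df) ** 0.5  # standard error of the y estimate
    return {
        "m": m,
        "b": b,
        "se_m": sey / sxx ** 0.5,
        "se_b": sey * (1.0 / n + x_mean ** 2 / sxx) ** 0.5,
        "r2": ssreg / sstotal,
        "sey": sey,
        "F": ssreg / (ssresid / df),  # one regressor, so the numerator df is 1
        "df": df,
        "ssreg": ssreg,
        "ssresid": ssresid,
    }


# Hypothetical sample data, roughly following y = 2x.
stats = linest_stats([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
for name, value in stats.items():
    print(name, round(value, 4))
```

For this sample the slope is about 1.99 and the constant about 0.05. The sey value is the same quantity that the STEYX function reports for a single x-variable.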

You can use the F statistic to determine whether these results, with such a high r2 value, occurred by chance. Assume for the moment that in fact there is no relationship among the variables, but that you have drawn a rare sample of 11 office buildings that causes the statistical analysis to demonstrate a strong relationship. The term "alpha" is used for the probability of erroneously concluding that there is a relationship.

F and df in the LINEST output can be used to assess the likelihood of a higher F value occurring by chance. F can be compared with critical values in published F-distribution tables, or Excel's FDIST can be used to calculate the probability of a larger F value occurring by chance. The appropriate F distribution has v1 and v2 degrees of freedom. If n is the number of data points and const = TRUE or omitted, then v1 = n - df - 1 and v2 = df. (If const = FALSE, then v1 = n - df and v2 = df.) Excel's FDIST(F,v1,v2) will return the probability of a higher F value occurring by chance.

=FDIST(x,degrees_freedom1,degrees_freedom2)

X is the value at which to evaluate the function.

Degrees_freedom1 is the numerator degrees of freedom.

Degrees_freedom2 is the denominator degrees of freedom.

Remarks

If any argument is nonnumeric, FDIST returns the #VALUE! error value.

If x is negative, FDIST returns the #NUM! error value.

If degrees_freedom1 or degrees_freedom2 is not an integer, it is truncated.

If degrees_freedom1 < 1 or degrees_freedom1 ≥ 10^10, FDIST returns the #NUM! error value.

If degrees_freedom2 < 1 or degrees_freedom2 ≥ 10^10, FDIST returns the #NUM! error value.

FDIST is calculated as FDIST = P(F > x), where F is a random variable that has an F distribution with degrees_freedom1 and degrees_freedom2 degrees of freedom.
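The upper-tail probability P(F > x) that FDIST returns can be sketched in plain Python through the regularized incomplete beta function. This is a minimal illustration, not Excel's algorithm: the function names are hypothetical, the continued-fraction evaluation is a standard textbook method chosen for this sketch, and Excel's argument checks (truncation, #NUM! limits) are not reproduced.

```python
import math


def _beta_cf(a, b, x, max_iter=200, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300

    def guard(v):
        # Avoid division by (near) zero inside the recurrence.
        return v if abs(v) > tiny else tiny

    c = 1.0
    d = 1.0 / guard(1.0 - (a + b) * x / (a + 1.0))
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((a - 1.0 + m2) * (a + m2))
        d = 1.0 / guard(1.0 + aa * d)
        c = guard(1.0 + aa / c)
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (a + b + m) * x / ((a + m2) * (a + 1.0 + m2))
        d = 1.0 / guard(1.0 + aa * d)
        c = guard(1.0 + aa / c)
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(
        math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        + a * math.log(x) + b * math.log(1.0 - x)
    )
    # Use the symmetry relation where the continued fraction converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_upper_tail(x, v1, v2):
    """P(F > x) for an F distribution with v1 and v2 degrees of freedom."""
    return reg_inc_beta(v2 / 2.0, v1 / 2.0, v2 / (v2 + v1 * x))


# Approximately 0.01 for x = 15.20675 with 6 and 4 degrees of freedom.
print(f_upper_tail(15.20675, 6, 4))
```

A small returned probability means that an F value this large would rarely arise by chance if there were no relationship among the variables.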


FINV

Returns the inverse of the F probability distribution. If p = FDIST(x,...), then FINV(p,...) = x. The F distribution can be used in an F-test that compares the degree of variability in two data sets. For example, you can analyze income distributions in the United States and Canada to determine whether the two countries have a similar degree of diversity.

=FINV(probability,degrees_freedom1,degrees_freedom2)

Probability is a probability associated with the F cumulative distribution.

Degrees_freedom1 is the numerator degrees of freedom.

Degrees_freedom2 is the denominator degrees of freedom.

Remarks

If any argument is nonnumeric, FINV returns the #VALUE! error value.

If probability < 0 or probability > 1, FINV returns the #NUM! error value.

If degrees_freedom1 or degrees_freedom2 is not an integer, it is truncated.

If degrees_freedom1 < 1 or degrees_freedom1 ≥ 10^10, FINV returns the #NUM! error value.

If degrees_freedom2 < 1 or degrees_freedom2 ≥ 10^10, FINV returns the #NUM! error value.

FINV can be used to return critical values from the F distribution. For example, the output of an ANOVA calculation often includes data for the F statistic, F probability, and F critical value at the 0.05 significance level. To return the critical value of F, use the significance level as the probability argument to FINV.

FINV uses an iterative technique for calculating the function. Given a probability value, FINV iterates until the result is accurate to within 3x10^-7. If FINV does not converge after 100 iterations, the function returns the #N/A error value.

Example

FINV(0.01,6,4) equals 15.20675

Calculating the t-Statistics

Another hypothesis test will determine whether each slope coefficient is useful in estimating the assessed value. The critical value can also be found using Excel's TINV function.

=TINV(probability,degrees_freedom)

Probability is the probability associated with the two-tailed Student's t-distribution.
Degrees_freedom is the number of degrees of freedom with which to characterize the distribution.

Remarks

If either argument is nonnumeric, TINV returns the #VALUE! error value.

If probability < 0 or if probability > 1, TINV returns the #NUM! error value.

If degrees_freedom is not an integer, it is truncated.

If degrees_freedom < 1, TINV returns the #NUM! error value.

TINV returns that value t, such that P(|X| > t) = probability, where X is a random variable that follows the t-distribution and P(|X| > t) = P(X < -t or X > t).

A one-tailed t-value can be returned by replacing probability with 2*probability. For a probability of 0.05 and degrees of freedom of 10, the two-tailed value is calculated with TINV(0.05,10), which returns 2.228139. The one-tailed value for the same probability and degrees of freedom can be calculated with TINV(2*0.05,10), which returns 1.812462.

Note: In some tables, probability is described as (1-p).

Given a value for probability, TINV seeks that value x such that TDIST(x,degrees_freedom,2) = probability. Thus, the precision of TINV depends on the precision of TDIST. TINV uses an iterative search technique. If the search has not converged after 100 iterations, the function returns the #N/A error value.

The STEYX function calculates the standard error of a regression, a measure of the amount of error accrued in predicting a y for each given x. This function takes the form =STEYX(known_y's, known_x's).

The TREND function

LINEST returns a mathematical description of the straight line that best fits known data. TREND finds points that lie along that line and that fall into the unknown category. You can use the numbers returned by TREND to plot a trend line, a straight line that helps make sense of actual data. You can also use TREND to extrapolate, or make intelligent guesses about, future data based on the tendencies exhibited by known data. (Be careful. Although you can use TREND to plot the straight line that best fits the known data, TREND can't tell you if that line is a good predictor of the future. Validation statistics returned by LINEST can help you make that assessment.)

The TREND function takes the form =TREND(known_y's, known_x's, new_x's, const).

The first two arguments represent the known values of your dependent and independent variables. As in LINEST, the known_y's argument is a single column, a single row, or a rectangular range. The known_x's argument also follows the pattern described for LINEST. The third and fourth arguments are optional. If you omit new_x's, the TREND function considers new_x's to be identical to known_x's. If you include const, the value of that argument must be TRUE or FALSE (or 1 or 0). If const is FALSE, TREND forces b to be 0; if const is TRUE or omitted, b is calculated normally.
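The behavior described for TREND can be sketched in Python for a single independent variable. This is only an illustration of the idea (fit the least-squares line, then evaluate it at new x-values); the function name trend and its keyword arguments are hypothetical stand-ins for the Excel arguments, and ranges with several x-variables are not handled.

```python
def trend(known_ys, known_xs=None, new_xs=None, const=True):
    """Return points on the least-squares line y = m*x + b (one x-variable)."""
    n = len(known_ys)
    if known_xs is None:
        # Same default as described above: the array {1, 2, 3, ...}.
        known_xs = list(range(1, n + 1))
    if new_xs is None:
        new_xs = known_xs

    if const:
        x_mean = sum(known_xs) / n
        y_mean = sum(known_ys) / n
        sxx = sum((x - x_mean) ** 2 for x in known_xs)
        sxy = sum((x - x_mean) * (y - y_mean)
                  for x, y in zip(known_xs, known_ys))
        m = sxy / sxx
        b = y_mean - m * x_mean
    else:
        # const = FALSE: the line is forced through the origin (b = 0).
        m = (sum(x * y for x, y in zip(known_xs, known_ys))
             / sum(x * x for x in known_xs))
        b = 0.0
    return [m * x + b for x in new_xs]


ys = [3, 5, 7, 9]                 # exactly y = 2x + 1 for x = 1..4
print(trend(ys))                  # fitted values at the known x-values
print(trend(ys, new_xs=[5, 6]))   # extrapolation to x = 5 and x = 6
```

As the text cautions, the extrapolated values are only as good as the fitted line; the sketch returns them without any measure of reliability.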


To calculate the trend-line data points that best fit your known data, simply omit the third and fourth arguments from this function. The results array will be the same size as the known_x's range. To create these values, we selected the range ...... and entered =TREND(....., .....) as an array formula using Ctrl+Shift+Enter.

Calculating exponential regression

Unlike linear regression, which plots values along a straight line, exponential regression describes a curve by calculating the array of values needed to plot it. The equation that describes an exponential regression curve is

y = b * m1^x1 * m2^x2 * ... * mn^xn

If you have only one independent variable, the equation is

y = b * m^x

The LOGEST function

The LOGEST function works like LINEST, except that you use it to analyze data that is nonlinear, and it returns the coefficients of an exponential curve instead of a straight line. LOGEST returns coefficient values for each independent variable plus a value for the constant b. This function takes the form =LOGEST(known_y's, known_x's, const, stats).

LOGEST accepts the same arguments as the LINEST function and returns a result array in the same fashion. If you set the optional stats argument to TRUE, the function also returns validation statistics.

Note: The LINEST and LOGEST functions return only the coefficients used for calculating lines and curves, not the fitted points themselves. The difference between them is that LINEST describes a straight line and LOGEST describes an exponential curve. You must be careful to match the appropriate function to the analysis at hand. The LINEST function might be more appropriate for sales projections, and the LOGEST function might be more suited to applications such as statistical analyses or population trends.

=LOGEST(known_y's,known_x's,const,stats)

Known_y's is the set of y-values you already know in the relationship y = b*m^x. If the array known_y's is in a single column, then each column of known_x's is interpreted as a separate variable.
If the array known_y's is in a single row, then each row of known_x's is interpreted as a separate variable.

Known_x's is an optional set of x-values that you may already know in the relationship y = b*m^x. The array known_x's can include one or more sets of variables. If only one variable is used, known_y's and known_x's can be ranges of any shape, as long as they have equal dimensions. If more than one variable is used, known_y's must be a range of cells with a height of one row or a width of one column (which is also known as a vector). If known_x's is omitted, it is assumed to be the array {1,2,3,...} that is the same size as known_y's.

Const is a logical value specifying whether to force the constant b to equal 1. If const is TRUE or omitted, b is calculated normally. If const is FALSE, b is set equal to 1, and the m-values are fitted to y = m^x.

Stats is a logical value specifying whether to return additional regression statistics. If stats is TRUE, LOGEST returns the additional regression statistics, so the returned array is {mn,mn-1,...,m1,b;sen,sen-1,...,se1,seb;r2,sey;F,df;ssreg,ssresid}. If stats is FALSE or omitted, LOGEST returns only the m-coefficients and the constant b. For more information about additional regression statistics, see LINEST.

Remarks

The more a plot of your data resembles an exponential curve, the better the calculated line will fit your data. Like LINEST, LOGEST returns an array of values that describes a relationship among the values, but LINEST fits a straight line to your data; LOGEST fits an exponential curve. For more information, see LINEST.

When you have only one independent x-variable, you can obtain y-intercept (b) values directly by using the following formula:

Y-intercept (b): INDEX(LOGEST(known_y's,known_x's),2)

You can use the y = b*m^x equation to predict future values of y, but Microsoft Excel provides the GROWTH function to do this for you. For more information, see GROWTH.
Formulas that return arrays must be entered as array formulas.

When entering an array constant such as known_x's as an argument, use commas to separate values in the same row and semicolons to separate rows. Separator characters may be different depending on your locale setting in Regional Settings or Regional Options in Control Panel.

You should note that the y-values predicted by the regression equation may not be valid if they are outside the range of y-values you used to determine the equation.

The GROWTH function

Where the LOGEST function returns a mathematical description of the exponential regression curve that best fits a set of known data, the GROWTH function finds points that lie along that curve. The GROWTH function works like its linear counterpart, TREND, and takes the form =GROWTH(known_y's, known_x's, new_x's, const).


=GROWTH(known_y's,known_x's,new_x's,const)

Known_y's is the set of y-values you already know in the relationship y = b*m^x. If the array known_y's is in a single column, then each column of known_x's is interpreted as a separate variable. If the array known_y's is in a single row, then each row of known_x's is interpreted as a separate variable. If any of the numbers in known_y's is 0 or negative, GROWTH returns the #NUM! error value.

Known_x's is an optional set of x-values that you may already know in the relationship y = b*m^x. The array known_x's can include one or more sets of variables. If only one variable is used, known_y's and known_x's can be ranges of any shape, as long as they have equal dimensions. If more than one variable is used, known_y's must be a vector (that is, a range with a height of one row or a width of one column). If known_x's is omitted, it is assumed to be the array {1,2,3,...} that is the same size as known_y's.

New_x's are new x-values for which you want GROWTH to return corresponding y-values. New_x's must include a column (or row) for each independent variable, just as known_x's does. So, if known_y's is in a single column, known_x's and new_x's must have the same number of columns. If known_y's is in a single row, known_x's and new_x's must have the same number of rows. If new_x's is omitted, it is assumed to be the same as known_x's. If both known_x's and new_x's are omitted, they are assumed to be the array {1,2,3,...} that is the same size as known_y's.

Const is a logical value specifying whether to force the constant b to equal 1. If const is TRUE or omitted, b is calculated normally. If const is FALSE, b is set equal to 1 and the m-values are adjusted so that y = m^x.

Remarks

Formulas that return arrays must be entered as array formulas after selecting the correct number of cells.

When entering an array constant for an argument such as known_x's, use commas to separate values in the same row and semicolons to separate rows.
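The relationship between LOGEST and GROWTH can be sketched in Python for one independent variable. The sketch assumes that the exponential curve y = b*m^x is fitted by ordinary least squares on ln(y) = x*ln(m) + ln(b); that log-linear approach is an assumption of this example, and the function names logest and growth are hypothetical stand-ins, not Excel's implementation.

```python
import math


def logest(known_ys, known_xs=None, const=True):
    """Return (m, b) for y = b * m**x, fitted on the logarithm of y."""
    if any(y <= 0 for y in known_ys):
        # Mirrors the #NUM! behavior described above for non-positive y-values.
        raise ValueError("known_ys must all be positive")
    n = len(known_ys)
    if known_xs is None:
        known_xs = list(range(1, n + 1))
    log_ys = [math.log(y) for y in known_ys]

    if const:
        x_mean = sum(known_xs) / n
        ly_mean = sum(log_ys) / n
        sxx = sum((x - x_mean) ** 2 for x in known_xs)
        sxy = sum((x - x_mean) * (ly - ly_mean)
                  for x, ly in zip(known_xs, log_ys))
        slope = sxy / sxx
        intercept = ly_mean - slope * x_mean
    else:
        # const = FALSE: b is forced to 1, so ln(b) = 0.
        slope = (sum(x * ly for x, ly in zip(known_xs, log_ys))
                 / sum(x * x for x in known_xs))
        intercept = 0.0
    return math.exp(slope), math.exp(intercept)


def growth(known_ys, known_xs=None, new_xs=None, const=True):
    """Return points on the fitted exponential curve at new_xs."""
    if known_xs is None:
        known_xs = list(range(1, len(known_ys) + 1))
    if new_xs is None:
        new_xs = known_xs
    m, b = logest(known_ys, known_xs, const)
    return [b * m ** x for x in new_xs]


ys = [6, 12, 24, 48]              # exactly y = 3 * 2**x for x = 1..4
print(logest(ys))                 # m close to 2, b close to 3
print(growth(ys, new_xs=[5]))     # close to 96
```

The first function plays the role of LOGEST (it describes the curve), while the second plays the role of GROWTH (it returns points on that curve).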
