
Tutorial for the Generalized Linear Model in SPSS: http://faculty.chass.ncsu.edu/garson/PA765/gzlm_gee.htm

Poisson / negative binomial regression


Notes for this part:

1. http://data.princeton.edu/wws509/notes/; Poisson Models for Count Data and Generalized Linear Model Theory (for the validation tests).

2. http://www.ed.uiuc.edu/courses/EDpsy490AT/lectures/4glm1.pdf

3. http://www.ed.uiuc.edu/courses/EdPsy490AT/lectures/4glm3-ha-online.pdf

A count variable is a variable that can take only non-negative integer values, which arise from counting. Examples include the number of times a person is arrested in a year, the number of cigarettes smoked per day, or the number of emergencies registered at a hospital in a week. Treated as random variables, count data most often follow a Poisson, binomial or negative binomial distribution.

If Y is a count variable, we are often interested in how far the mean of Y is influenced by certain factors X. The most immediate approach would be a linear model $Y_i = \alpha + \beta x_i + e_i$, that is $E(Y_i) = \alpha + \beta x_i$, with the requirement $Y_i \sim N(\mu, \sigma^2)$, estimated by OLS. For count variables, classical linear models are not adequate. Suitable methods for count data include the Poisson regression model, the negative binomial model, hurdle models and random-effects count models. The most popular of these is the Poisson regression model, a particular case of the Generalised Linear Model (GLM).

In models of the GLM class, the response Y can have a distribution other than the normal one: any distribution from the exponential family. For the exponential family, the probability density function has the form $f(y_i; \theta_i) = \exp\{y_i b(\theta_i) - c(\theta_i) + d(y_i)\}$. In addition, the relationship between the response (dependent) variable and the explanatory variables can be other than the identity: $g(\mu) = \alpha + \beta x$, where $\mu = E(Y)$. The estimators are obtained by maximum likelihood. Thus the relationship between $E(Y) = \mu$ and the explanatory variables can be non-linear, and it is modelled through a monotone link function $g(\mu)$.

In general, the likelihood function associated with a sample of n random observations $Y_1, Y_2, \dots, Y_n$ on the random variable Y is defined through the multivariate probability density, which, when the sample variables $Y_1, Y_2, \dots, Y_n$ are independent and identically distributed, becomes:

$L(\theta; y_1, y_2, \dots, y_n) = f(y_1, y_2, \dots, y_n; \theta) = \prod_{i=1}^{n} f(y_i; \theta)$

where $\theta = (\theta_1, \theta_2, \dots, \theta_m)$ is the vector of (unknown) parameters and $f(y_i; \theta)$ is the probability density of the variable Y. Maximizing this function with respect to the parameters yields estimators for the parameter vector $\theta$. Usually its logarithm is maximized instead:

$\log L(\theta; y_1, y_2, \dots, y_n) = \log \prod_{i=1}^{n} f(y_i; \theta) = \sum_{i=1}^{n} \log f(y_i; \theta)$

The idea of the method: for which values of the parameters $\theta = (\theta_1, \theta_2, \dots, \theta_m)$ is the probability that $(Y_1, Y_2, \dots, Y_n)$ takes the values $(y_1, y_2, \dots, y_n)$ observed in the sample at its maximum?
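The idea of maximum likelihood can be illustrated with a minimal sketch. It assumes an i.i.d. Poisson sample with a single unknown mean; the counts in `sample` and the helper name `poisson_loglik` are invented for illustration and do not come from the notes. The log-likelihood is evaluated on a grid of candidate means, and the maximizer coincides with the sample mean.

```python
import math

def poisson_loglik(mu, ys):
    """Log-likelihood of an i.i.d. Poisson(mu) sample: sum of log f(y_i; mu)."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in ys)

# Hypothetical sample of counts (illustrative values only).
sample = [0, 1, 0, 2, 1, 0, 3, 1]

# Evaluate the log-likelihood on a grid of candidate means 0.01, 0.02, ..., 5.00.
grid = [k / 100 for k in range(1, 501)]
mu_hat = max(grid, key=lambda mu: poisson_loglik(mu, sample))

sample_mean = sum(sample) / len(sample)
print(mu_hat, sample_mean)
```

A grid search is used only to keep the example transparent; in practice the maximum is found analytically or by an iterative algorithm.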

The Poisson model

Suppose we have a sample of n observations $y_1, \dots, y_n$ that can be treated as realizations of independent Poisson random variables, $Y_i \sim P(\mu_i)$, and suppose we want to let the mean $\mu_i = E(Y_i)$ depend on a vector of explanatory variables $x_i$. We could estimate a linear model of the form $\mu_i = x_i'\beta$, but this model has the disadvantage that the linear predictor on the right-hand side can take any real value, whereas the Poisson mean on the left-hand side must be non-negative. A simple solution is to model the logarithm of the mean with a linear model. We therefore consider a GLM with a log link:

$\log(\mu_i) = x_i'\beta$

For the Poisson distribution we have:

$f(y; \mu) = \frac{\mu^y e^{-\mu}}{y!} = \exp\left(\log \frac{\mu^y e^{-\mu}}{y!}\right)$

$f(y; \mu) = \exp(y \log \mu - \mu - \log y!) = \exp(y\, b(\mu) - c(\mu) + d(y))$, with $b(\mu) = \log \mu$, $c(\mu) = \mu$ and $d(y) = -\log y!$.

The Poisson distribution therefore belongs to the exponential family, with the logarithm as its link function. The random component in this case is the response variable Y, which follows a Poisson law, and the systematic component is given by the vector of regressors, which enter the model in a linear structure: $\eta_i = x_i'\beta$. The connection between the two components is given by the link function, which in this case is $\log$. The log link is the canonical link for the Poisson distribution.
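As a quick numerical check of the exponential-family form, the sketch below compares the Poisson probability written directly with the same probability written as $\exp(y\,b(\mu) - c(\mu) + d(y))$. It assumes the sign convention $b(\mu) = \log\mu$, $c(\mu) = \mu$, $d(y) = -\log y!$; the function names and the value of `mu` are illustrative.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability written directly: mu^y * exp(-mu) / y!."""
    return mu ** y * math.exp(-mu) / math.factorial(y)

def poisson_pmf_expfam(y, mu):
    """Same probability in exponential-family form exp(y*b(mu) - c(mu) + d(y))."""
    b = math.log(mu)           # b(mu) = log(mu), the natural parameter
    c = mu                     # c(mu) = mu
    d = -math.lgamma(y + 1)    # d(y) = -log(y!)
    return math.exp(y * b - c + d)

mu = 2.5  # arbitrary illustrative mean
for y in range(5):
    print(y, poisson_pmf(y, mu), poisson_pmf_expfam(y, mu))
```

The two columns agree up to floating-point rounding, which is all the identity claims.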

In this model, the coefficient $\beta_j$ represents the expected change in the logarithm of the mean for a one-unit change in the predictor $x_j$. Hence $\exp(\beta_j)$ represents a multiplicative effect: when $x_j$ increases by one unit, the mean of Y is multiplied by $\exp(\beta_j)$. In other words, $\exp(\beta_j)$ shows by what factor Y changes on average when $x_j$ increases by one unit.

Remark. Inequality between mean and variance in the Poisson model. For count data that follow a Poisson distribution we expect the variance to equal the mean. Often, however, because of heterogeneity, the variance is larger than the mean (overdispersion). In that case inference on the parameters is unreliable: the estimated variances and standard errors of the model parameters are too small, so the statistics used to test the null hypothesis $\beta = 0$ are misleading (too large). The remedy is either to adjust the estimated standard errors or to use an alternative distribution such as the negative binomial.

Example. A random sample was selected (a percentage is specified in Data/Select Cases/Random) from the file nrdaune. Task: build a Poisson or negative binomial model for the dependent variable Y, the number of claims. The example is solved in SPSS. We fitted a Poisson model for Y = nrdaune with the following regressors: engine capacity, number of seats, vehicle age, person type and manufacturer. The descriptive statistics below show that most policyholders are legal entities (68.7%) and own cars made by Romanian manufacturers (60.5%), with a mean engine capacity of 2294.148 and 4.73 seats on average.
Categorical Variable Information

Factor                            Level                                 N        Percent
tipul persoanei (person type)     persoana fizica (natural person)      3635     31.3%
                                  persoana juridica (legal entity)      7962     68.7%
                                  Total                                 11597    100.0%
Producator (manufacturer)         roman (Romanian)                      7014     60.5%
                                  strain (foreign)                      4583     39.5%
                                  Total                                 11597    100.0%

Continuous Variable Information

                       Variable                                   N        Minimum    Maximum     Mean         Std. Deviation
Dependent Variable     Numarul de daune (number of claims)        11597    .00        8.00        .2824        .71500
Covariate              Capacitatea cilindrica (engine capacity)   11597    2.00       28000.00    2294.1482    2447.68808
                       Numarul de locuri (number of seats)        11597    1.00       99.00       4.7329       3.88015
                       vechime (vehicle age)                      11597    .00        38.00       6.6202       3.93768

The Goodness of Fit table shows the discrepancy between the observed and the fitted values. The smaller the values, the better the model. The deviance is 11478.310, the Pearson chi-square statistic is 19878.194, and the final log-likelihood of the model is -8138.307. The deviance and the Pearson chi-square statistic should each be approximately equal to their degrees of freedom, so the ratio Value/df should be close to 1. For the deviance we have Value/df = 0.99, so from this point of view the Poisson model is adequate. The Pearson statistic, however, gives a ratio of about 1.7, which indicates overdispersion; from this point of view the discrepancy between the observed and the fitted values is large.

Goodness of Fit (b)

                                          Value         df       Value/df
Deviance                                  11478.310     11591    .990
Scaled Deviance                           11478.310     11591
Pearson Chi-Square                        19878.194     11591    1.715
Scaled Pearson Chi-Square                 19878.194     11591
Log Likelihood (a)                        -8138.307
Akaike's Information Criterion (AIC)      16288.614
Finite Sample Corrected AIC (AICC)        16288.622
Bayesian Information Criterion (BIC)      16332.765
Consistent AIC (CAIC)                     16338.765

Dependent Variable: Numarul de daune
Model: (Intercept), pers, marca, cap, locuri, vechime
a. The full log likelihood function is displayed and used in computing information criteria.
b. Information criteria are in small-is-better form.
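The figures in the Goodness of Fit table can be cross-checked with a few lines of arithmetic. The sketch below uses the values printed in the table. The number of estimated parameters, k = 6, is an inference from n - df = 11597 - 11591 rather than something the output states, and the standard-error inflation factor assumes the usual quasi-Poisson style correction by the square root of Pearson/df.

```python
import math

# Values copied from the Goodness of Fit table.
deviance = 11478.310
pearson_chi2 = 19878.194
df = 11591
log_lik = -8138.307

# Assumptions: n is the N reported in the descriptive tables,
# k is inferred as n - df (intercept plus five coefficients).
n = 11597
k = n - df

deviance_ratio = deviance / df        # close to 1 suggests an adequate fit
pearson_ratio = pearson_chi2 / df     # well above 1 suggests overdispersion

aic = -2 * log_lik + 2 * k
bic = -2 * log_lik + k * math.log(n)

# One common adjustment: multiply the standard errors by sqrt(Pearson / df).
se_inflation = math.sqrt(pearson_ratio)

print(round(deviance_ratio, 3), round(pearson_ratio, 3))
print(round(aic, 3), round(bic, 3))
print(round(se_inflation, 3))
```

The recomputed AIC and BIC agree with the table, which confirms that the log-likelihood is -8138.307.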

The Omnibus Test checks the significance of the model as a whole. The null hypothesis is H0: all coefficients of the explanatory variables are zero. In our case Sig. = 0.000 < 5%, so the null hypothesis is rejected. Consequently, at least one coefficient in the model is significantly different from zero.

Omnibus Test (a)

Likelihood Ratio Chi-Square    df    Sig.
630.261                        5     .000

a. Compares the fitted model against the intercept-only model.
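The p-value behind the Sig. column can be reproduced without a statistics library. The sketch below implements the chi-square upper-tail probability for integer degrees of freedom using the standard closed-form series (the helper `chi2_sf` is written for this illustration, not taken from SPSS) and applies it to the likelihood ratio statistic from the Omnibus Test table.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided normal tail plus a finite series in odd powers of sqrt(x).
    root = math.sqrt(x)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    tail = math.erfc(root / math.sqrt(2))
    return tail + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total

lr_chi2 = 630.261                  # Likelihood Ratio Chi-Square from the table
p_value = chi2_sf(lr_chi2, 5)      # 5 degrees of freedom
print(p_value)
```

The resulting probability is far below any conventional significance level, consistent with the reported Sig. of .000.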

The Wald tests in the parameter table show that all variables entered in the model are significant: for each of them Sig. < 0.05, so the null hypothesis that the coefficient is zero is rejected.

Parameter Estimates

                                          95% Wald Confidence Interval    Hypothesis Test                               95% Wald CI for Exp(B)
Parameter       B           Std. Error    Lower        Upper              Wald Chi-Square    df    Sig.    Exp(B)    Lower    Upper
(Intercept)     -.683       .0420         -.766        -.601              264.200            1     .000    .505      .465     .548
[pers=1.00]     .120        .0398         .043         .198               9.177              1     .002    1.128     1.043    1.219
[pers=2.00]     0 (a)       .             .            .                  .                  .     .       1         .        .
[marca=1.00]    -.736       .0366         -.808        -.664              404.520            1     .000    .479      .446     .515
[marca=2.00]    0 (a)       .             .            .                  .                  .     .       1         .        .
cap             1.462E-5    6.8709E-6     1.158E-6     2.809E-5           4.531              1     .033    1.000     1.000    1.000
locuri          .021        .0028         .016         .027               59.130             1     .000    1.022     1.016    1.027
vechime         -.061       .0053         -.071        -.051              133.482            1     .000    .941      .931     .951
(Scale)         1 (b)

Dependent Variable: Numarul de daune
Model: (Intercept), pers, marca, cap, locuri, vechime
a. Set to zero because this parameter is redundant.
b. Fixed at the displayed value.
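The Wald columns of the table follow directly from B and Std. Error. A minimal sketch, assuming the Wald chi-square is (B / Std. Error)^2 on one degree of freedom; the function name `wald_test` is illustrative. Because the table prints rounded values, the recomputed statistics differ slightly from the printed ones.

```python
import math

def wald_test(b, se):
    """Return the Wald chi-square (b/se)^2 and its p-value on 1 degree of freedom."""
    z = b / se
    chi2 = z * z
    # For 1 df, P(chi-square > z^2) equals the two-sided normal tail erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return chi2, p

# B and Std. Error as printed in the Parameter Estimates table (rounded values).
estimates = {
    "cap": (1.462e-5, 6.8709e-6),
    "locuri": (0.021, 0.0028),
    "vechime": (-0.061, 0.0053),
}
for name, (b, se) in estimates.items():
    chi2, p = wald_test(b, se)
    print(name, round(chi2, 2), round(p, 4))
```

For cap the recomputed statistic is about 4.53 with a p-value of about 0.033, matching the table; the other two rows differ more because B is printed with only three decimals.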

Interpretation of the coefficients:

For person type (natural person / legal entity), B > 0 and Exp(B) > 1, so the expected number of claims is higher for natural persons than for legal entities (the reference category, 2). The number of claims for natural persons is [Exp(B) - 1] * 100 = 12.8% higher than for legal entities.

For marca, B < 0 and Exp(B) < 1: the expected number of claims for Romanian cars (marca 1) is Exp(B) = 0.479 times that for foreign cars (marca 2), that is, 52.1% lower.

For engine capacity, B > 0 and Exp(B) > 1, so the relationship is positive: as engine capacity increases, the number of claims increases.

For the number of seats, B > 0 and Exp(B) > 1, so the relationship is positive: when the number of seats increases by 1, the number of claims increases on average by 2.2% (Exp(B) = 1.022).

For vehicle age, B < 0 and Exp(B) < 1, so the relationship is negative: when the age increases by 1 year, the number of claims decreases by 5.9% (Exp(B) = 0.941).
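The percentage effects follow mechanically from the Exp(B) column through [Exp(B) - 1] * 100. A minimal sketch using the values printed in the Parameter Estimates table; the function name `percent_change` is illustrative.

```python
def percent_change(exp_b):
    """Percent change in the expected count implied by a rate ratio Exp(B)."""
    return (exp_b - 1.0) * 100.0

# Exp(B) values as printed in the Parameter Estimates table.
exp_b = {"pers=1": 1.128, "marca=1": 0.479, "locuri": 1.022, "vechime": 0.941}
for name, value in exp_b.items():
    print(f"{name}: {percent_change(value):+.1f}%")
```

A positive sign means the expected number of claims rises with the regressor (or is higher than in the reference category); a negative sign means it falls.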

Statistical tests

Denote by $l(\theta)$ the log-likelihood function. The maximum likelihood estimator $\hat{\theta}$ of the m-dimensional vector of parameters $\theta = (\theta_1, \theta_2, \dots, \theta_m)$ is asymptotically unbiased and multivariate normally distributed, with variance-covariance matrix equal to the inverse of the Fisher information matrix, $I^{-1}(\theta)$. The Fisher information matrix, of dimension $m \times m$, equals minus the expected value of the second derivative of the log-likelihood function: $I(\theta) = -E\left[\partial^2 l(\theta) / \partial\theta\, \partial\theta'\right]$. The first derivative of the log-likelihood function, $u(\theta) = \partial l(\theta) / \partial\theta$, is called the score; the score is asymptotically normally distributed with mean zero and variance-covariance matrix $I(\theta)$. The asymptotic variances of the components of $\hat{\theta}$ are the diagonal elements of the matrix $I^{-1}(\theta)$, with the unknown parameters replaced by their estimates. The confidence interval for a scalar parameter has the usual form $\hat{\theta}_j \pm z_{1-\alpha/2} \sqrt{\widehat{var}(\hat{\theta}_j)}$, where $z_{1-\alpha/2}$ comes from the standard normal distribution.

For testing the joint null hypothesis on the vector of parameters $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$, the following tests can be used:

(a) the Wald test: $W = (\hat{\theta} - \theta_0)' I(\hat{\theta}) (\hat{\theta} - \theta_0)$ is asymptotically distributed as $\chi^2(m)$ under $H_0$;

(b) the likelihood ratio test: $-2 \log \lambda = -2[l(\text{reduced model}) - l(\text{full model})]$ is asymptotically distributed as $\chi^2(m)$ under $H_0$;

(c) the score test (or Rao test) is based on the first derivative of the log-likelihood function: $S = u(\theta_0)' I^{-1}(\theta_0)\, u(\theta_0)$ is asymptotically distributed as $\chi^2(m)$ under $H_0$.

Here $\hat{\theta}$ denotes the MLE. The null hypothesis is rejected for large values of the test statistics.

More generally, the likelihood ratio test statistic can be used to test a null hypothesis on a subset of the parameter vector against an alternative. In this case the test statistic is:

LR = -2 (difference between the maximum values of the log-likelihood function under $H_0$ and under the alternative $H_1$),

and the number of degrees of freedom of the $\chi^2$ distribution is the difference between the numbers of parameters estimated under $H_1$ and under $H_0$.

Two particular significance tests are of primary interest. One is the test of significance for the regression coefficient of a covariate $x_i$, with null hypothesis $H_0: b_i = 0$ (the dependent variable does not depend on the covariate $x_i$). The other is an overall significance test of the regression coefficients, $H_0: b_1 = b_2 = \dots = b_k = 0$ for all k covariates included in the model, which is a test of overall model adequacy; if $H_0$ is rejected, then the model explains a significant proportion of the variation in the dependent variable.
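The three tests can be compared on the simplest possible case. The sketch below assumes an i.i.d. Poisson sample with a single mean $\mu$ and tests $H_0: \mu = \mu_0$; here the Fisher information is $n/\mu$, so the three statistics have closed forms. The counts and the function names are invented for illustration.

```python
import math

def poisson_mean_tests(ys, mu0):
    """Wald, likelihood ratio and score statistics for H0: mu = mu0 (i.i.d. Poisson)."""
    n = len(ys)
    ybar = sum(ys) / n                       # maximum likelihood estimate of mu
    wald = n * (ybar - mu0) ** 2 / ybar      # information evaluated at the MLE: n / ybar
    score = n * (ybar - mu0) ** 2 / mu0      # information evaluated under H0: n / mu0
    lr = 2 * n * (ybar * math.log(ybar / mu0) - (ybar - mu0))
    return wald, lr, score

def chi2_1_pvalue(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical counts, for illustration only (the sample mean must be positive).
counts = [0, 1, 0, 2, 1, 0, 3, 1, 0, 2]
wald, lr, score = poisson_mean_tests(counts, mu0=0.5)
for name, stat in (("Wald", wald), ("LR", lr), ("score", score)):
    print(name, round(stat, 3), round(chi2_1_pvalue(stat), 4))
```

The three statistics are asymptotically equivalent but can differ noticeably in small samples, as they do here.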

Poisson Regression
http://www.unistat.com/guide/poisson-regression
The Poisson Regression procedure is suitable for models where the dependent variable is a frequency (count) variable consisting of nonnegative integers. The exponentials of the estimated regression coefficients are called Incidence Rate Ratios, which give the estimated rate at which events occur. This rate can be multiplied by an Exposure variable, which enters the model with a coefficient constrained to 1, to obtain the expected frequencies. Predictions (interpolations) and multicollinearity are handled as in other regression options (see, for instance, 7.2.6. Logistic Regression). Predicted cases are identified by an asterisk in the Actual and Fitted Values and Case (Diagnostic) Statistics output options (see 7.2.8.3. Poisson Regression Output Options). Note that the spreadsheet Reg function (see 3.4.2.6.3. UNISTAT Functions) will give the natural log of predicted values, which should be exponentiated to obtain the expected frequencies.

7.2.8.1. Poisson Regression Model Description


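Poisson Regression assumes that actual frequencies $y_i$ are drawn from a Poisson distribution with parameters $\mu_i$, $i = 1, \dots, n$. The associated probabilities are given as:

$P(Y_i = y_i) = \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!}$

where $\log \mu_i = x_i'\beta$, $i = 1, \dots, n$, is known as the loglinear model. The logarithm of the likelihood function is given as:

$\log L = \sum_{i=1}^{n} \left( y_i \log \mu_i - \mu_i - \log y_i! \right)$

and the first derivatives are:

$\frac{\partial \log L}{\partial \beta} = \sum_{i=1}^{n} (y_i - \mu_i)\, x_i$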

A Newton-Raphson type maximum likelihood algorithm is employed to minimise the negative of the log likelihood function. The nature of this method implies that a solution (convergence) cannot always be achieved. In such cases, you are advised to edit the convergence parameters provided and try again.
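A Newton-Raphson iteration for this model can be sketched in a few lines. The example below is not UNISTAT's implementation; it assumes a model with an intercept and a single regressor, $\log \mu_i = b_0 + b_1 x_i$, uses the score $\sum (y_i - \mu_i) x_i$ and the information matrix $X'WX$ with $W = \mathrm{diag}(\mu_i)$, and runs on invented data. The names `fit_poisson`, `tol` and `max_iter` are illustrative.

```python
import math

def fit_poisson(xs, ys, tol=1e-10, max_iter=100):
    """Fit log(mu_i) = b0 + b1 * x_i by Newton-Raphson; returns (b0, b1)."""
    b0 = math.log(sum(ys) / len(ys))   # start from the intercept-only fit
    b1 = 0.0
    for _ in range(max_iter):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Score vector: sum of (y_i - mu_i) * (1, x_i).
        g0 = sum(y - m for y, m in zip(ys, mus))
        g1 = sum((y - m) * x for x, y, m in zip(xs, ys, mus))
        # Information matrix X'WX with W = diag(mu_i).
        h00 = sum(mus)
        h01 = sum(m * x for x, m in zip(xs, mus))
        h11 = sum(m * x * x for x, m in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X'WX) * step = score.
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + s0, b1 + s1
        if max(abs(s0), abs(s1)) < tol:
            return b0, b1
    raise RuntimeError("Newton-Raphson did not converge")

# Hypothetical data: two groups coded x = 0 and x = 1.
x = [0, 0, 1, 1]
y = [2, 4, 5, 7]
b0, b1 = fit_poisson(x, y)
print(math.exp(b0), math.exp(b1))
```

With a single dummy regressor the fitted means must equal the two group means, which gives an easy check: exp(b0) is the mean of the x = 0 group and exp(b1) is the ratio of the two group means. The `tol` and `max_iter` arguments play the role of the convergence parameters mentioned above.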

7.2.8.2. Poisson Regression Variable Selection

As in other regression procedures, Poisson Regression can be used to estimate models with or without a constant term, with or without weights, and regressions can be run on a subset of cases as determined by the levels of an unlimited number of factor columns. An unlimited number of dependent variables can be selected, and interaction terms, dummy and lag/lead variables can be included in the model without having to create them as spreadsheet columns first (see 2.1.4. Creating Interaction, Dummy and Lag/Lead Variables). It is compulsory to select at least one numeric data column containing frequency counts as a dependent variable. When more than one dependent variable is selected, the analysis will be repeated as many times as the number of dependent variables, each time changing only the dependent variable and keeping the rest of the selections unchanged. A column containing numeric data can be selected as a weights column. Weights are frequency weights and all independent variables are multiplied by this column internally by the program. An intermediate inputs dialogue is displayed next.

Tolerance: This value is used to control the sensitivity of the nonlinear minimisation procedure employed. Under normal circumstances, you do not need to edit this value. If convergence cannot be achieved, then larger values of this parameter can be tried by removing one or more zeros.

Maximum Number of Iterations: When convergence cannot be achieved with the default value of 100 function evaluations, a higher value can be tried.

Omit Level: This field will appear only when one or more dummy variables have been selected in Variable Selection Dialogue. Three options are available: (0) do not omit any levels, (1) omit the first level and (2) omit the last level. When no levels are omitted, the model will usually be over-parameterised (see 2.1.4. Creating Interaction, Dummy and Lag/Lead Variables).

Exposure / Offset: This field will appear only if a column has been assigned the task of [Exposure] in Variable Selection Dialogue. When the value of this field is 0, the variable selected (E) will enter the model as Exposure: $\log \mu_i = \log E_i + x_i'\beta$, $i = 1, \dots, n$, and for any other value it will enter the model as Offset: $\log \mu_i = E_i + x_i'\beta$, $i = 1, \dots, n$.
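The difference between the two options can be made concrete with a small sketch. It assumes the conventional reading of the two forms: an Exposure variable E enters as log(E) with coefficient 1, while an Offset variable enters the linear predictor as it is. The function names and all numbers are hypothetical.

```python
import math

def expected_count_exposure(exposure, x, beta):
    """Exposure form: log(mu) = log(E) + x'beta, i.e. mu = E * exp(x'beta)."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    return exposure * math.exp(eta)

def expected_count_offset(offset, x, beta):
    """Offset form: log(mu) = offset + x'beta (coefficient of the offset fixed at 1)."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    return math.exp(offset + eta)

# Hypothetical values: 1200 subject-years, intercept plus one dummy regressor.
E = 1200.0
x = [1.0, 1.0]
beta = [-6.0, 0.3]

mu_exposure = expected_count_exposure(E, x, beta)
mu_offset = expected_count_offset(math.log(E), x, beta)
print(mu_exposure, mu_offset)
```

Supplying log(E) as an Offset gives the same expected count as supplying E as an Exposure, which is why the two specifications are interchangeable once the variable is put on the right scale.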

7.2.8.3. Poisson Regression Output Options

Regression Results: The main regression output displays a table of estimated coefficients for each category of the dependent variable, except for the base category. Standard errors, Wald statistics, probability values and confidence intervals are also displayed for the estimated regression coefficients.

Wald Statistic: This is defined as:

$W_i = \left( \hat{\beta}_i / \sigma_i \right)^2$

and has a chi-square distribution with one degree of freedom.

Confidence Intervals: The confidence intervals for regression coefficients are computed from:

$\hat{\beta}_i \pm z_{\alpha/2}\, \sigma_i$, $i = 1, \dots, k$

where k is the number of independent variables in the model and each coefficient's standard error, $\sigma_i$, is the square root of the diagonal element of the covariance matrix.

-2 Log-Likelihood Initial Model: This is the value when all independent variables are excluded from the model: $\beta_i = 0$, $i = 1, \dots, k$.

-2 Log-Likelihood Final Model: This is -2 times the value of the log likelihood function when convergence is achieved.

Pseudo R-squared: In Poisson Regression, an r-squared statistic as in the OLS regression is not available. This is because Poisson Regression employs an iterative maximum likelihood estimation method. Equivalent statistics to test the goodness of fit have been proposed using the initial (L0) and maximum (L1) likelihood values.

McFadden:

Adjusted McFadden:

Cox & Snell:

Nagelkerke:
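The formulas for these four statistics are not reproduced above. The sketch below uses the definitions most commonly cited in the literature, written in terms of the initial and final log-likelihoods; whether UNISTAT uses exactly these forms (in particular which parameter count enters the adjusted McFadden statistic) is an assumption. The inputs are taken from the SPSS example earlier in this document: the final log-likelihood is -8138.307 and the initial one is recovered from the likelihood ratio chi-square of 630.261.

```python
import math

def pseudo_r2(ll0, ll1, n, k):
    """Pseudo R-squared statistics from initial (ll0) and final (ll1) log-likelihoods."""
    mcfadden = 1 - ll1 / ll0
    adj_mcfadden = 1 - (ll1 - k) / ll0                    # k = number of regressors (assumed)
    cox_snell = 1 - math.exp(2 * (ll0 - ll1) / n)         # 1 - (L0 / L1)^(2/n)
    nagelkerke = cox_snell / (1 - math.exp(2 * ll0 / n))  # rescaled to a maximum of 1
    return mcfadden, adj_mcfadden, cox_snell, nagelkerke

ll1 = -8138.307            # final log-likelihood (Goodness of Fit table)
ll0 = ll1 - 630.261 / 2    # initial log-likelihood implied by the LR chi-square
mcfadden, adj_mcfadden, cox_snell, nagelkerke = pseudo_r2(ll0, ll1, n=11597, k=5)
print(round(mcfadden, 4), round(adj_mcfadden, 4), round(cox_snell, 4), round(nagelkerke, 4))
```

All four values are small here, which is common for count models fitted to individual-level data and does not by itself indicate a poor model.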

Likelihood Ratio: This is a test statistic for the null hypothesis that all regression coefficients for covariates are zero. It is equal to -2 times the difference between the initial and final model log-likelihood values and has a chi-square distribution with k degrees of freedom (the number of independent variables in the model).

Goodness of Fit: This is a test statistic to measure the goodness of fit of the expected counts on the observed counts. It has a chi-square distribution with n - k degrees of freedom (the number of valid cases minus the number of independent variables, including the constant term, if any).

Incidence Rate Ratio: Values of the incidence rate ratio indicate the influence of one unit change in a covariate on the regression: $IRR_i = \exp(\hat{\beta}_i)$, $i = 1, \dots, k$. The standard error of the incidence rate ratio is:

where $\sigma_i$ is the standard error of the ith independent variable for the jth category of the dependent variable. Coefficient confidence intervals are:

$\exp(\hat{\beta}_i \pm z_{\alpha/2}\, \sigma_i)$

which are simply the exponential of the coefficient confidence intervals.

Correlation Matrix of Regression Coefficients: This is a symmetric matrix with unity diagonal elements. The off-diagonal elements give correlations between the regression coefficients.

Covariance Matrix of Regression Coefficients: This is a symmetric matrix where diagonal elements are the square of parameter standard errors. The off-diagonal elements are covariances between the regression coefficients.
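The incidence rate ratio and its interval can be computed from a coefficient and its standard error as follows. The standard-error formula is not reproduced in the text above, so the sketch assumes the usual delta-method approximation, exp(b) times the standard error of b; that choice and the function name `irr_summary` are assumptions. The inputs are the [pers=1.00] row of the SPSS table earlier in this document.

```python
import math

def irr_summary(b, se, z=1.959964):
    """Incidence rate ratio with a delta-method standard error and a 95% Wald interval."""
    irr = math.exp(b)
    irr_se = irr * se                # delta method (assumed, not taken from the guide)
    lower = math.exp(b - z * se)     # exponential of the coefficient interval
    upper = math.exp(b + z * se)
    return irr, irr_se, lower, upper

# B = .120 and Std. Error = .0398 from the [pers=1.00] row of the SPSS output.
irr, irr_se, lower, upper = irr_summary(0.120, 0.0398)
print(round(irr, 3), round(irr_se, 4), round(lower, 3), round(upper, 3))
```

The interval is asymmetric around the ratio because it is built on the coefficient scale and then exponentiated; it agrees with the Exp(B) interval printed by SPSS for that row.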

Case (Diagnostic) Statistics: Case statistics are useful to determine the influence of individual observations on the overall fit of the model. For further information see 7.2.1.2.2. Linear Regression Further Output Options. Statistics available under this option are defined as follows.

Fitted Values: Also known as expected values: $\hat{\mu}_i$, $i = 1, \dots, n$

Standard Error of Fitted:

Confidence Intervals of Fitted:

Deviance:

Residuals:

Standardised Residuals:
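The formulas for these case statistics are not reproduced above. As a hedged illustration, the sketch below computes the two residual types most often reported for Poisson models, Pearson residuals and deviance residuals, using their textbook definitions; UNISTAT's exact definitions of Residuals and Standardised Residuals may differ. The function names and the numbers are hypothetical.

```python
import math

def pearson_residual(y, mu):
    """(observed - fitted) scaled by the Poisson standard deviation sqrt(mu)."""
    return (y - mu) / math.sqrt(mu)

def deviance_residual(y, mu):
    """Signed square root of the observation's contribution to the deviance."""
    term = y * math.log(y / mu) if y > 0 else 0.0   # y * log(y / mu) is 0 when y = 0
    d = 2 * (term - (y - mu))
    return math.copysign(math.sqrt(max(d, 0.0)), y - mu)

# Hypothetical observed counts and fitted means, for illustration only.
observed = [0, 1, 4, 2]
fitted = [2.0, 1.5, 2.5, 2.0]

total_deviance = sum(deviance_residual(y, m) ** 2 for y, m in zip(observed, fitted))
for y, m in zip(observed, fitted):
    print(y, m, round(pearson_residual(y, m), 3), round(deviance_residual(y, m), 3))
print(round(total_deviance, 3))
```

Summing the squared deviance residuals gives the model deviance, and summing the squared Pearson residuals gives the Pearson chi-square statistic.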

Plot of Actual and Fitted Values: Select this option to plot actual and fitted Y values against row numbers (index), residuals or against any independent variable. A further dialogue will enable you to choose the X-axis variable from a list.

By default, a line graph of the two series is plotted. However, since this procedure (like the plot of residuals) uses the X-Y Plots engine, it has almost all controls and options available for X-Y Plots, except for error bars and right Y-axes. The data points on the graph will also respond to the right mouse button in the way X-Y Plots does; the point is highlighted, a panel displays information about the point and in Stand-Alone Mode, the row of the spreadsheet containing the data point is also highlighted (a procedure which is also known as Brushing or Point identification). While the point is highlighted you can press <Delete> to omit the particular row containing the point. The entire Regression Analysis will be run again without the deleted row. If you want to restore the original regression, you will need to take one of the following two actions depending on the way you run UNISTAT:

1. In Stand-Alone Mode, go back to the Data Processor and delete or deactivate the Select Row column created by the program.

2. In Excel Add-In Mode, highlight a different block of data to remove the effect of the internal Select Row column.

Plot of Residuals: Residuals can be plotted against row numbers (index), fitted values or against any independent variable. A further dialogue will enable you to choose the X-axis variable from a list containing Row Numbers, Fitted Values and all independent variables. By default a scatter graph of residuals is plotted. For more information on available options see Plot of Actual and Fitted Values above.

7.2.8.4. Poisson Regression Examples


Example 1

Example 12.12 on p. 433, Armitage, P. & G. Berry (1994). The aim is to assess whether there is a significant difference in cancer risk between veterans and non-veterans. The servicemen are divided into 11 age groups and their experience is given in terms of subject-years. Open POISSON and select Statistics 1 → Regression Analysis → Poisson Regression. Select Status and Age group (L1 and L2) as [Dummy], Number of cancers (C3) as [Dependent] and Subject-years (C4) as [Exposure]. On the Step 2 dialogue enter 1 for Omit Level and leave other entries unchanged. Check all output options and click [Finish]. Some tables have been shortened to save space.
variable in the model as an Offset variable, which is equivalent to an Exposure variable.

Regression results show that the p-value of Status = Veteran variable is 0.9493. As this is much greater than 5%, we can conclude that there is no significant difference in cancer risk between veterans and non-veterans.

Other useful links: http://books.google.ro/books?id=tOeqO6Hs6gC&pg=PA183&lpg=PA183&dq=ordinal+regression+pdf+poisson+rate&source=bl&ots=a60Zh Duvi&sig=gl2C1DHBsiPEuE_DzGRGYwf50Qo&hl=ro&sa=X&ei=_wt4T46_L8jJtAaOoZXMBA&v ed=0CFwQ6AEwBjgo#v=onepage&q&f=false