Logistic Regression (part 3.)

Problems of Estimation (ctd.)

Applying Bias Reduced Regression to the Hormone Example

## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Coeff. p-value 95% Limits
(Intercept) 0.000000e+00 0.9995046 0 Inf
estrogen 0.000000e+00 0.9990517 0 Inf
androgen 6.657128e+43 0.9991320 0 Inf

M.L.E. fit, Fitted Values and Contour Plots of Probabilities

## 1 | 2: represents 0.12
##  leaf unit: 0.01
##             n: 26
##   (15)    0 | 000000000000000
##           1 | 
##           2 | 
##           3 | 
##           4 | 
##           5 | 
##           6 | 
##           7 | 
##           8 | 
##    11     9 | 9
##    10    10 | 10101010101010101010

## 1 | 2: represents 0.12
##  leaf unit: 0.01
##             n: 26
##   12     0 | 000000112336
##   (1)    1 | 4
##   (2)    2 | 56
##          3 | 
##          4 | 
##          5 | 
##   11     6 | 6
##          7 | 
##   10     8 | 7
##    9     9 | 234667789
Coeff. p-value 95% Limits
(Intercept) 0.0259852 0.2097630 0.0000866 7.7988764
estrogen 0.0277284 0.0167279 0.0014704 0.5229104
androgen 58.7449870 0.0119570 2.4520149 1407.4031647

Bias reduced fit, Fitted Values and Contour Plots of Probabilities

\(O(1/n) \rightarrow\)

\[ \downarrow \]

Interpreting log odds ratio output

\(\rightarrow\)

Goodness of fit

Pearson’s goodness of fit statistics

General Case

\[ X ^ { 2 } = \sum _ { i = 1 } ^ { n } \frac { \left( O _ { i } - E _ { i } \right) ^ { 2 } } { E _ { i } } \]

Replicated binary case ( \(Y_i \sim \bf{Binom}\left( n_i, p_i \right)\) )

\[ X ^ { 2 } = \sum _ { i = 1 } ^ { n } \frac { \left( y _ { i } - n _ { i } \hat { p } _ { i } \right) ^ { 2 } } { n _ { i } \hat { p } _ { i } \left( 1 - \hat { p } _ { i } \right) } \]

approximates the Residual Deviance

Pearson residual

\[ r _ { i } ^ { P } = \left( y _ { i } - n _ { i } \hat { p } _ { i } \right) / \sqrt { \operatorname { var } } \hat { y } _ { i } \]

\(\rightarrow\)

\[ X ^ { 2 } = \sum _ { i = 1 } ^ { n } \left( r _ { i } ^ { P } \right) ^ { 2 } \]

  • can also consider deviance residual

\[ r_{ i }^{D} = \sqrt { 2 \left[ y _ { i } \log \left( \frac { y _ { i } } { n \hat { p }_ { i } } \right) + \left( n _ { i } - y _ { i } \right) \log \left( \frac { n _ { i } - y _ { i } } { n _ { i } - n_i \hat {p} _ { i } } \right) \right] } \]

\[ Residual~~Deviance = \sum _ { i = 1 } ^ { n } \left( r _ { i } ^ { D } \right) ^ { 2 } \]

Measuring the predictive accuracy of the model - \(R^2\) analogues

Nagelkirke’s statistic

\[ R ^ { 2 } = \frac { 1 - \left( \hat { L } _ { 0 } / \hat { L } \right) ^ { 2 / n } } { 1 - \hat { L } _ { o } ^ { 2 / n } } = \frac { 1 - \exp \left( \left( D - D _ { \text {null} } \right) / n \right) } { 1 - \exp \left( - D _ { \text { null } } / n \right) } \]

\(\rightarrow\)

\[\downarrow\]

Prospective and Retrospective Designs

\(\rightarrow\)

More on prediction

Standard errors, confidence intervals for predictions

\(\rightarrow\)

ED50

\(\rightarrow\)

Overdispersion

Trout Eggs Example

## 
## Call:
## glm(formula = cbind(survive, total - survive) ~ location + period, 
##     family = binomial, data = troutegg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.8305  -0.3650  -0.0303   0.6191   3.2434  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.6358     0.2813  16.479  < 2e-16 ***
## location2    -0.4168     0.2461  -1.694   0.0903 .  
## location3    -1.2421     0.2194  -5.660 1.51e-08 ***
## location4    -0.9509     0.2288  -4.157 3.23e-05 ***
## location5    -4.6138     0.2502 -18.439  < 2e-16 ***
## period7      -2.1702     0.2384  -9.103  < 2e-16 ***
## period8      -2.3256     0.2429  -9.573  < 2e-16 ***
## period11     -2.4500     0.2341 -10.466  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1021.469  on 19  degrees of freedom
## Residual deviance:   64.495  on 12  degrees of freedom
## AIC: 157.03
## 
## Number of Fisher Scoring iterations: 5
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: cbind(survive, total - survive)
## 
## Terms added sequentially (first to last)
## 
## 
##                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                               19    1021.47              
## location         4   792.90        15     228.57 < 2.2e-16 ***
## period           3   164.08        12      64.50 < 2.2e-16 ***
## location:period 12    64.50         0       0.00 3.379e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The validity of the test for interaction depends on the assumption of independence

This assumption is questionable given the study design

  • Eggs in a given box may come from same cluster
  • There may be strong environmental effects that vary over time

Suppose that survival outcomes (0/1) of eggs within box have positive correlation \(\rho\) Can write variance of \(Y_i\) (the number of surviving eggs) as

\[ Var(Y_i) = m_i p_i ( 1-p_i ) \left\{ 1 + ( m_i - 1) \rho \right\} \]

where \(m_i\) is the number of eggs in box \(i\). The expression \(1 + (m_i -1 ) \rho\) gives the proportional increase in variance due to correlation. This expression is used as a sample size inflation factor in cluster-randomized trials.

Quasi-likelihood

A clever and easily implemented method to account for overdispersion is to extend the generalized linear model in an ad hoc fashion with a dispersion coefficient, which is analogous to the residual error variance \(\sigma_{\epsilon}^2\) in a linear model.

The overdispersion parameter is derived by dividing the Pearson goodness of fit statistic by its degrees of freedeom, i.e. \(\hat{\sigma}^2 = \frac{X^2}{residual~ d.f.}\) This estimate can be used to inflate the standard errors of the \(\beta\) estimates to allow for overdispersion.

sigma2 <- sum(residuals(lgtfit,type="pearson")^2)/12
lgtfit <- glm(cbind(survive,total-survive)   ~ location+period, family=binomial,troutegg)
summary(lgtfit,dispersion=sigma2)
## 
## Call:
## glm(formula = cbind(survive, total - survive) ~ location + period, 
##     family = binomial, data = troutegg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.8305  -0.3650  -0.0303   0.6191   3.2434  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.6358     0.6495   7.138 9.49e-13 ***
## location2    -0.4168     0.5682  -0.734   0.4632    
## location3    -1.2421     0.5066  -2.452   0.0142 *  
## location4    -0.9509     0.5281  -1.800   0.0718 .  
## location5    -4.6138     0.5777  -7.987 1.39e-15 ***
## period7      -2.1702     0.5504  -3.943 8.05e-05 ***
## period8      -2.3256     0.5609  -4.146 3.38e-05 ***
## period11     -2.4500     0.5405  -4.533 5.82e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 5.330322)
## 
##     Null deviance: 1021.469  on 19  degrees of freedom
## Residual deviance:   64.495  on 12  degrees of freedom
## AIC: 157.03
## 
## Number of Fisher Scoring iterations: 5

The same result can be obtained automatically by using the quasibinomial link.

## 
## Call:
## glm(formula = cbind(survive, total - survive) ~ location + period, 
##     family = quasibinomial, data = troutegg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.8305  -0.3650  -0.0303   0.6191   3.2434  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.6358     0.6495   7.138 1.18e-05 ***
## location2    -0.4168     0.5682  -0.734 0.477315    
## location3    -1.2421     0.5066  -2.452 0.030501 *  
## location4    -0.9509     0.5281  -1.800 0.096970 .  
## location5    -4.6138     0.5777  -7.987 3.82e-06 ***
## period7      -2.1702     0.5504  -3.943 0.001953 ** 
## period8      -2.3256     0.5609  -4.146 0.001356 ** 
## period11     -2.4500     0.5405  -4.533 0.000686 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 5.330358)
## 
##     Null deviance: 1021.469  on 19  degrees of freedom
## Residual deviance:   64.495  on 12  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

Note that the p-values in the above analysis are derived from the t distribution

Random Effects Models

While the quasi-likelihood provides a convenient adjustment for overdispersion, modern computing facilities permit a more theoretically satisfactory analysis using a *random effects$ approach.

The fixed effect model takes \(\eta_i\) to be the non-random quantity \(\mathbf{x}_i \mathbf{\beta}\). Adding an additional random component \(\epsilon_i \sim N(0,\sigma^2)\), i.e. \(\eta_i = \mathbf{x}_i \mathbf{\beta} + \epsilon_i\) also gives rise to an \(overdispersion\) model, although there is no closed form expression for the variance due to the non-linear relation between \(\eta_i\) and \(p_i\).

Such models are implemented in the \(lme4\) package, as follows

troutegg$box <- with(troutegg, factor( paste(location,period,sep=":")))
mefit <- glmer(cbind(survive, total-survive) ~ location + period + (1 | box), 
             family=binomial, data=troutegg)
summary(mefit,correlation=FALSE)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(survive, total - survive) ~ location + period + (1 | box)
##    Data: troutegg
## 
##      AIC      BIC   logLik deviance df.resid 
##      148      157      -65      130       11 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.07855 -0.16858  0.01644  0.26816  1.31293 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  box    (Intercept) 0.2195   0.4685  
## Number of obs: 20, groups:  box, 20
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.4483     0.4385  10.143  < 2e-16 ***
## location2    -0.2708     0.4366  -0.620  0.53505    
## location3    -1.1040     0.4174  -2.645  0.00817 ** 
## location4    -0.7207     0.4368  -1.650  0.09896 .  
## location5    -4.6067     0.4296 -10.724  < 2e-16 ***
## period7      -2.0500     0.4087  -5.016 5.27e-07 ***
## period8      -2.1519     0.4058  -5.303 1.14e-07 ***
## period11     -2.5133     0.4145  -6.063 1.33e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1