Logistic Regression (part 3.)

Problems of Estimation (ctd.)

Applying Bias Reduced Regression to the Hormone Example

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

	Coeff.	p-value
(Intercept)	0.000000e+00	0.9995046	Inf
estrogen	0.000000e+00	0.9990517	Inf
androgen	6.657128e+43	0.9991320	Inf

M.L.E. fit, Fitted Values and Contour Plots of Probabilities

## 1 | 2: represents 0.12
##  leaf unit: 0.01
##             n: 26
##   (15)    0 | 000000000000000
##           1 | 
##           2 | 
##           3 | 
##           4 | 
##           5 | 
##           6 | 
##           7 | 
##           8 | 
##    11     9 | 9
##    10    10 | 10101010101010101010

## 1 | 2: represents 0.12
##  leaf unit: 0.01
##             n: 26
##   12     0 | 000000112336
##   (1)    1 | 4
##   (2)    2 | 56
##          3 | 
##          4 | 
##          5 | 
##   11     6 | 6
##          7 | 
##   10     8 | 7
##    9     9 | 234667789

	Coeff.	p-value	95% Limits
(Intercept)	0.0259852	0.2097630	0.0000866	7.7988764
estrogen	0.0277284	0.0167279	0.0014704	0.5229104
androgen	58.7449870	0.0119570	2.4520149	1407.4031647

Bias reduced fit, Fitted Values and Contour Plots of Probabilities

$O(1/n) \rightarrow$

\[ \downarrow \]

Interpreting log odds ratio output

$\rightarrow$

Goodness of fit

Pearson’s goodness of fit statistics

General Case

\[ X ^ { 2 } = \sum _ { i = 1 } ^ { n } \frac { \left( O _ { i } - E _ { i } \right) ^ { 2 } } { E _ { i } } \]

Replicated binary case ( $Y_i \sim \bf{Binom}\left( n_i, p_i \right)$ )

\[ X ^ { 2 } = \sum _ { i = 1 } ^ { n } \frac { \left( y _ { i } - n _ { i } \hat { p } _ { i } \right) ^ { 2 } } { n _ { i } \hat { p } _ { i } \left( 1 - \hat { p } _ { i } \right) } \]

approximates the Residual Deviance

Pearson residual

\[ r _ { i } ^ { P } = \left( y _ { i } - n _ { i } \hat { p } _ { i } \right) / \sqrt { \operatorname { var } } \hat { y } _ { i } \]

$\rightarrow$

\[ X ^ { 2 } = \sum _ { i = 1 } ^ { n } \left( r _ { i } ^ { P } \right) ^ { 2 } \]

can also consider deviance residual

\[ r_{ i }^{D} = \sqrt { 2 \left[ y _ { i } \log \left( \frac { y _ { i } } { n \hat { p }_ { i } } \right) + \left( n _ { i } - y _ { i } \right) \log \left( \frac { n _ { i } - y _ { i } } { n _ { i } - n_i \hat {p} _ { i } } \right) \right] } \]

\[ Residual~~Deviance = \sum _ { i = 1 } ^ { n } \left( r _ { i } ^ { D } \right) ^ { 2 } \]

Measuring the predictive accuracy of the model - $R^2$ analogues

Nagelkirke’s statistic

\[ R ^ { 2 } = \frac { 1 - \left( \hat { L } _ { 0 } / \hat { L } \right) ^ { 2 / n } } { 1 - \hat { L } _ { o } ^ { 2 / n } } = \frac { 1 - \exp \left( \left( D - D _ { \text {null} } \right) / n \right) } { 1 - \exp \left( - D _ { \text { null } } / n \right) } \]

$\rightarrow$

\[\downarrow\]

Prospective and Retrospective Designs

$\rightarrow$

Weaknesses of a Case Control Design

Choice of Link Function

$\rightarrow$

Relative risk regression using the log link

When outcomes are not common, but not so uncommon as to justify a relative risk interpretation for odds ratios, and alternative is to use the log link, $\eta_i = log(p_i)$.

Overdispersion

Trout Eggs Example

## 
## Call:
## glm(formula = cbind(survive, total - survive) ~ location + period, 
##     family = binomial, data = troutegg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.8305  -0.3650  -0.0303   0.6191   3.2434  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.6358     0.2813  16.479  < 2e-16 ***
## location2    -0.4168     0.2461  -1.694   0.0903 .  
## location3    -1.2421     0.2194  -5.660 1.51e-08 ***
## location4    -0.9509     0.2288  -4.157 3.23e-05 ***
## location5    -4.6138     0.2502 -18.439  < 2e-16 ***
## period7      -2.1702     0.2384  -9.103  < 2e-16 ***
## period8      -2.3256     0.2429  -9.573  < 2e-16 ***
## period11     -2.4500     0.2341 -10.466  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1021.469  on 19  degrees of freedom
## Residual deviance:   64.495  on 12  degrees of freedom
## AIC: 157.03
## 
## Number of Fisher Scoring iterations: 5

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: cbind(survive, total - survive)
## 
## Terms added sequentially (first to last)
## 
## 
##                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                               19    1021.47              
## location         4   792.90        15     228.57 < 2.2e-16 ***
## period           3   164.08        12      64.50 < 2.2e-16 ***
## location:period 12    64.50         0       0.00 3.379e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The validity of the test for interaction depends on the assumption of independence

This assumption is questionable given the study design

Eggs in a given box may come from same cluster
There may be strong environmental effects that vary over time

Suppose that survival outcomes (0/1) of eggs within box have positive correlation $\rho$ Can write variance of $Y_i$ (the number of surviving eggs) as

\[ Var(Y_i) = m_i p_i ( 1-p_i ) \left\{ 1 + ( m_i - 1) \rho \right\} \]

where $m_i$ is the number of eggs in box $i$. The expression $1 + (m_i -1 ) \rho$ gives the proportional increase in variance due to correlation. This expression is used as a sample size inflation factor in cluster-randomized trials.

Quasi-likelihood

A clever and easily implemented method to account for overdispersion is to extend the generalized linear model in an ad hoc fashion with a dispersion coefficient, which is analogous to the residual error variance $\sigma_{\epsilon}^2$ in a linear model.

The overdispersion parameter is derived by dividing the Pearson goodness of fit statistic by its degrees of freedeom, i.e. $\hat{\sigma}^2 = \frac{X^2}{residual~ d.f.}$ This estimate can be used to inflate the standard errors of the $\beta$ estimates to allow for overdispersion.

sigma2 <- sum(residuals(lgtfit,type="pearson")^2)/12
lgtfit <- glm(cbind(survive,total-survive)   ~ location+period, family=binomial,troutegg)
summary(lgtfit,dispersion=sigma2)

## 
## Call:
## glm(formula = cbind(survive, total - survive) ~ location + period, 
##     family = binomial, data = troutegg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.8305  -0.3650  -0.0303   0.6191   3.2434  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.6358     0.6495   7.138 9.49e-13 ***
## location2    -0.4168     0.5682  -0.734   0.4632    
## location3    -1.2421     0.5066  -2.452   0.0142 *  
## location4    -0.9509     0.5281  -1.800   0.0718 .  
## location5    -4.6138     0.5777  -7.987 1.39e-15 ***
## period7      -2.1702     0.5504  -3.943 8.05e-05 ***
## period8      -2.3256     0.5609  -4.146 3.38e-05 ***
## period11     -2.4500     0.5405  -4.533 5.82e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 5.330322)
## 
##     Null deviance: 1021.469  on 19  degrees of freedom
## Residual deviance:   64.495  on 12  degrees of freedom
## AIC: 157.03
## 
## Number of Fisher Scoring iterations: 5

The same result can be obtained automatically by using the quasibinomial link.

## 
## Call:
## glm(formula = cbind(survive, total - survive) ~ location + period, 
##     family = quasibinomial, data = troutegg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.8305  -0.3650  -0.0303   0.6191   3.2434  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.6358     0.6495   7.138 1.18e-05 ***
## location2    -0.4168     0.5682  -0.734 0.477315    
## location3    -1.2421     0.5066  -2.452 0.030501 *  
## location4    -0.9509     0.5281  -1.800 0.096970 .  
## location5    -4.6138     0.5777  -7.987 3.82e-06 ***
## period7      -2.1702     0.5504  -3.943 0.001953 ** 
## period8      -2.3256     0.5609  -4.146 0.001356 ** 
## period11     -2.4500     0.5405  -4.533 0.000686 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 5.330358)
## 
##     Null deviance: 1021.469  on 19  degrees of freedom
## Residual deviance:   64.495  on 12  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

Note that the p-values in the above analysis are derived from the t distribution

Random Effects Models

While the quasi-likelihood provides a convenient adjustment for overdispersion, modern computing facilities permit a more theoretically satisfactory analysis using a *random effects$ approach.

The fixed effect model takes $\eta_i$ to be the non-random quantity $\mathbf{x}_i \mathbf{\beta}$. Adding an additional random component $\epsilon_i \sim N(0,\sigma^2)$, i.e. $\eta_i = \mathbf{x}_i \mathbf{\beta} + \epsilon_i$ also gives rise to an $overdispersion$ model, although there is no closed form expression for the variance due to the non-linear relation between $\eta_i$ and $p_i$.

Such models are implemented in the $lme4$ package, as follows

troutegg$box <- with(troutegg, factor( paste(location,period,sep=":")))
mefit <- glmer(cbind(survive, total-survive) ~ location + period + (1 | box), 
             family=binomial, data=troutegg)
summary(mefit,correlation=FALSE)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(survive, total - survive) ~ location + period + (1 | box)
##    Data: troutegg
## 
##      AIC      BIC   logLik deviance df.resid 
##      148      157      -65      130       11 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.07855 -0.16858  0.01644  0.26816  1.31293 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  box    (Intercept) 0.2195   0.4685  
## Number of obs: 20, groups:  box, 20
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.4483     0.4385  10.143  < 2e-16 ***
## location2    -0.2708     0.4366  -0.620  0.53505    
## location3    -1.1040     0.4174  -2.645  0.00817 ** 
## location4    -0.7207     0.4368  -1.650  0.09896 .  
## location5    -4.6067     0.4296 -10.724  < 2e-16 ***
## period7      -2.0500     0.4087  -5.016 5.27e-07 ***
## period8      -2.1519     0.4058  -5.303 1.14e-07 ***
## period11     -2.5133     0.4145  -6.063 1.33e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Logistic Regression (part 3.)

Problems of Estimation (ctd.)

Applying Bias Reduced Regression to the Hormone Example

M.L.E. fit, Fitted Values and Contour Plots of Probabilities

Bias reduced fit, Fitted Values and Contour Plots of Probabilities

Interpreting log odds ratio output

Goodness of fit

Pearson’s goodness of fit statistics

Pearson residual

Measuring the predictive accuracy of the model - \(R^2\) analogues

Prospective and Retrospective Designs

Weaknesses of a Case Control Design

Choice of Link Function

Relative risk regression using the log link

More on prediction

Standard errors, confidence intervals for predictions

ED50

Overdispersion

Trout Eggs Example

Quasi-likelihood

Random Effects Models