Generalized Estimating Equations (GEE)
Assume we have K groups of binary observations:
- \(\mathbf{Y}_i\) representing \(n_i\) binary observations, \(i=1, \ldots , K\)
- As usual \(E \left( \mathbf{Y}_i \right) = \mathbf{p}_i\)
- \(logit\left(\mathbf{p}_i \right) = \mathbf{\eta}_i = \mathbf{X}_i \mathbf{\beta}\).
Under assumptions of independence, score equation takes the form
\[ \sum _ { i = 1 }^K \left( \frac { \partial {p} _ { i } } { \partial \beta } \right) ^ { T } \operatorname { var } \left( Y _ { i } \right) ^ { - 1 } \left( Y _ { i } - p _ { i } \right) = 0\]
where \(var(Y_i)\) is a diagonal matrix which depends on \(\eta_i\).
- also covers case of correlation within the elements of \(\mathbf{Y}_i\) .
Quasi-likelihood approach
- assume correlation matrix \(\mathbf{R}_i\) to be a function of a parameter \(\alpha\)
- based on (ad hoc?) estimate \(\alpha\) can estimate of \(\beta\)
- if \(\mathbf{R}(\alpha)\) correctly specified then have usual variance for \(\hat{\beta}\)
\[\operatorname { var } ( \hat { \beta } ) = \left( X ^ { T } W X \right) ^ { - 1 }\] where \(W = Var(\mathbf{Y})\), (diagonal) variance matrix of \(Y\).
- typically not sure of correct form for \(\mathbf{R}\)
- mathematically feasible correlation matrices generally depend on \(\eta\)’s
- for that reason, refer to \(\mathbf{R}\) as working correlation structure - i.e. we know it’s not correct, but perhaps good as working assumption
If we have not correctly specified \(R\), we have the following asymptotic variance
\[Var \{ \hat{\mathbf{\beta}} \} = \lim _ { K \rightarrow \infty } K \left( \sum _ { i = 1 } ^ { K } D _ { i } ^ { \mathrm { T } } V _ { i } ^ { - 1 } D _ { i } \right) ^ { - 1 } \left\{ \sum _ { i = 1 } ^ { K } D _ { i } ^ { \mathrm { T } } V _ { i } ^ { - 1 } \operatorname { cov } \left( Y _ { i } \right) V _ { i } ^ { - 1 } D _ { i } \right\} \left( \sum _ { i = 1 } ^ { K } D _ { i } ^ { \mathrm { T } } V _ { i } ^ { - 1 } D _ { i } \right) ^ { - 1 }\]
where \(cov \left( Y_i \right)\) is the true covariance.
\[( \mathbf{Y}_i - \hat{\mathbf{p}_i} ) ( \mathbf{Y}_i - \hat{\mathbf{p}_i})^T\]
- with above estimate we obtain robust estimate for the variance-covariance matrix of \(\hat{\mathbf{\beta}}\)
- structure of expression above leads to terminology sandwich estimator.
Example (Faraway)
Experiment to study the effects of surface and vision on balance.
- 40 subjects (20 each M/F)
- two experimental factors - standing surface (normal or foam) - vision condition (eyes closed/open, or eyes open with dome over head)
- outcome = experimenter assessment of stability of subject
- dichotomized for illustration purposes
- within subject replication - two times
- 12 observations per subject
preliminary inspection
## stable age
## Unstable:366(76.2%) Mean( SD , N )
## Stable :114(23.8%) Unstable 26.8(6.35,366)
## Stable 26.9(6.41,114)
## s=6.4,r2=0.00011
##
## sex
## Unstable Stable
## female 193/240(80.4%) 47/240(19.6%)
## male 173/240(72.1%) 67/240(27.9%)
##
##
## surface
## Unstable Stable
## foam 230/240(95.8%) 10/240( 4.2%)
## norm 136/240(56.7%) 104/240(43.3%)
##
##
## vision
## Unstable Stable
## closed 143/160(89.4%) 17/160(10.6%)
## dome 138/160(86.2%) 22/160(13.8%)
## open 85/160(53.1%) 75/160(46.9%)
##
Ordinary logistic regression
## Coeff. p-value 95% Limits
## (Intercept) -5.8000 3.8e-12 -7.500 -4.200
## age 0.0082 7.2e-01 -0.037 0.053
## sexmale 0.8400 4.9e-03 0.260 1.400
## surfacenorm 3.8000 7.6e-19 2.900 4.600
## visiondome 0.3400 3.6e-01 -0.390 1.100
## visionopen 3.0000 2.8e-14 2.200 3.800
GEE with working independence assumption (i.e. R= diagonal, 1’s)
geefit1 <- geeglm( I(stable=="Stable") ~ age + sex + surface + vision, id=subject, family=binomial,data=ctsib)
summary(geefit1)
##
## Call:
## geeglm(formula = I(stable == "Stable") ~ age + sex + surface +
## vision, family = binomial, data = ctsib, id = subject)
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -5.820702 1.353114 18.505 1.69e-05 ***
## age 0.008244 0.040074 0.042 0.837
## sexmale 0.840758 0.568497 2.187 0.139
## surfacenorm 3.780236 0.478625 62.380 2.89e-15 ***
## visiondome 0.344369 0.389482 0.782 0.377
## visionopen 3.023124 0.422501 51.198 8.35e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 0.7156 0.2925
##
## Correlation: Structure = independenceNumber of clusters: 40 Maximum cluster size: 12
GEE, equi-correlation structure
geefit2 <- geeglm( I(stable=="Stable") ~ age + sex + surface + vision, id=subject, family=binomial,data=ctsib,corstr="exchangeable")
summary(geefit2)
##
## Call:
## geeglm(formula = I(stable == "Stable") ~ age + sex + surface +
## vision, family = binomial, data = ctsib, id = subject, corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) -5.462658 1.467088 13.86 0.0002 ***
## age 0.000472 0.044402 0.00 0.9915
## sexmale 0.895973 0.607566 2.17 0.1403
## surfacenorm 3.698260 0.488029 57.43 3.5e-14 ***
## visiondome 0.336671 0.369896 0.83 0.3627
## visionopen 2.992531 0.422202 50.24 1.4e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 0.691 0.276
##
## Correlation: Structure = exchangeable Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.226 0.112
## Number of clusters: 40 Maximum cluster size: 12
Mixed effects model, random subject term
mefit <- glmer( I(stable=="Stable") ~ age + sex + surface + vision + (1 | subject),
family=binomial, data=ctsib)
summary(mefit)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula:
## I(stable == "Stable") ~ age + sex + surface + vision + (1 | subject)
## Data: ctsib
##
## AIC BIC logLik deviance df.resid
## 248 277 -117 234 473
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.771 -0.140 -0.016 0.000 5.532
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 10.2 3.19
## Number of obs: 480, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.7617 3.0328 -3.88 0.00011 ***
## age -0.0026 0.0858 -0.03 0.97580
## sexmale 1.8474 1.1197 1.65 0.09896 .
## surfacenorm 7.7872 1.2555 6.20 5.6e-10 ***
## visiondome 0.6889 0.5337 1.29 0.19672
## visionopen 6.5460 1.1640 5.62 1.9e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) age sexmal srfcnr visndm
## age -0.744
## sexmale -0.305 -0.014
## surfacenorm -0.586 -0.018 0.216
## visiondome -0.167 0.001 0.033 0.116
## visionopen -0.578 -0.018 0.216 0.874 0.345
We see a marked difference between GEE and mixed effects approach
Collapsibility
Relationships between three variables \(X\), \(Y\), and \(Z\) are said to be collapsible if the joint distribution between \(X\), \(Y\), and \(Z\) can be accurately inferred from the three pair-wise marginal distributions
A complicating feature of the logistic regression model and other models based on log odds is non-collapsibility. A simple but somewhat extreme example is illustrated below.
binary response \(y\) with two binary predictors \(x\) and \(z\)
print(tb)
## , , z = 0
##
## y
## x 0 1
## 0 24 1
## 1 23 2
##
## , , z = 1
##
## y
## x 0 1
## 0 2 23
## 1 1 24
fit <- glm( y ~ x + z, family=binomial,data=df)
coefTable(fit,trans=exp)
## Coeff. p-value 95% Limits
## (Intercept) 0.0417 1.45e-04 0.00809 0.215
## x 2.0870 4.09e-01 0.36452 11.948
## z 276.0000 2.73e-10 48.20741 1580.172
fitc <- glm( y ~ x , family=binomial,data=df)
coefTable(fitc,trans=exp)
## Coeff. p-value 95% Limits
## (Intercept) 0.923 0.777 0.530 1.61
## x 1.174 0.689 0.536 2.57
Knee jerk explanation of difference is confounding between \(x\) and \(z\) - below shows isn’t so.
fitz <- glm( x ~ z , family=binomial,data=df)
coefTable(fitz,trans=exp)
## Coeff. p-value 95% Limits
## (Intercept) 1 1 0.574 1.74
## z 1 1 0.457 2.19
A more thoughtful perspective is examination of odds ratios