Analysis of Correlated Data

Generalized Estimating Equations (GEE)

Assume we have K groups of binary observations:

  • \(\mathbf{Y}_i\) representing \(n_i\) binary observations, \(i=1, \ldots , K\)
  • As usual \(E \left( \mathbf{Y}_i \right) = \mathbf{p}_i\)
  • \(logit\left(\mathbf{p}_i \right) = \mathbf{\eta}_i = \mathbf{X}_i \mathbf{\beta}\).

Under the assumption of independence, the score equation takes the form

\[ \sum_{i=1}^{K} \left( \frac{\partial \mathbf{p}_i}{\partial \boldsymbol{\beta}} \right)^{T} \operatorname{var}\left( \mathbf{Y}_i \right)^{-1} \left( \mathbf{Y}_i - \mathbf{p}_i \right) = 0\]

where \(\operatorname{var}(\mathbf{Y}_i)\) is a diagonal matrix which depends on \(\boldsymbol{\eta}_i\).
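For the canonical logit link this reduces to a familiar form: since \(\partial p_{ij} / \partial \boldsymbol{\beta} = p_{ij}\left(1 - p_{ij}\right) \mathbf{x}_{ij}\) and \(\operatorname{var}(Y_{ij}) = p_{ij}\left(1 - p_{ij}\right)\), the weights cancel and the score equation becomes

\[ \sum_{i=1}^{K} \mathbf{X}_i^T \left( \mathbf{Y}_i - \mathbf{p}_i \right) = 0 ,\]

the ordinary logistic regression score.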

  • the same estimating-equation form also covers the case of correlation among the elements of \(\mathbf{Y}_i\)

Quasi-likelihood approach

  • assume the correlation matrix \(\mathbf{R}_i\) to be a function of a parameter \(\alpha\)
  • based on an (ad hoc?) estimate of \(\alpha\), we can estimate \(\beta\)
  • if \(\mathbf{R}(\alpha)\) is correctly specified, then we have the usual variance for \(\hat{\beta}\)

\[\operatorname{var}( \hat{\beta} ) = \left( X^T W X \right)^{-1}\] where \(W = \operatorname{var}(\mathbf{Y})\), the (diagonal) variance matrix of \(\mathbf{Y}\).

  • typically we are not sure of the correct form for \(\mathbf{R}\)
  • mathematically feasible correlation matrices generally depend on the \(\eta\)’s
  • for that reason, we refer to \(\mathbf{R}\) as a working correlation structure - i.e. we know it’s not correct, but it is perhaps good as a working assumption (a simple example is shown below)
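For example, the exchangeable (equi-correlation) structure used in the example below has a single parameter \(\alpha\):

\[ \mathbf{R}_i(\alpha) = \begin{pmatrix} 1 & \alpha & \cdots & \alpha \\ \alpha & 1 & \cdots & \alpha \\ \vdots & \vdots & \ddots & \vdots \\ \alpha & \alpha & \cdots & 1 \end{pmatrix} \]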

If we have not correctly specified \(\mathbf{R}\), we have the following asymptotic variance

\[Var \{ \hat{\mathbf{\beta}} \} = \lim _ { K \rightarrow \infty } K \left( \sum _ { i = 1 } ^ { K } D _ { i } ^ { \mathrm { T } } V _ { i } ^ { - 1 } D _ { i } \right) ^ { - 1 } \left\{ \sum _ { i = 1 } ^ { K } D _ { i } ^ { \mathrm { T } } V _ { i } ^ { - 1 } \operatorname { cov } \left( Y _ { i } \right) V _ { i } ^ { - 1 } D _ { i } \right\} \left( \sum _ { i = 1 } ^ { K } D _ { i } ^ { \mathrm { T } } V _ { i } ^ { - 1 } D _ { i } \right) ^ { - 1 }\]

where \(D_i = \partial \mathbf{p}_i / \partial \boldsymbol{\beta}\), \(V_i\) is the working covariance matrix implied by \(\mathbf{R}_i(\alpha)\), and \(\operatorname{cov}\left( \mathbf{Y}_i \right)\) is the true covariance.

  • may be estimated by

\[\left( \mathbf{Y}_i - \hat{\mathbf{p}}_i \right) \left( \mathbf{Y}_i - \hat{\mathbf{p}}_i \right)^T\]

  • with the above estimate we obtain a robust estimate of the variance-covariance matrix of \(\hat{\mathbf{\beta}}\)
  • the structure of the expression above - a “meat” of score covariances between two “bread” slices - leads to the terminology sandwich estimator (see the sketch below)
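To make the construction concrete, here is a minimal R sketch of the sandwich variance for a logistic model under working independence. The data frame dat with outcome y, predictors x1 and x2, and cluster identifier id is a hypothetical placeholder, not the example data below; up to finite-sample details, this is the type of robust standard error that geeglm reports.

## minimal sketch -- dat, y, x1, x2, id are hypothetical placeholders
fit <- glm(y ~ x1 + x2, family = binomial, data = dat)
X <- model.matrix(fit)
p <- fitted(fit)
w <- p * (1 - p)                          # var(Y_ij) under the binomial model
r <- residuals(fit, type = "response")    # Y - p
Uc <- rowsum(X * r, dat$id)               # cluster-level score contributions U_i
bread <- solve(crossprod(sqrt(w) * X))    # (X^T W X)^{-1}
meat <- crossprod(Uc)                     # sum_i U_i U_i^T
V.robust <- bread %*% meat %*% bread      # the sandwich
sqrt(diag(V.robust))                      # robust standard errors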

Example (Faraway)

Experiment to study the effects of surface and vision on balance.

  • 40 subjects (20 each M/F)
  • two experimental factors:
    • standing surface (normal or foam)
    • vision condition (eyes closed, eyes open, or eyes open with a dome over the head)
  • outcome = experimenter’s assessment of the stability of the subject
    • dichotomized for illustration purposes
  • within-subject replication - each surface/vision combination observed twice
    • 12 observations per subject

Preliminary inspection

##  stable                 age                       
##  Unstable:366(76.2%)             Mean( SD , N )   
##  Stable  :114(23.8%)    Unstable 26.8(6.35,366)   
##                         Stable   26.9(6.41,114)   
##                                s=6.4,r2=0.00011   
##                                                   
##  sex                                    
##            Unstable        Stable       
##  female 193/240(80.4%)  47/240(19.6%)   
##  male   173/240(72.1%)  67/240(27.9%)   
##                                         
##                                         
##  surface                              
##          Unstable        Stable       
##  foam 230/240(95.8%)  10/240( 4.2%)   
##  norm 136/240(56.7%) 104/240(43.3%)   
##                                       
##                                       
##  vision                                 
##            Unstable        Stable       
##  closed 143/160(89.4%)  17/160(10.6%)   
##  dome   138/160(86.2%)  22/160(13.8%)   
##  open    85/160(53.1%)  75/160(46.9%)   
## 

Ordinary logistic regression
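The fitting call is not shown in the notes; a minimal sketch (coefTable is the course helper also used in the collapsibility example below, and the object name glmfit is assumed):

glmfit <- glm(I(stable == "Stable") ~ age + sex + surface + vision,
              family = binomial, data = ctsib)
coefTable(glmfit)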

##              Coeff. p-value 95% Limits       
## (Intercept) -5.8000 3.8e-12     -7.500 -4.200
## age          0.0082 7.2e-01     -0.037  0.053
## sexmale      0.8400 4.9e-03      0.260  1.400
## surfacenorm  3.8000 7.6e-19      2.900  4.600
## visiondome   0.3400 3.6e-01     -0.390  1.100
## visionopen   3.0000 2.8e-14      2.200  3.800

GEE with working independence assumption (i.e. \(\mathbf{R}\) = identity matrix)

library(geepack)
geefit1 <- geeglm(I(stable == "Stable") ~ age + sex + surface + vision,
                  id = subject, family = binomial, data = ctsib)
summary(geefit1)
## 
## Call:
## geeglm(formula = I(stable == "Stable") ~ age + sex + surface + 
##     vision, family = binomial, data = ctsib, id = subject)
## 
##  Coefficients:
##              Estimate   Std.err   Wald Pr(>|W|)    
## (Intercept) -5.820702  1.353114 18.505 1.69e-05 ***
## age          0.008244  0.040074  0.042    0.837    
## sexmale      0.840758  0.568497  2.187    0.139    
## surfacenorm  3.780236  0.478625 62.380 2.89e-15 ***
## visiondome   0.344369  0.389482  0.782    0.377    
## visionopen   3.023124  0.422501 51.198 8.35e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)   0.7156  0.2925
## 
## Correlation: Structure = independence
## Number of clusters:   40   Maximum cluster size: 12

GEE, equi-correlation (exchangeable) structure

geefit2 <- geeglm(I(stable == "Stable") ~ age + sex + surface + vision,
                  id = subject, family = binomial, data = ctsib,
                  corstr = "exchangeable")
summary(geefit2)
## 
## Call:
## geeglm(formula = I(stable == "Stable") ~ age + sex + surface + 
##     vision, family = binomial, data = ctsib, id = subject, corstr = "exchangeable")
## 
##  Coefficients:
##              Estimate   Std.err  Wald Pr(>|W|)    
## (Intercept) -5.462658  1.467088 13.86   0.0002 ***
## age          0.000472  0.044402  0.00   0.9915    
## sexmale      0.895973  0.607566  2.17   0.1403    
## surfacenorm  3.698260  0.488029 57.43  3.5e-14 ***
## visiondome   0.336671  0.369896  0.83   0.3627    
## visionopen   2.992531  0.422202 50.24  1.4e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)    0.691   0.276
## 
## Correlation: Structure = exchangeable  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.226   0.112
## Number of clusters:   40   Maximum cluster size: 12

Mixed effects model, random subject term

library(lme4)
mefit <- glmer(I(stable == "Stable") ~ age + sex + surface + vision + (1 | subject),
               family = binomial, data = ctsib)
summary(mefit)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: 
## I(stable == "Stable") ~ age + sex + surface + vision + (1 | subject)
##    Data: ctsib
## 
##      AIC      BIC   logLik deviance df.resid 
##      248      277     -117      234      473 
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -4.771 -0.140 -0.016  0.000  5.532 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  subject (Intercept) 10.2     3.19    
## Number of obs: 480, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.7617     3.0328   -3.88  0.00011 ***
## age          -0.0026     0.0858   -0.03  0.97580    
## sexmale       1.8474     1.1197    1.65  0.09896 .  
## surfacenorm   7.7872     1.2555    6.20  5.6e-10 ***
## visiondome    0.6889     0.5337    1.29  0.19672    
## visionopen    6.5460     1.1640    5.62  1.9e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) age    sexmal srfcnr visndm
## age         -0.744                            
## sexmale     -0.305 -0.014                     
## surfacenorm -0.586 -0.018  0.216              
## visiondome  -0.167  0.001  0.033  0.116       
## visionopen  -0.578 -0.018  0.216  0.874  0.345

We see a marked difference between the GEE and mixed effects approaches.

  • introduction of an explicit subject term in the mixed effects model corresponds to stratification by subject
    • the \(\beta\)’s are subject-specific effects; the generic terminology is cluster-specific
  • the GEE specification of the conditional expectation includes no term for subject
    • the \(\beta\)’s correspond to marginal effects - averaging over subjects
    • so-called population-averaged effects
  • this distinction doesn’t exist in the linear model
    • linear models are collapsible

Collapsibility

  1. Relationships among three variables \(X\), \(Y\), and \(Z\) are said to be collapsible if the joint distribution of \(X\), \(Y\), and \(Z\) can be accurately inferred from the three pair-wise marginal distributions

    • A complicating feature of the logistic regression model and other models based on log odds is non-collapsibility. A simple but somewhat extreme example is illustrated below.

    • binary response \(y\) with two binary predictors \(x\) and \(z\)

print(tb)
## , , z = 0
## 
##    y
## x    0  1
##   0 24  1
##   1 23  2
## 
## , , z = 1
## 
##    y
## x    0  1
##   0  2 23
##   1  1 24
fit <- glm(y ~ x + z, family = binomial, data = df)
coefTable(fit,trans=exp)
##               Coeff.  p-value 95% Limits         
## (Intercept)   0.0417 1.45e-04    0.00809    0.215
## x             2.0870 4.09e-01    0.36452   11.948
## z           276.0000 2.73e-10   48.20741 1580.172
fitc <- glm(y ~ x, family = binomial, data = df)
coefTable(fitc,trans=exp)
##             Coeff. p-value 95% Limits     
## (Intercept)  0.923   0.777      0.530 1.61
## x            1.174   0.689      0.536 2.57

The knee-jerk explanation of the difference is confounding between \(x\) and \(z\) - the fit below shows this isn’t so.

fitz <- glm(x ~ z, family = binomial, data = df)
coefTable(fitz,trans=exp)
##             Coeff. p-value 95% Limits     
## (Intercept)      1       1      0.574 1.74
## z                1       1      0.457 2.19

A more thoughtful perspective is examination of the odds ratios (a computational sketch follows the output below).

stratified odds ratio (conditioning on z)

##    0    1 
## 2.09 2.09

un-stratified odds ratio (marginalizing over z), along with the marginal \(z \times y\) table showing the strong association between \(z\) and \(y\)

## [1] 1.17
##    y
## z    0  1
##   0 47  3
##   1  3 47
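For reference, a minimal sketch of these odds ratio computations from the \(x \times y \times z\) table tb shown earlier (the helper or is ad hoc, not from the course code):

or <- function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])  # 2x2 odds ratio
apply(tb, 3, or)             # stratified: one odds ratio per level of z
or(apply(tb, c(1, 2), sum))  # marginal: collapse over z, then compute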

Problems in Causal Inference.

Collinearity (also multi-collinearity)

  1. Collinearity is an ever-present complication of multi-predictor models, since the estimated coefficient of a variable may change dramatically depending on what other variables are in the model.
    • A naive approach to dealing with collinearity is to use automatic variable selection to choose a parsimonious set of variables. Variable selection is aimed at eliminating non-important variables, but it may remove variables that are important if they are highly collinear with other variables.

    • Scientifically sound model building must take into account the root causes of multicollinearity.

Causal Inference.

  1. Many (?most?) studies are trying to understand causal relationships.
    • What is a causal relationship? We often speak in terms of an exposure and an outcome.
    • Informal (deterministic) definition: a causal relationship exists if a change in exposure results in a change in outcome.
    • Probabilistic definition: a causal relationship exists if a change in exposure alters the probability of the outcome.
  2. How do we establish causality in a relationship?
    • Experimentation: e.g. animal experiments on tobacco smoke exposure in mice.
    • Randomized experiments in humans - usually restricted to examining beneficial exposures.
    • Observational (naturalistic) studies:
      • Cohort studies
      • Case-control studies
  3. What is confounding?
    • In observational studies apparent association between an exposure and outcome may be misleading due to the impact of a third, confounding, variable
    • A confounder is a variable that is causally linked to the outcome of interest, but is also associated with the exposure of interest.

Identifying confounding

  1. Approaching confounding depends first of all on having a conceptual causal model, which identifies potential pathways of causality.
    • Causal diagrams, as seen later in the course, can be helpful.
  2. Indications that a variable \(Z\) is a possible confounder of the relationship between \(Y\) and \(X\):

    • A pattern of association between \(X\) and \(Z\).
    • An appreciable change in the magnitude of the coefficient for \(X\) in its relationship with \(Y\) (e.g. 10% or more) when \(Z\) is included in the model (see the sketch after this list)
    • Indication that \(Z\) is itself an “independent” determinant of \(Y\) by the statistical significance of \(Z\) in the model \(Y\) ~ \(X + Z\).
  3. Confounding is a challenging problem to deal with

    • Potential confounders must be identified in the study design so that appropriate data may be collected.
    • Confounders must be measured accurately - inaccurate measurement of \(Z\) will attenuate its apparent influence, and make it harder to control for
    • The model for \(Y\) as a function of \(X\) and \(Z\) must be accurately specified, e.g. functional forms for predictors.
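A minimal sketch of the coefficient-change check from point 2, assuming a hypothetical data frame dat with binary outcome y, exposure x, and potential confounder z:

fit.x  <- glm(y ~ x, family = binomial, data = dat)      # crude model
fit.xz <- glm(y ~ x + z, family = binomial, data = dat)  # adjusted for z
b.crude    <- coef(fit.x)["x"]
b.adjusted <- coef(fit.xz)["x"]
abs(b.adjusted - b.crude) / abs(b.crude)  # relative change; > 0.10 flags possible confounding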

Effect Mediation

Patterns of apparent confounding may also be indications of effect mediation: \(Z\) is a mediator of the effect of \(X\) on \(Y\) when \(X\) acts causally on \(Z\), which then acts causally on \(Y\).