2018-10-29
In chapter 6, Faraway provides data relating to the explosion after launch of the space shuttle Challenger in 1986.
In 23 previous launches, evidence of (non-disastrous) damage due to blow-by and erosion from O-ring failure was recorded.
The proportion of O-ring failures is plotted against launch-time temperature over the 23 launches.
A variable, call it \(Y\), is called a binary (or Bernoulli) variable if it takes only two values, conventionally coded 0 and 1. Writing \(p = P(Y = 1)\), its probability mass function is
\[P \left( Y = y \right) = p^y ( 1-p)^{1-y}, \quad y = 0, 1\]
A binomial variable is the number of successes in a fixed number of independent repetitions of a Bernoulli experiment.
\[ P \left( Y _ { i } = y _ { i } \right) = \left( \begin{array} { c } { n _ { i } } \\ { y _ { i } } \end{array} \right) p _ { i } ^ { y _ { i } } \left( 1 - p _ { i } \right) ^ { n _ { i } - y _ { i } } \]
Note that a binary variable is just a binomial variable with \(n = 1\).
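As a quick numerical check of this relationship, a small sketch of the binomial pmf (using only the standard library; the function name is illustrative) shows that setting \(n = 1\) recovers the Bernoulli probabilities:

```python
import math

def binom_pmf(y, n, p):
    """Binomial probability mass: C(n, y) * p^y * (1-p)^(n-y)."""
    return math.comb(n, y) * p**y * (1 - p)**(n - y)

# With n = 1 the binomial pmf reduces to the Bernoulli pmf p^y (1-p)^(1-y)
p = 0.3
print(binom_pmf(1, 1, p), binom_pmf(0, 1, p))  # 0.3 0.7
```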
Fitting this model to the Challenger data yields
##
## Call:
## glm(formula = cbind(damage, 6 - damage) ~ temp, family = binomial,
## data = orings)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9529 -0.7345 -0.4393 -0.2079 1.9565
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.74351 1.61542 2.936 0.00332 **
## temp -0.38922 0.09572 -4.066 4.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38.898 on 22 degrees of freedom
## Residual deviance: 16.912 on 21 degrees of freedom
## AIC: 33.675
##
## Number of Fisher Scoring iterations: 6
We’ll talk more about interpreting such results later. For now the best way to examine the fitted model is graphically.
Based on the model fit above we can see that the predicted probability of O-ring failure at temperatures less than 0 is nearly 1.
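That claim can be checked directly from the coefficients reported in the `glm` output above, by inverting the logit link (a sketch in Python rather than R; the function name is illustrative):

```python
import math

# Intercept and slope from the fitted Challenger model above
b0, b1 = 4.74351, -0.38922

def p_failure(temp):
    """Inverse logit of the linear predictor: predicted P(O-ring damage)."""
    eta = b0 + b1 * temp
    return 1.0 / (1.0 + math.exp(-eta))

# At a temperature of 0 the predicted failure probability is near 1
print(round(p_failure(0), 3))  # 0.991
```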
The same fit summarized with exponentiated coefficients (the estimates are on the odds scale, while the standard errors remain on the log-odds scale):
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 115. 1.62 2.94 0.00332
## 2 temp 0.678 0.0957 -4.07 0.0000478
We can also relate \(p\) and \(\beta_0 + \beta_1 x_i\) by assuming a link function \(g\) such that \(g ( p_i ) = \eta_i = \beta_0 + \beta_1 x_i\).
Other link functions can be used, including:
Probit: \(\eta = \Phi ^ { - 1 } ( p )\) where \(\Phi\) is the normal cumulative distribution function.
Complementary log-log: \(\eta = \log ( - \log ( 1 - p ) )\)
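All three links map the unit interval onto the whole real line. A minimal sketch (standard library only) makes the comparison concrete; note that the complementary log-log link, unlike the other two, is not symmetric about \(p = 0.5\):

```python
from statistics import NormalDist
import math

norm = NormalDist()

def logit(p):
    return math.log(p / (1 - p))

def probit(p):
    return norm.inv_cdf(p)  # inverse of the normal CDF

def cloglog(p):
    return math.log(-math.log(1 - p))

# logit and probit are 0 at p = 0.5; cloglog is not
print(round(logit(0.5), 4), round(probit(0.5), 4), round(cloglog(0.5), 4))
```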
Extending the one-variable logistic regression model to multiple predictors is straightforward.
\[ \eta _ { i } = \beta _ { 0 } + \beta _ { 1 } x _ { i 1 } + \cdots + \beta _ { q } x _ { i q } \]
Combining the above with the binomial distribution yields an overall probabilistic model, permitting likelihood-based inference via the log-likelihood
\[ l ( \beta ) = \sum _ { i = 1 } ^ { n } \left[ y _ { i } \eta _ { i } - n _ { i } \log \left( 1 + e^ { { \eta _ i }} \right) + \log \left( \begin{array} { l } { n _ { i } } \\ { y _ { i } } \end{array} \right) \right] \]
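As a sanity check, the log-likelihood above can be implemented directly and compared against a term-by-term sum of log binomial pmfs (a sketch on toy data; the variable names and values are illustrative):

```python
import math

def loglik(b0, b1, x, y, n):
    """Binomial log-likelihood l(beta) exactly as in the formula above."""
    total = 0.0
    for xi, yi, ni in zip(x, y, n):
        eta = b0 + b1 * xi
        total += yi * eta - ni * math.log(1 + math.exp(eta)) \
                 + math.log(math.comb(ni, yi))
    return total

# Toy data: three groups of size 6, as in the orings setup
x, y, n = [1.0, 2.0, 3.0], [2, 1, 0], [6, 6, 6]
b0, b1 = 1.0, -0.8

# Direct computation from the binomial pmf for comparison
direct = 0.0
for xi, yi, ni in zip(x, y, n):
    p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
    direct += math.log(math.comb(ni, yi)) + yi * math.log(p) \
              + (ni - yi) * math.log(1 - p)

print(abs(loglik(b0, b1, x, y, n) - direct) < 1e-9)  # True
```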
The Western Collaborative Group Study (WCGS) was a large epidemiological study designed to investigate the association between the “type A” behavior pattern and coronary heart disease (Rosenman et al., 1964). Here’s some output from an initial exploration of some of the variables including the smoking variable, which has been established as a risk factor for coronary heart disease.
## wcgs[allVars] age chol
## N :3154 Mean : 46.278690 Mean : 226.37240
## Complete cases :3142 S.D. : 5.524045 S.D. : 43.42043
## % missing items: 0 Min. : 39.000000 Min. : 103.00000
## Variables : 6 1st Qu.: 42.000000 1st Qu.: 197.25000
## Factors : 3 Median : 45.000000 Median : 223.00000
## 3rd Qu.: 50.000000 3rd Qu.: 253.00000
## Max. : 59.000000 Max. : 645.00000
## n :3154.000000 n :3142.00000
## NA's : 12.00000
##
## bmi smoke typeA
## Mean : 24.518370 No :1652(52.4%) A:1589(50.4%)
## S.D. : 2.567496 Yes:1502(47.6%) B:1565(49.6%)
## Min. : 11.190610
## 1st Qu.: 22.955100
## Median : 24.389800
## 3rd Qu.: 25.840160
## Max. : 38.947370
## n :3154.000000
##
##
## chd69
## No :2897(91.9%)
## Yes: 257( 8.1%)
##
## chd69 age chol
## No :2897(91.9%) Mean( SD , N ) Mean( SD , N )
## Yes: 257( 8.1%) No 46.1(5.46,2897) No 224(42.2,2885)
## Yes 48.5(5.80, 257) Yes 250(49.4, 257)
## s=5.5,r2=0.014 s=43,r2=0.027
##
## bmi smoke
## Mean( SD , N ) No Yes
## No 24.5(2.56,2897) No 1554/1652(94.1%) 98/1652( 5.9%)
## Yes 25.1(2.58, 257) Yes 1343/1502(89.4%) 159/1502(10.6%)
## s=2.6,r2=0.0039
##
## typeA
## No Yes
## A 1411/1589(88.8%) 178/1589(11.2%)
## B 1486/1565(95.0%) 79/1565( 5.0%)
##
## Call:
## glm(formula = chd69 ~ age + chol + bmi + smoke + typeA, family = binomial,
## data = wcgs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2419 -0.4452 -0.3282 -0.2310 2.9352
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.450174 0.945501 -11.053 < 2e-16 ***
## age 0.067912 0.011744 5.783 7.35e-09 ***
## chol 0.011135 0.001489 7.476 7.65e-14 ***
## bmi 0.086813 0.025631 3.387 0.000707 ***
## smokeYes 0.614705 0.140115 4.387 1.15e-05 ***
## typeAB -0.730325 0.143669 -5.083 3.71e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1779.2 on 3141 degrees of freedom
## Residual deviance: 1608.3 on 3136 degrees of freedom
## (12 observations deleted due to missingness)
## AIC: 1620.3
##
## Number of Fisher Scoring iterations: 6
Interpreting these coefficients requires some careful thought. Let’s return to the single variable case where we have \(\eta_i = \beta_0 + \beta_1 x_i\). Suppose we wish to compare the linear predictor values at two specific values of \(x\), \(x_k\) and \(x_l\). It’s not hard to see that
\[ \eta_k - \eta_l = \beta_1 \times ( x_k - x_l) \]
\(\eta_k - \eta_l\) may also be written as
\[ log( \frac{p_{x_k}}{1-p_{x_k}} ) - log( \frac{p_{x_l}}{1-p_{x_l}} ) \]
Exponentiating these two expressions yields
\[e^{\beta_1 \times ( x_k - x_l)} = {\frac{p_{x_k}}{1-p_{x_k}}} / {\frac{p_{x_l}}{1-p_{x_l}}} \] where we see that the right hand side of the expression is just the odds ratio for \(p_{x_k}\) versus \(p_{x_l}\). If \(x_k - x_l = 1\), the left hand side is just \(e^{\beta_1}\). Thus exponentiating the logistic slope yields the odds ratio for two observations whose \(x\) values differ by 1.
If the values of \(p_{x_k}\) and \(p_{x_l}\) are smallish (i.e. less than about 0.1) then \(e^{ \beta_1 ( x_k - x_l )}\) is approximately equal to the relative risk, the ratio of \(P(Y=1)\) when \(X = x_k\) to \(P(Y=1)\) when \(X = x_l\), because both \(1-p_{x_k}\) and \(1-p_{x_l}\) will be approximately 1.
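A quick numerical illustration of this approximation, using two hypothetical small probabilities (the values are made up for the example):

```python
def odds(p):
    return p / (1 - p)

# Hypothetical probabilities, both below 0.1
p_k, p_l = 0.08, 0.05

odds_ratio = odds(p_k) / odds(p_l)
relative_risk = p_k / p_l

# The odds ratio and relative risk are close when both p's are small
print(round(odds_ratio, 3), round(relative_risk, 3))  # 1.652 1.6
```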
More informative output, including exponentiated coefficients (odds ratios) and confidence intervals, can be obtained using the `tidy` function from the broom package.
## # A tibble: 6 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0000289 0.946 -11.1 2.13e-28 0.00000445 0.000182
## 2 age 1.07 0.0117 5.78 7.35e- 9 1.05 1.10
## 3 chol 1.01 0.00149 7.48 7.65e-14 1.01 1.01
## 4 bmi 1.09 0.0256 3.39 7.07e- 4 1.04 1.15
## 5 smokeYes 1.85 0.140 4.39 1.15e- 5 1.41 2.44
## 6 typeAB 0.482 0.144 -5.08 3.71e- 7 0.362 0.636
Rescaling the predictors gives odds ratios for more interpretable increments (10 years of age, 100 units of cholesterol, 10 units of BMI):
## # A tibble: 6 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0000289 0.946 -11.1 2.13e-28 0.00000445 0.000182
## 2 I(age/10) 1.97 0.117 5.78 7.35e- 9 1.57 2.48
## 3 I(chol/100) 3.04 0.149 7.48 7.65e-14 2.28 4.09
## 4 I(bmi/10) 2.38 0.256 3.39 7.07e- 4 1.44 3.93
## 5 smokeYes 1.85 0.140 4.39 1.15e- 5 1.41 2.44
## 6 typeAB 0.482 0.144 -5.08 3.71e- 7 0.362 0.636
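The rescaled odds ratios are just powers of the per-unit ones: dividing a predictor by 10 multiplies its coefficient by 10, so the odds ratio for a 10-unit increase is the per-unit odds ratio raised to the 10th power. A sketch using the age coefficient from the WCGS output above:

```python
import math

b_age = 0.067912  # per-year log-odds coefficient from the WCGS fit above

or_per_year = math.exp(b_age)
or_per_decade = math.exp(10 * b_age)

# Matches the age rows of the two tidy tables: 1.07 per year, 1.97 per decade
print(round(or_per_year, 2), round(or_per_decade, 2))  # 1.07 1.97
```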
Likelihood theory provides the basis for inference about the coefficients:

An approximate \(100(1-\alpha)\%\) confidence interval for \(\beta_i\) is
\[ \hat { \beta } _ { i } \pm z ^ { \alpha / 2 } se \left( \hat { \beta } _ { i } \right) \]
The following test statistic is approximately standard Normal:
\[Z = \frac{\hat { \beta } _ { i } - \beta _ i}{se \left( \hat { \beta } _ { i } \right)} \]
For testing \(H_0 : \beta_i = 0\) this reduces to
\[Z = \frac{\hat { \beta } _ { i }} {se \left( \hat { \beta } _ { i } \right)}\]
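These Wald quantities can be recomputed from the smoking coefficient in the WCGS output above (a sketch; note the `tidy` intervals printed earlier are profile-likelihood based, so they differ slightly from the Wald interval computed here):

```python
import math
from statistics import NormalDist

b, se = 0.614705, 0.140115     # smokeYes estimate and SE from the output above
z = b / se                      # Wald z statistic

zcrit = NormalDist().inv_cdf(0.975)  # ~1.96 for a 95% interval
lo = math.exp(b - zcrit * se)   # exponentiate to the odds-ratio scale
hi = math.exp(b + zcrit * se)

# z matches the printed 4.387; Wald CI (1.41, 2.43) vs profile (1.41, 2.44)
print(round(z, 3), round(lo, 2), round(hi, 2))
```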
Nested models can be compared via the likelihood ratio statistic
\[ 2 \log \frac { L _ { L } } { L _ { S } } \]
where \(L_L\) and \(L_S\) are the maximized likelihoods of the larger and smaller models; this statistic is approximately \(\chi^2\) distributed with degrees of freedom equal to the difference in the number of parameters.
Comparing a fitted model to the saturated model yields the deviance
\[ D = 2 \sum _ { i = 1 } ^ { n } \left\{ y _ { i } \log \left( y _ { i } / \hat { y } _ { i } \right) + \left( n _ { i } - y _ { i } \right) \log \left( \left( n _ { i } - y _ { i } \right) / \left( n _ { i } - \hat { y } _ { i } \right) \right) \right\} \]
For binary data (\(n_i = 1\)) the deviance reduces to
\[ - 2 \sum _ { i = 1 } ^ { n } \left\{ \hat { p } _ { i } \operatorname { logit } \left( \hat { p } _ { i } \right) + \log \left( 1 - \hat { p } _ { i } \right) \right\} \]
which depends only on the fitted probabilities \(\hat{p}_i\) and not on the observed \(y_i\), so it is not useful as a goodness-of-fit measure in that case.
The probit link arises naturally in dose-response settings: if a subject's tolerance \(T\) is normally distributed with mean \(\mu\) and standard deviation \(\sigma\), the probability of response at dose \(d_i\) is
\[ p _ { i } = P \left( T \leq d _ { i } \right) = \Phi \left( \left( d _ { i } - \mu \right) / \sigma \right) \]
so that
\[ \Phi ^ { - 1 } \left( p _ { i } \right) = - \mu / \sigma + d _ { i } / \sigma \]
i.e. the probit of \(p_i\) is linear in the dose.
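This linearity is easy to verify numerically (a sketch with a hypothetical tolerance distribution; \(\mu\) and \(\sigma\) are made-up values):

```python
from statistics import NormalDist

norm = NormalDist()
mu, sigma = 10.0, 2.0  # hypothetical tolerance mean and SD

# probit of p_i equals -mu/sigma + d_i/sigma for every dose d_i
for d in (8.0, 10.0, 12.0):
    p = norm.cdf((d - mu) / sigma)
    assert abs(norm.inv_cdf(p) - (-mu / sigma + d / sigma)) < 1e-9

print("probit(p_i) is linear in dose")
```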