2018-10-29
In chapter 6, Faraway provides data relating to the explosion after launch of the space shuttle Challenger in 1986. 2018-10-29
In 23 previous launches evidence of (non-disatrous) damage due to blow by and erosion due to O-ring failure was recorded.
The proportion of O-ring failures is plotted against launch time temperature over the 23 launches.
A variable, call it Y, is called a binary (or Bernoulli) variable if it takes only two values
P(Y=y)=py(1−p)y−1
A binomial variable is the number of Successes in a number of repetitions of a Bernoulli experiment.
P(Yi=yi)=(niyi)pyii(1−pi)ni−yi
Note that a binary variable is just a binomial variable with n=1
Fitting this model to the Challenger data yields
##
## Call:
## glm(formula = cbind(damage, 6 - damage) ~ temp, family = binomial,
## data = orings)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9529 -0.7345 -0.4393 -0.2079 1.9565
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.74351 1.61542 2.936 0.00332 **
## temp -0.38922 0.09572 -4.066 4.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38.898 on 22 degrees of freedom
## Residual deviance: 16.912 on 21 degrees of freedom
## AIC: 33.675
##
## Number of Fisher Scoring iterations: 6
We’ll talk more about interpreting such results later. For now the best way to examine the fitted model is graphically.
Based on the model fit above we can see that the probability of O-ring failure at temperatures less that 0 is nearly 1.
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 115. 1.62 2.94 0.00332
## 2 temp 0.678 0.0957 -4.07 0.0000478
Can also relate p and β0+β1xi assuming a g such that g(pi)=η=β0+β1xi .
Other link functions that can be used, including
Probit: η=Φ−1(p) where Φ is the normal cumulative distribution function.
Complementary log-log: η=log(−log(1−p))
Extending the one variable logistic regression model to the multiple predictor model is easily done.
ηi=β0+β1xi1+⋯+βqxiq
Combining the above with binomial distribution yields overall probabilistic model, permitting likelihood based inference.
l(β)=n∑i=1[yiηi−nilog(1+eηi)+log(niyi)]
The Western Collaborative Group Study (WCGS) was a large epidemiological study designed to investigate the association between the “type A” behavior pattern and coronary heart disease (Rosenman et al., 1964). Here’s some output from an initial exploration of some of the variables including the smoking variable, which has been established as a risk factor for coronary heart disease.
## wcgs[allVars] age chol
## N :3154 Mean : 46.278690 Mean : 226.37240
## Complete cases :3142 S.D. : 5.524045 S.D. : 43.42043
## % missing items: 0 Min. : 39.000000 Min. : 103.00000
## Variables : 6 1st Qu.: 42.000000 1st Qu.: 197.25000
## Factors : 3 Median : 45.000000 Median : 223.00000
## 3rd Qu.: 50.000000 3rd Qu.: 253.00000
## Max. : 59.000000 Max. : 645.00000
## n :3154.000000 n :3142.00000
## NA's : 12.00000
##
## bmi smoke typeA
## Mean : 24.518370 No :1652(52.4%) A:1589(50.4%)
## S.D. : 2.567496 Yes:1502(47.6%) B:1565(49.6%)
## Min. : 11.190610
## 1st Qu.: 22.955100
## Median : 24.389800
## 3rd Qu.: 25.840160
## Max. : 38.947370
## n :3154.000000
##
##
## chd69
## No :2897(91.9%)
## Yes: 257( 8.1%)
##
##
##
##
##
##
##
##
## chd69 age chol
## No :2897(91.9%) Mean( SD , N ) Mean( SD , N )
## Yes: 257( 8.1%) No 46.1(5.46,2897) No 224(42.2,2885)
## Yes 48.5(5.80, 257) Yes 250(49.4, 257)
## s=5.5,r2=0.014 s=43,r2=0.027
##
## bmi smoke
## Mean( SD , N ) No Yes
## No 24.5(2.56,2897) No 1554/1652(94.1%) 98/1652( 5.9%)
## Yes 25.1(2.58, 257) Yes 1343/1502(89.4%) 159/1502(10.6%)
## s=2.6,r2=0.0039
##
## typeA
## No Yes
## A 1411/1589(88.8%) 178/1589(11.2%)
## B 1486/1565(95.0%) 79/1565( 5.0%)
##
##
##
## Call:
## glm(formula = chd69 ~ age + chol + bmi + smoke + typeA, family = binomial,
## data = wcgs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2419 -0.4452 -0.3282 -0.2310 2.9352
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.450174 0.945501 -11.053 < 2e-16 ***
## age 0.067912 0.011744 5.783 7.35e-09 ***
## chol 0.011135 0.001489 7.476 7.65e-14 ***
## bmi 0.086813 0.025631 3.387 0.000707 ***
## smokeYes 0.614705 0.140115 4.387 1.15e-05 ***
## typeAB -0.730325 0.143669 -5.083 3.71e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1779.2 on 3141 degrees of freedom
## Residual deviance: 1608.3 on 3136 degrees of freedom
## (12 observations deleted due to missingness)
## AIC: 1620.3
##
## Number of Fisher Scoring iterations: 6
Interpreting these coefficients requires some careful thought. Let’s return to the single variable case where we have ηi=β0+β1xi. Suppose we wish to compare the linear predictor values at two specific values of x, xk and xl. It’s not hard to see that
ηk−ηl=β1×(xk−xl)
ηk−ηl may also be written as
log(pxk1−pxk)−log(pxl1−pxl)
Exponentiating these two expressions yields
eβ1×(xk−xl)=pxk1−pxk/pxl1−pxl where we see that the right hand side of the expression is just the odds ratio for pk versus pl. If $x_k - xl=1 , the left hand side is just eβ1. Thus exponentiating the logistic slope yields to odds ratio for two observations whose x values differ by 1.
If the values of pxk and pxl) were smallish (i.e. less than about .1) then we have that eβ1(xk−xl) is approximately equal to the relative risk (RR) Prob(Y=1) when X=x2 , because both 1−px1 and 1−px2 will be approximately 1.
More informative output can be obtained using the tidy command.
## # A tibble: 6 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0000289 0.946 -11.1 2.13e-28 0.00000445 0.000182
## 2 age 1.07 0.0117 5.78 7.35e- 9 1.05 1.10
## 3 chol 1.01 0.00149 7.48 7.65e-14 1.01 1.01
## 4 bmi 1.09 0.0256 3.39 7.07e- 4 1.04 1.15
## 5 smokeYes 1.85 0.140 4.39 1.15e- 5 1.41 2.44
## 6 typeAB 0.482 0.144 -5.08 3.71e- 7 0.362 0.636
## # A tibble: 6 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0000289 0.946 -11.1 2.13e-28 0.00000445 0.000182
## 2 I(age/10) 1.97 0.117 5.78 7.35e- 9 1.57 2.48
## 3 I(chol/100) 3.04 0.149 7.48 7.65e-14 2.28 4.09
## 4 I(bmi/10) 2.38 0.256 3.39 7.07e- 4 1.44 3.93
## 5 smokeYes 1.85 0.140 4.39 1.15e- 5 1.41 2.44
## 6 typeAB 0.482 0.144 -5.08 3.71e- 7 0.362 0.636
Likelihood theory
ˆβi±zα/2se(ˆβi) - following test statistic is approximately standard Normal
Z=ˆβi−βise(ˆβi)
Z=ˆβise(ˆβi)
2logLLLS
D=2n∑i=1{yilogyi/ˆyi+(ni−yi)log(ni−yi)/(ni−ˆyi)}
−2n∑i=1{ˆpilogit(ˆpi)+log(1−ˆpi)}
pi=P(T≤di)=Φ((di−μ)/σ)
Φ−1(pi)=−μ/σ+di/σ