Binary and Binomial Response Models

2018-10-29

Shuttle Disaster Example

In chapter 6, Faraway provides data relating to the explosion after launch of the space shuttle Challenger in 1986.

  • The cause of the crash was thought to be O-ring failure
  • O-rings were known to lose flexibility in the cold
  • The temperature on launch day was below zero degrees Celsius

In 23 previous launches, evidence of (non-disastrous) O-ring damage, in the form of blow-by and erosion, was recorded.

  • Each shuttle had two boosters, each with three O-rings
  • The number of O-rings out of 6 showing such damage was recorded

The proportion of O-ring failures is plotted against launch-time temperature for the 23 launches.

  • The linear regression line in this plot suggests a decreasing relationship (a code sketch of the plot follows this list).
  • It is not sufficiently accurate to capture the exact relationship.
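A minimal sketch of such a plot, assuming the orings data frame from Faraway's faraway package (which records temp in Fahrenheit; these notes appear to work in degrees Celsius, so we convert first):

library(faraway)
data(orings)
orings$temp <- (orings$temp - 32) * 5 / 9   # Fahrenheit -> Celsius
plot(damage / 6 ~ temp, data = orings, ylim = c(0, 1),
     xlab = "Temperature (C)", ylab = "Proportion of O-rings damaged")
abline(lm(damage / 6 ~ temp, data = orings))  # the naive linear fit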

Binary and Binomial variables

Binary or Bernoulli variable

A variable \(Y\) is called a binary (or Bernoulli) variable if it takes only two values

  • \(Success\) and \(Failure\) are generic terms for the two outcomes
  • For convenience we let \(1\) denote \(Success\) and \(0\) denote \(Failure\)
  • Let \(P \left( Y = 1 \right) = p\)

\[P \left( Y = y \right) = p^y ( 1-p)^{1-y}, \quad y = 0, 1\]

Binomial variable

A binomial variable is the number of \(Successes\) in \(n\) repetitions of a Bernoulli experiment.

  • Let \(Y = \sum _ { i=1 } ^ { n } y_i\)
  • where the \(y_i\)'s are independent binary variables with common success probability \(p\)
  • If we observe a set of \(Y_i\)'s, each with its own \(n_i\) and \(p_i\), we have

\[ P \left( Y _ { i } = y _ { i } \right) = \binom { n _ { i } } { y _ { i } } p _ { i } ^ { y _ { i } } \left( 1 - p _ { i } \right) ^ { n _ { i } - y _ { i } } \]

Note that a binary variable is just a binomial variable with \(n=1\).

Model for the O-ring data

  • Let \(Y_i,\ i = 1, \ldots, 23\), represent the failure counts out of \(n_i = 6\)
  • The observed proportions estimate the \(p_i\)'s, the true failure probabilities at temperatures \(x_i\)
  • We need a non-linear formula to capture the relationship, since all \(p_i\)'s must lie between \(0\) and \(1\)
  • A partially linear relation, \(p_i =\) function of \(\beta_0 + \beta_1 x_i\), puts things in a regression context
  • In usual linear regression \(\mu_i = \beta_0 + \beta_1 x_i\)
  • In generalized linear regression we let \(p_i = f ( \beta_0 + \beta_1 x_i )\)
  • One convenient function that does this is \(p(s) = \frac{1}{1 + e^{-s}}\), which gives rise to the logistic regression model (a sketch of its shape follows)
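A quick sketch of this function, showing how it maps the whole real line into \((0,1)\):

s <- seq(-6, 6, length.out = 200)
plot(s, 1 / (1 + exp(-s)), type = "l", xlab = "s", ylab = "p(s)")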

Examples

Fitting this model to the Challenger data yields the output below.
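A sketch of the glm call matching the Call line in the output (continuing with the Celsius-converted orings data; the object name is ours):

logitmod <- glm(cbind(damage, 6 - damage) ~ temp,
                family = binomial, data = orings)
summary(logitmod)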

## 
## Call:
## glm(formula = cbind(damage, 6 - damage) ~ temp, family = binomial, 
##     data = orings)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9529  -0.7345  -0.4393  -0.2079   1.9565  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.74351    1.61542   2.936  0.00332 ** 
## temp        -0.38922    0.09572  -4.066 4.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38.898  on 22  degrees of freedom
## Residual deviance: 16.912  on 21  degrees of freedom
## AIC: 33.675
## 
## Number of Fisher Scoring iterations: 6

Two points to keep in mind: the binomial model assumes the six O-rings on a given launch fail independently of one another, and we should check that the model fits sufficiently well before trusting it.

We’ll talk more about interpreting such results later. For now the best way to examine the fitted model is graphically.

Based on the model fit above we can see that the probability of O-ring failure at temperatures less than \(0\) degrees Celsius is nearly \(1\).
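The exponentiated coefficients shown next look like output from broom's tidy function; a sketch of such a call (using the logitmod object from above):

library(broom)
tidy(logitmod, exponentiate = TRUE)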

## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  115.       1.62        2.94 0.00332  
## 2 temp           0.678    0.0957     -4.07 0.0000478

Exponentiating the coefficients, as in this table, supports odds-ratio interpretations and, when the outcome is rare, approximate relative-risk interpretations; we develop this below.

Multiple Logistic Regression

Extending the one-variable logistic regression model to multiple predictors is easily done.

\[ \eta _ { i } = \beta _ { 0 } + \beta _ { 1 } x _ { i 1 } + \cdots + \beta _ { q } x _ { i q } \]

Combining the above with the binomial distribution yields an overall probabilistic model, permitting likelihood-based inference.

  • The log-likelihood for the multiple logistic regression model is

\[ l ( \beta ) = \sum _ { i = 1 } ^ { n } \left[ y _ { i } \eta _ { i } - n _ { i } \log \left( 1 + e ^ { \eta _ { i } } \right) + \log \binom { n _ { i } } { y _ { i } } \right] \]

  • Maximizing this function of the \((q+1)\)-dimensional parameter vector \(\bf{\beta}\) yields the maximum likelihood estimate, \(\hat{\bf{\beta}}\) (a sketch of doing this directly follows).
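To make this concrete, a small sketch that minimizes the negative of this log-likelihood with optim and can be checked against glm, using the O-ring fit from earlier (function and object names are ours):

negloglik <- function(beta, y, n, x) {
  eta <- beta[1] + beta[2] * x  # linear predictor
  -sum(y * eta - n * log(1 + exp(eta)) + lchoose(n, y))
}
fit <- optim(c(0, 0), negloglik, y = orings$damage, n = 6, x = orings$temp)
fit$par  # should be close to coef(logitmod)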

WCGS example

The Western Collaborative Group Study (WCGS) was a large epidemiological study designed to investigate the association between the “type A” behavior pattern and coronary heart disease (Rosenman et al., 1964). Here’s some output from an initial exploration of the variables, including smoking, an established risk factor for coronary heart disease.

Distributions of model variables

  • chd69: presence of coronary heart disease (outcome)
  • age: age in years
  • chol: cholesterol level
  • bmi: body mass index
  • smoke: smoking (yes/no)
  • typeA: personality type (A/B)
##  wcgs[allVars]          age                   chol                
##  N              :3154   Mean   :  46.278690   Mean   : 226.37240  
##  Complete cases :3142   S.D.   :   5.524045   S.D.   :  43.42043  
##  % missing items:   0   Min.   :  39.000000   Min.   : 103.00000  
##  Variables      :   6   1st Qu.:  42.000000   1st Qu.: 197.25000  
##  Factors        :   3   Median :  45.000000   Median : 223.00000  
##                         3rd Qu.:  50.000000   3rd Qu.: 253.00000  
##                         Max.   :  59.000000   Max.   : 645.00000  
##                         n      :3154.000000   n      :3142.00000  
##                                               NA's   :  12.00000  
##                                                                   
##  bmi                   smoke             typeA           
##  Mean   :  24.518370   No :1652(52.4%)   A:1589(50.4%)   
##  S.D.   :   2.567496   Yes:1502(47.6%)   B:1565(49.6%)   
##  Min.   :  11.190610                                     
##  1st Qu.:  22.955100                                     
##  Median :  24.389800                                     
##  3rd Qu.:  25.840160                                     
##  Max.   :  38.947370                                     
##  n      :3154.000000                                     
##                                                          
##                                                          
##  chd69            
##  No :2897(91.9%)  
##  Yes: 257( 8.1%)  

Initial examination of response relationships

##  chd69              age                    chol                  
##  No :2897(91.9%)        Mean( SD ,  N )        Mean( SD ,  N )   
##  Yes: 257( 8.1%)    No  46.1(5.46,2897)    No   224(42.2,2885)   
##                     Yes 48.5(5.80, 257)    Yes  250(49.4, 257)   
##                          s=5.5,r2=0.014          s=43,r2=0.027   
##                                                                  
##  bmi                    smoke                                   
##      Mean( SD ,  N )               No               Yes         
##  No  24.5(2.56,2897)    No  1554/1652(94.1%)   98/1652( 5.9%)   
##  Yes 25.1(2.58, 257)    Yes 1343/1502(89.4%)  159/1502(10.6%)   
##      s=2.6,r2=0.0039                                            
##                                                                 
##  typeA                                 
##           No               Yes         
##  A 1411/1589(88.8%)  178/1589(11.2%)   
##  B 1486/1565(95.0%)   79/1565( 5.0%)   

Preliminary Logistic Regression Model Fit

## 
## Call:
## glm(formula = chd69 ~ age + chol + bmi + smoke + typeA, family = binomial, 
##     data = wcgs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2419  -0.4452  -0.3282  -0.2310   2.9352  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.450174   0.945501 -11.053  < 2e-16 ***
## age           0.067912   0.011744   5.783 7.35e-09 ***
## chol          0.011135   0.001489   7.476 7.65e-14 ***
## bmi           0.086813   0.025631   3.387 0.000707 ***
## smokeYes      0.614705   0.140115   4.387 1.15e-05 ***
## typeAB       -0.730325   0.143669  -5.083 3.71e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1779.2  on 3141  degrees of freedom
## Residual deviance: 1608.3  on 3136  degrees of freedom
##   (12 observations deleted due to missingness)
## AIC: 1620.3
## 
## Number of Fisher Scoring iterations: 6

Interpreting these coefficients requires some careful thought. Let’s return to the single variable case where we have \(\eta_i = \beta_0 + \beta_1 x_i\). Suppose we wish to compare the linear predictor values at two specific values of \(x\), \(x_k\) and \(x_l\). It’s not hard to see that

\[ \eta_k - \eta_l = \beta_1 \times ( x_k - x_l) \]

\(\eta_k - \eta_l\) may also be written as

\[ \log \left( \frac{p_{x_k}}{1-p_{x_k}} \right) - \log \left( \frac{p_{x_l}}{1-p_{x_l}} \right) \]

Exponentiating these two expressions yields

\[e^{\beta_1 \times ( x_k - x_l)} = {\frac{p_{x_k}}{1-p_{x_k}}} / {\frac{p_{x_l}}{1-p_{x_l}}} \] where we see that the right-hand side of the expression is just the odds ratio for \(p_{x_k}\) versus \(p_{x_l}\). If \(x_k - x_l = 1\), the left-hand side is just \(e^{\beta_1}\). Thus exponentiating the logistic slope yields the odds ratio for two observations whose \(x\) values differ by 1.
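As a concrete check against the WCGS fit above: the smokeYes coefficient is \(0.6147\), so \(e^{0.6147} \approx 1.85\) estimates the odds ratio for smokers versus nonsmokers, other variables held fixed; this matches the smokeYes entry in the exponentiated table below.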

If the values of \(p_{x_k}\) and \(p_{x_l}\) are smallish (i.e. less than about 0.1), then \(e^{ \beta_1 ( x_k - x_l )}\) is approximately equal to the relative risk (RR), the ratio of \(P(Y=1)\) when \(X = x_k\) to \(P(Y=1)\) when \(X = x_l\), because both \(1-p_{x_k}\) and \(1-p_{x_l}\) will be approximately 1.

More informative output can be obtained using the tidy command from the broom package.
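A sketch of such a call (the fit object name is ours; conf.int = TRUE requests the confidence limits shown):

library(broom)
tidy(wcgs_mod, exponentiate = TRUE, conf.int = TRUE)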

## # A tibble: 6 x 7
##   term         estimate std.error statistic  p.value   conf.low conf.high
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>      <dbl>     <dbl>
## 1 (Intercept) 0.0000289   0.946      -11.1  2.13e-28 0.00000445  0.000182
## 2 age         1.07        0.0117       5.78 7.35e- 9 1.05        1.10    
## 3 chol        1.01        0.00149      7.48 7.65e-14 1.01        1.01    
## 4 bmi         1.09        0.0256       3.39 7.07e- 4 1.04        1.15    
## 5 smokeYes    1.85        0.140        4.39 1.15e- 5 1.41        2.44    
## 6 typeAB      0.482       0.144       -5.08 3.71e- 7 0.362       0.636

Results after rescaling variables for interpretability
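The term names in the table below imply a refit on rescaled predictors; a sketch (object name ours):

wcgs_mod10 <- glm(chd69 ~ I(age/10) + I(chol/100) + I(bmi/10) + smoke + typeA,
                  family = binomial, data = wcgs)
tidy(wcgs_mod10, exponentiate = TRUE, conf.int = TRUE)

Dividing age by 10, for example, makes the exponentiated coefficient the estimated odds ratio for a 10-year age difference, a more interpretable contrast than a 1-year difference.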

## # A tibble: 6 x 7
##   term         estimate std.error statistic  p.value   conf.low conf.high
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>      <dbl>     <dbl>
## 1 (Intercept) 0.0000289     0.946    -11.1  2.13e-28 0.00000445  0.000182
## 2 I(age/10)   1.97          0.117      5.78 7.35e- 9 1.57        2.48    
## 3 I(chol/100) 3.04          0.149      7.48 7.65e-14 2.28        4.09    
## 4 I(bmi/10)   2.38          0.256      3.39 7.07e- 4 1.44        3.93    
## 5 smokeYes    1.85          0.140      4.39 1.15e- 5 1.41        2.44    
## 6 typeAB      0.482         0.144     -5.08 3.71e- 7 0.362       0.636

Note that the confidence limits shown above are profile-likelihood based (what R's confint computes for glm fits), rather than the simpler Wald limits described next.

Inference for the \(\beta\)’s

Likelihood theory

  • The maximum likelihood estimates \(\hat{\bf{\beta}}\) are consistent estimates of \(\bf{\beta}\)
  • \(\hat{\bf{\beta}}\) has an asymptotically (i.e. approximately) normal distribution with mean \(\bf{\beta}\)
  • The asymptotic variance of \(\hat{\bf{\beta}}\) can be derived from the curvature (2nd derivative) of the log-likelihood
  • This yields the \(se \left( \hat { \beta } _ { i } \right)\)'s and an approximate confidence interval:

\[ \hat { \beta } _ { i } \pm z _ { \alpha / 2 }\, se \left( \hat { \beta } _ { i } \right) \]

  • the following test statistic is approximately standard Normal:

\[Z = \frac{\hat { \beta } _ { i } - \beta _ i}{se \left( \hat { \beta } _ { i } \right)} \]

  • Usually the value of interest is \(\beta_i = 0\), so standard output provides

\[Z = \frac{\hat { \beta } _ { i }} {se \left( \hat { \beta } _ { i } \right)}\]
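These are exactly the z value and Pr(>|z|) columns in the glm summaries above. A quick check using the O-ring fit:

coefs <- coef(summary(logitmod))
z <- coefs[, "Estimate"] / coefs[, "Std. Error"]
z                   # matches the z value column
2 * pnorm(-abs(z))  # matches the two-sided Pr(>|z|) column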

Likelihood ratio test

\[ 2 \log \frac { L _ { L } } { L _ { S } } \]

where \(L_L\) is the maximized likelihood of a larger model and \(L_S\) that of a smaller model nested within it. If the smaller model is correct, this statistic is approximately \(\chi^2\) distributed, with degrees of freedom equal to the difference in the numbers of parameters.
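In R this comparison can be carried out via the change in deviance; a sketch comparing the O-ring model with temperature to the intercept-only model:

null_mod <- glm(cbind(damage, 6 - damage) ~ 1, family = binomial, data = orings)
anova(null_mod, logitmod, test = "Chisq")  # likelihood ratio (deviance) test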

The deviance

The deviance compares the fitted model with the saturated model, which matches each observed proportion exactly. For binomial data it is

\[ D = 2 \sum _ { i = 1 } ^ { n } \left\{ y _ { i } \log \frac { y _ { i } } { \hat { y } _ { i } } + \left( n _ { i } - y _ { i } \right) \log \frac { n _ { i } - y _ { i } } { n _ { i } - \hat { y } _ { i } } \right\} \]

where \(\hat{y}_i = n_i \hat{p}_i\) are the fitted counts.
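A numerical check of this formula against glm's reported residual deviance for the O-ring fit (the ifelse terms implement the \(0 \log 0 = 0\) convention):

y <- orings$damage; n <- 6
yhat <- n * fitted(logitmod)  # fitted counts n * p-hat
2 * sum(ifelse(y == 0, 0, y * log(y / yhat)) +
        ifelse(y == n, 0, (n - y) * log((n - y) / (n - yhat))))
deviance(logitmod)            # should agree (16.912 above)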

For purely binary data (all \(n_i = 1\)) the deviance reduces to

\[ - 2 \sum _ { i = 1 } ^ { n } \left\{ \hat { p } _ { i } \operatorname { logit } \left( \hat { p } _ { i } \right) + \log \left( 1 - \hat { p } _ { i } \right) \right\} \]

which depends on the data only through the \(\hat{p}_i\)'s, so in the binary case the deviance is not directly useful as a goodness-of-fit measure.

Finally, a latent variable view motivates other choices of link function. Suppose subject \(i\) responds when its Normally distributed tolerance \(T \sim N(\mu, \sigma^2)\) is below the applied dose \(d_i\); then

\[ p _ { i } = P \left( T \leq d _ { i } \right) = \Phi \left( \left( d _ { i } - \mu \right) / \sigma \right) \]

so that

\[ \Phi ^ { - 1 } \left( p _ { i } \right) = - \mu / \sigma + d _ { i } / \sigma \]

i.e. the probit of \(p_i\) is linear in the dose, which is the probit regression model.
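A sketch of fitting the corresponding probit model to the O-ring data; the only change from the logistic fit is the link:

probitmod <- glm(cbind(damage, 6 - damage) ~ temp,
                 family = binomial(link = "probit"), data = orings)
coef(probitmod)  # play the roles of -mu/sigma and 1/sigma in the tolerance model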