  • Binary and Binomial Response Models
    • Shuttle Disaster Example
    • Binary and Binomial variables
    • Binomial variable
    • Examples
    • RELATIVE RISK INTERPRETATIONS
    • Link Function
    • Multiple Logistic Regression
    • Distributions of model variables
    • Initial examination of response relationships
    • Preliminary Logistic Regression Model Fit
    • Results after rescaling variables for interpretability
    • PROFILE BASED CONFIDENCE LIMITS
    • Inference for the β’s
    • Likelihood ratio test
    • The deviance

Binary and Binomial Response Models

2018-10-29

Shuttle Disaster Example

In chapter 6, Faraway provides data relating to the explosion after launch of the space shuttle Challenger in 1986.

  • Cause of the crash thought to be due to O-ring failure
  • O-rings were known to lose flexibility in cold
  • Temperature on launch day below zero

In 23 previous launches, evidence of (non-disastrous) damage due to blow-by and erosion caused by O-ring failure was recorded.

  • Each shuttle had two boosters with three O-rings each
  • the number of O-rings out of 6 showing such damage was recorded.

The proportion of O-ring failures is plotted against launch time temperature over the 23 launches.

  • The linear regression line in the plot above suggests a decreasing relationship.
  • It is, however, not sufficiently accurate to capture the exact relationship.

Binary and Binomial variables

Binary or Bernoulli variable

A variable, call it Y, is called a binary (or Bernoulli) variable if it takes only two values

  • Success and Failure are generic terms for the outcomes
  • For convenience we let 1 denote Success and 0 denote Failure
  • Let P(Y=1)=p

$P(Y=y)=p^y(1-p)^{1-y}$

Binomial variable

A binomial variable is the number of Successes in a number of repetitions of a Bernoulli experiment.

  • let $Y=\sum_{i=1}^n y_i$
  • the $y_i$'s independent binary variables with success probability $p$
  • Observing a set of $Y_i$'s, each with its own $n_i$ and $p_i$, we have

$P(Y_i=y_i)=\binom{n_i}{y_i}p_i^{y_i}(1-p_i)^{n_i-y_i}$

Note that a binary variable is just a binomial variable with $n=1$
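The binomial probability formula is easy to verify numerically. Below is a small Python sketch (the helper name `binom_pmf` is ours, not from the text):

```python
from math import comb

def binom_pmf(y, n, p):
    # P(Y = y): probability of y Successes in n independent Bernoulli trials
    return comb(n, y) * p**y * (1 - p)**(n - y)

# With n = 1 this reduces to the Bernoulli formula p^y (1-p)^(1-y)
print(binom_pmf(1, 1, 0.3))   # 0.3
# e.g. probability of 2 damaged O-rings out of 6 when p = 0.1
print(binom_pmf(2, 6, 0.1))   # ~0.0984
```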

Model for the O-ring data

  • $Y_i,\ i=1,\dots,23$ represent the failure counts
  • Observed proportions estimate the $p_i$'s, the true failure probabilities at temperature $x_i$
  • Need a non-linear formula to capture the relationship, as all $p_i$'s must be between 0 and 1
  • A partially linear relation, $p_i = f(\beta_0+\beta_1 x_i)$, puts things in a regression context
  • In usual linear regression $\mu_i=\beta_0+\beta_1 x_i$
  • In generalized linear regression we let $p_i=f(\beta_0+\beta_1 x_i)$
  • One convenient function that does this is $p(s)=\frac{1}{1+e^{-s}}$, which gives rise to the logistic regression model
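The logistic function maps the unrestricted linear predictor into (0, 1), which is exactly the constraint the $p_i$'s need. A minimal Python sketch:

```python
from math import exp

def logistic(s):
    # p(s) = 1 / (1 + e^(-s)): maps any real s into (0, 1)
    return 1.0 / (1.0 + exp(-s))

print(logistic(0))    # 0.5
print(logistic(4))    # ~0.982
print(logistic(-4))   # ~0.018, by the symmetry logistic(-s) = 1 - logistic(s)
```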

Examples

Fitting this model to the Challenger data yields

## 
## Call:
## glm(formula = cbind(damage, 6 - damage) ~ temp, family = binomial, 
##     data = orings)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9529  -0.7345  -0.4393  -0.2079   1.9565  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.74351    1.61542   2.936  0.00332 ** 
## temp        -0.38922    0.09572  -4.066 4.78e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38.898  on 22  degrees of freedom
## Residual deviance: 16.912  on 21  degrees of freedom
## AIC: 33.675
## 
## Number of Fisher Scoring iterations: 6

Two points to keep in mind: the fit assumes the individual O-ring outcomes are independent, and we should check that the model fits sufficiently well.

We’ll talk more about interpreting such results later. For now the best way to examine the fitted model is graphically.

Based on the model fit above we can see that the probability of O-ring failure at temperatures less than 0 is nearly 1.
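That claim can be checked by plugging the coefficients from the glm output into the inverse logit (a Python sketch; the name `p_fail` is ours):

```python
from math import exp

b0, b1 = 4.74351, -0.38922   # intercept and temp slope from the glm output above

def p_fail(temp):
    # inverse logit of the fitted linear predictor
    eta = b0 + b1 * temp
    return 1.0 / (1.0 + exp(-eta))

print(round(p_fail(0), 3))   # 0.991: per-O-ring failure probability near 1 at temp 0
```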

## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  115.       1.62        2.94 0.00332  
## 2 temp           0.678    0.0957     -4.07 0.0000478

RELATIVE RISK INTERPRETATIONS

Multiple Logistic Regression

Extending the one-variable logistic regression model to multiple predictors is straightforward.

$\eta_i=\beta_0+\beta_1 x_{i1}+\cdots+\beta_q x_{iq}$

Combining the above with the binomial distribution yields an overall probabilistic model, permitting likelihood-based inference.

  • the log likelihood for the multiple logistic regression model is

$l(\beta)=\sum_{i=1}^n\left[y_i\eta_i-n_i\log(1+e^{\eta_i})+\log\binom{n_i}{y_i}\right]$

  • Maximizing this function of the parameter vector $\beta$ yields the maximum likelihood estimate, $\hat\beta$.
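The log likelihood above can be transcribed directly (a Python sketch for clarity only; in practice glm maximizes it via Fisher scoring rather than evaluating it by hand):

```python
from math import comb, exp, log

def loglik(beta, X, y, n):
    # l(beta) = sum_i [ y_i*eta_i - n_i*log(1 + e^(eta_i)) + log C(n_i, y_i) ]
    total = 0.0
    for xi, yi, ni in zip(X, y, n):
        eta = beta[0] + sum(b * v for b, v in zip(beta[1:], xi))
        total += yi * eta - ni * log(1 + exp(eta)) + log(comb(ni, yi))
    return total

# single binary observation, all coefficients zero: l = -log 2
print(loglik([0.0], [[]], [1], [1]))
```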

WCGS example

The Western Collaborative Group Study (WCGS) was a large epidemiological study designed to investigate the association between the “type A” behavior pattern and coronary heart disease (Rosenman et al., 1964). Here’s some output from an initial exploration of some of the variables including the smoking variable, which has been established as a risk factor for coronary heart disease.

Distributions of model variables

  • chd: presence of coronary artery disease (outcome)
  • age: age in years
  • chol: cholesterol level
  • bmi: body mass index
  • smoke: smoking (yes/no)
  • typeA: personality type (A/B)
##  wcgs[allVars]          age                   chol                
##  N              :3154   Mean   :  46.278690   Mean   : 226.37240  
##  Complete cases :3142   S.D.   :   5.524045   S.D.   :  43.42043  
##  % missing items:   0   Min.   :  39.000000   Min.   : 103.00000  
##  Variables      :   6   1st Qu.:  42.000000   1st Qu.: 197.25000  
##  Factors        :   3   Median :  45.000000   Median : 223.00000  
##                         3rd Qu.:  50.000000   3rd Qu.: 253.00000  
##                         Max.   :  59.000000   Max.   : 645.00000  
##                         n      :3154.000000   n      :3142.00000  
##                                               NA's   :  12.00000  
##                                                                   
##  bmi                   smoke             typeA           
##  Mean   :  24.518370   No :1652(52.4%)   A:1589(50.4%)   
##  S.D.   :   2.567496   Yes:1502(47.6%)   B:1565(49.6%)   
##  Min.   :  11.190610                                     
##  1st Qu.:  22.955100                                     
##  Median :  24.389800                                     
##  3rd Qu.:  25.840160                                     
##  Max.   :  38.947370                                     
##  n      :3154.000000                                     
##                                                          
##                                                          
##  chd69            
##  No :2897(91.9%)  
##  Yes: 257( 8.1%)  
##                   
##                   
##                   
##                   
##                   
##                   
##                   
## 

Initial examination of response relationships

##  chd69              age                    chol                  
##  No :2897(91.9%)        Mean( SD ,  N )        Mean( SD ,  N )   
##  Yes: 257( 8.1%)    No  46.1(5.46,2897)    No   224(42.2,2885)   
##                     Yes 48.5(5.80, 257)    Yes  250(49.4, 257)   
##                          s=5.5,r2=0.014          s=43,r2=0.027   
##                                                                  
##  bmi                    smoke                                   
##      Mean( SD ,  N )               No               Yes         
##  No  24.5(2.56,2897)    No  1554/1652(94.1%)   98/1652( 5.9%)   
##  Yes 25.1(2.58, 257)    Yes 1343/1502(89.4%)  159/1502(10.6%)   
##      s=2.6,r2=0.0039                                            
##                                                                 
##  typeA                                 
##           No               Yes         
##  A 1411/1589(88.8%)  178/1589(11.2%)   
##  B 1486/1565(95.0%)   79/1565( 5.0%)   
##                                        
## 

Preliminary Logistic Regression Model Fit

## 
## Call:
## glm(formula = chd69 ~ age + chol + bmi + smoke + typeA, family = binomial, 
##     data = wcgs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2419  -0.4452  -0.3282  -0.2310   2.9352  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.450174   0.945501 -11.053  < 2e-16 ***
## age           0.067912   0.011744   5.783 7.35e-09 ***
## chol          0.011135   0.001489   7.476 7.65e-14 ***
## bmi           0.086813   0.025631   3.387 0.000707 ***
## smokeYes      0.614705   0.140115   4.387 1.15e-05 ***
## typeAB       -0.730325   0.143669  -5.083 3.71e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1779.2  on 3141  degrees of freedom
## Residual deviance: 1608.3  on 3136  degrees of freedom
##   (12 observations deleted due to missingness)
## AIC: 1620.3
## 
## Number of Fisher Scoring iterations: 6

Interpreting these coefficients requires some careful thought. Let's return to the single variable case where we have $\eta_i=\beta_0+\beta_1 x_i$. Suppose we wish to compare the linear predictor values at two specific values of $x$, $x_k$ and $x_l$. It's not hard to see that

$\eta_k-\eta_l=\beta_1\times(x_k-x_l)$

$\eta_k-\eta_l$ may also be written as

$\log\left(\frac{p_{x_k}}{1-p_{x_k}}\right)-\log\left(\frac{p_{x_l}}{1-p_{x_l}}\right)$

Exponentiating these two expressions yields

$e^{\beta_1\times(x_k-x_l)}=\left.\frac{p_{x_k}}{1-p_{x_k}}\right/\frac{p_{x_l}}{1-p_{x_l}}$, where we see that the right hand side of the expression is just the odds ratio for $p_{x_k}$ versus $p_{x_l}$. If $x_k-x_l=1$, the left hand side is just $e^{\beta_1}$. Thus exponentiating the logistic slope yields the odds ratio for two observations whose $x$ values differ by 1.

If the values of $p_{x_k}$ and $p_{x_l}$ are smallish (i.e. less than about .1) then $e^{\beta_1(x_k-x_l)}$ is approximately equal to the relative risk (RR) $p_{x_k}/p_{x_l}$, because both $1-p_{x_k}$ and $1-p_{x_l}$ will be approximately 1.
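The WCGS fit illustrates both interpretations numerically (Python arithmetic on numbers taken from the outputs above):

```python
from math import exp

b_smoke = 0.614705              # smokeYes coefficient from the WCGS fit above
print(round(exp(b_smoke), 2))   # 1.85: smokers' odds of CHD are about 1.85x nonsmokers'

# With smallish probabilities the odds ratio approximates the relative risk.
# Observed CHD rates from the cross-tabulation above:
p_no, p_yes = 0.059, 0.106
rr = p_yes / p_no                                        # relative risk
odds_ratio = (p_yes / (1 - p_yes)) / (p_no / (1 - p_no))
print(round(rr, 2), round(odds_ratio, 2))                # 1.8 1.89: close but not equal
```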

More informative output can be obtained using the tidy command.

## # A tibble: 6 x 7
##   term         estimate std.error statistic  p.value   conf.low conf.high
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>      <dbl>     <dbl>
## 1 (Intercept) 0.0000289   0.946      -11.1  2.13e-28 0.00000445  0.000182
## 2 age         1.07        0.0117       5.78 7.35e- 9 1.05        1.10    
## 3 chol        1.01        0.00149      7.48 7.65e-14 1.01        1.01    
## 4 bmi         1.09        0.0256       3.39 7.07e- 4 1.04        1.15    
## 5 smokeYes    1.85        0.140        4.39 1.15e- 5 1.41        2.44    
## 6 typeAB      0.482       0.144       -5.08 3.71e- 7 0.362       0.636

Results after rescaling variables for interpretability

## # A tibble: 6 x 7
##   term         estimate std.error statistic  p.value   conf.low conf.high
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>      <dbl>     <dbl>
## 1 (Intercept) 0.0000289     0.946    -11.1  2.13e-28 0.00000445  0.000182
## 2 I(age/10)   1.97          0.117      5.78 7.35e- 9 1.57        2.48    
## 3 I(chol/100) 3.04          0.149      7.48 7.65e-14 2.28        4.09    
## 4 I(bmi/10)   2.38          0.256      3.39 7.07e- 4 1.44        3.93    
## 5 smokeYes    1.85          0.140      4.39 1.15e- 5 1.41        2.44    
## 6 typeAB      0.482         0.144     -5.08 3.71e- 7 0.362       0.636
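The rescaled fit is just the original fit with coefficients multiplied by 10 before exponentiating, as a quick Python check against the two outputs confirms:

```python
from math import exp

b_age = 0.067912                  # per-year age coefficient from the earlier fit
print(round(exp(b_age), 2))       # 1.07: odds ratio per 1 year of age
print(round(exp(10 * b_age), 2))  # 1.97: odds ratio per decade, matching I(age/10)
```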

PROFILE BASED CONFIDENCE LIMITS

Inference for the β’s

Likelihood theory

  • The maximum likelihood estimates $\hat\beta$ are consistent estimates of $\beta$
  • $\hat\beta$ has an asymptotically (i.e. approximately) normal distribution with mean $\beta$
  • The asymptotic variance of $\hat\beta$ can be derived from the curvature (2nd derivative) of the log likelihood.
  • This yields the $se(\hat\beta_i)$'s and an approximate confidence interval:

$\hat\beta_i \pm z_{\alpha/2}\,se(\hat\beta_i)$

  • the following test statistic is approximately standard Normal

$Z=\frac{\hat\beta_i-\beta_i}{se(\hat\beta_i)}$

  • usually the value of interest is $\beta_i=0$, so the output provides

$Z=\frac{\hat\beta_i}{se(\hat\beta_i)}$
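These Wald quantities can be reproduced from the shuttle output (Python arithmetic; 1.96 is $z_{\alpha/2}$ for a 95% interval):

```python
b, se = -0.38922, 0.09572   # temp coefficient and its s.e. from the shuttle fit
z = b / se
print(round(z, 3))          # -4.066, matching the z value column of the glm output

# approximate 95% confidence interval: b +/- 1.96 * se
lo, hi = b - 1.96 * se, b + 1.96 * se
print(round(lo, 3), round(hi, 3))   # -0.577 -0.202
```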

Likelihood ratio test

$2\log\frac{L_L}{L_S}$, where $L_L$ and $L_S$ are the maximized likelihoods of a larger model and a smaller model nested within it; under the smaller model this statistic is approximately $\chi^2$ with degrees of freedom equal to the difference in the number of parameters.
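For example, comparing the WCGS fit to the null model: the statistic equals the null deviance minus the residual deviance (a Python sketch; scipy is assumed available for the $\chi^2$ tail probability):

```python
from scipy.stats import chi2

lrt = 1779.2 - 1608.3     # null minus residual deviance from the WCGS output
df = 3141 - 3136          # difference in residual degrees of freedom
print(round(lrt, 1), df)  # 170.9 5
print(chi2.sf(lrt, df))   # p-value: vanishingly small, predictors are jointly useful
```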

The deviance

$D=2\sum_{i=1}^n\left\{y_i\log(y_i/\hat y_i)+(n_i-y_i)\log\big((n_i-y_i)/(n_i-\hat y_i)\big)\right\}$

For binary data ($n_i=1$) this reduces to

$D=-2\sum_{i=1}^n\left\{\hat p_i\,\mathrm{logit}(\hat p_i)+\log(1-\hat p_i)\right\}$
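The general deviance formula can be transcribed directly (a Python sketch; the convention $0\log 0=0$ handles saturated terms):

```python
from math import log

def deviance(y, n, phat):
    # D = 2 * sum { y*log(y/yhat) + (n-y)*log((n-y)/(n-yhat)) }, with yhat = n*phat
    D = 0.0
    for yi, ni, pi in zip(y, n, phat):
        yhat = ni * pi
        if yi > 0:                   # 0 * log(0) terms contribute nothing
            D += yi * log(yi / yhat)
        if yi < ni:
            D += (ni - yi) * log((ni - yi) / (ni - yhat))
    return 2 * D

# fitted probabilities equal to observed proportions give deviance 0
print(deviance([2, 1], [6, 6], [2 / 6, 1 / 6]))   # ~0.0
```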

$p_i=P(T\le d_i)=\Phi\big((d_i-\mu)/\sigma\big)$

$\Phi^{-1}(p_i)=-\mu/\sigma+d_i/\sigma$