The dataset wbca comes from a study of breast cancer in Wisconsin. There are 681 cases of potentially cancerous tumours, of which 238 are actually malignant. Whether a tumour is really malignant has traditionally been determined by an invasive surgical procedure. The purpose of this study was to determine whether a new procedure called fine needle aspiration, which draws only a small sample of tissue, could be effective in determining tumour status.
As a rule one should examine summaries and preliminary plots of the data before jumping to the more formal analysis.
data(wbca)
#?wbca
wbca <- within(wbca, Class <- factor(Class, labels=c("Malign","Benign")))
summary(wbca)
## Class Adhes BNucl Chrom
## Malign:238 Min. : 1.000 Min. : 1.000 Min. : 1.000
## Benign:443 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 1.000 Median : 1.000 Median : 3.000
## Mean : 2.816 Mean : 3.542 Mean : 3.433
## 3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000
## Max. :10.000 Max. :10.000 Max. :10.000
## Epith Mitos NNucl Thick
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 2.000 Median : 1.000 Median : 1.000 Median : 4.000
## Mean : 3.231 Mean : 1.604 Mean : 2.859 Mean : 4.436
## 3rd Qu.: 4.000 3rd Qu.: 1.000 3rd Qu.: 4.000 3rd Qu.: 6.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
## UShap USize
## Min. : 1.000 Min. : 1.00
## 1st Qu.: 1.000 1st Qu.: 1.00
## Median : 1.000 Median : 1.00
## Mean : 3.204 Mean : 3.14
## 3rd Qu.: 5.000 3rd Qu.: 5.00
## Max. :10.000 Max. :10.00
clear(3,3)
wbcas <- within(wbca, levels(Class) <- c("M","B"))
prelim.plot(Class ~ ., data=wbcas,plot.resp=FALSE)
We note that for all predictor variables, larger values tend to indicate malignancy.
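One quick numerical check of this claim (a sketch assuming the `wbca` data frame with the recoded `Class` factor from above) is to compare the mean of each predictor within the two classes:

```r
# Mean of every predictor within each tumour class; if larger values
# indicate malignancy, the "Malign" row should dominate column-wise.
aggregate(. ~ Class, data = wbca, FUN = mean)
```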
Fit a binomial regression with Class as the response and the other nine variables as predictors. Report the residual deviance and associated degrees of freedom. Can this information be used to determine if this model fits the data? Explain.
lgtFit <- glm(Class ~ ., family=binomial,data=wbca)
summary(lgtFit)
##
## Call:
## glm(formula = Class ~ ., family = binomial, data = wbca)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.48282 -0.01179 0.04739 0.09678 3.06425
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.16678 1.41491 7.892 2.97e-15 ***
## Adhes -0.39681 0.13384 -2.965 0.00303 **
## BNucl -0.41478 0.10230 -4.055 5.02e-05 ***
## Chrom -0.56456 0.18728 -3.014 0.00257 **
## Epith -0.06440 0.16595 -0.388 0.69795
## Mitos -0.65713 0.36764 -1.787 0.07387 .
## NNucl -0.28659 0.12620 -2.271 0.02315 *
## Thick -0.62675 0.15890 -3.944 8.01e-05 ***
## UShap -0.28011 0.25235 -1.110 0.26699
## USize 0.05718 0.23271 0.246 0.80589
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 881.388 on 680 degrees of freedom
## Residual deviance: 89.464 on 671 degrees of freedom
## AIC: 109.46
##
## Number of Fisher Scoring iterations: 8
As we expect from the preliminary plots above, the estimated coefficients are all negative, with the single exception of USize, whose coefficient is small and far from significant. Larger predictor scores are thus associated with a lower estimated probability of a benign result.
The residual deviance is 89.5 on 671 degrees of freedom. This information cannot be used to assess the fit of the model: the data are ungrouped, with each row a single 0/1 response, and for ungrouped binary data the residual deviance does not have an approximate chi-squared distribution. Comparing the deviance to its degrees of freedom is therefore not a valid goodness-of-fit test here.
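To make the point concrete, here is the chi-squared comparison one might be tempted to run (a sketch assuming `lgtFit` from above); the resulting p-value is simply not trustworthy for 0/1 responses:

```r
# Nominal goodness-of-fit p-value comparing deviance to chi-squared.
# Valid for grouped binomial counts, NOT for ungrouped 0/1 data like these.
pchisq(deviance(lgtFit), df.residual(lgtFit), lower.tail = FALSE)
```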
Use AIC as the criterion to determine the best subset of variables. (Use the step function.)
lgtFitR <- step(lgtFit,direction="both")
summary(lgtFitR)
##
## Call:
## glm(formula = Class ~ Adhes + BNucl + Chrom + Mitos + NNucl +
## Thick + UShap, family = binomial, data = wbca)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.44161 -0.01119 0.04962 0.09741 3.08205
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.0333 1.3632 8.094 5.79e-16 ***
## Adhes -0.3984 0.1294 -3.080 0.00207 **
## BNucl -0.4192 0.1020 -4.111 3.93e-05 ***
## Chrom -0.5679 0.1840 -3.085 0.00203 **
## Mitos -0.6456 0.3634 -1.777 0.07561 .
## NNucl -0.2915 0.1236 -2.358 0.01837 *
## Thick -0.6216 0.1579 -3.937 8.27e-05 ***
## UShap -0.2541 0.1785 -1.423 0.15461
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 881.388 on 680 degrees of freedom
## Residual deviance: 89.662 on 673 degrees of freedom
## AIC: 105.66
##
## Number of Fisher Scoring iterations: 8
The reduction using AIC eliminated only two variables, Epith and USize. Most of the remaining variables are statistically significant, but not all: AIC selects models on penalised-fit (information-theoretic) grounds rather than by classical hypothesis testing, so marginally non-significant terms such as Mitos and UShap can survive the selection.
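For reference, the AIC values reported above can be reproduced directly from the residual deviances. For binary-response models the saturated log-likelihood is zero, so AIC reduces to the deviance plus twice the number of estimated coefficients (10 in the full model, 8 in the reduced one):

```r
# AIC = residual deviance + 2 * number of estimated coefficients
89.464 + 2 * 10  # full model
89.662 + 2 * 8   # reduced model
```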
Use the reduced model to predict the outcome for a new patient with predictor variables 1, 1, 3, 2, 1, 1, 4, 1, 1 (same order as above). Give a confidence interval for your prediction.
newrow <- wbca[1,]
newrow[1,-1] <- c(1, 1, 3, 2, 1, 1, 4, 1, 1)
newpred <- predict(lgtFitR,newdata=newrow,type="link",se.fit=TRUE)
lgtVals <- newpred$fit + 1.96*c(0,-1,1)*newpred$se.fit
probVals <- round(1/(1+exp(-lgtVals)),3)
The estimated probability of a benign tumour for a new patient as above is 0.992, with a 95% confidence interval of 0.976 to 0.997. The interval is computed on the logit (link) scale, where the normal approximation is more reliable, and then transformed back to the probability scale.
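Incidentally, the back-transformation used for `probVals` is the inverse logit, which base R provides directly as `plogis`; it could replace the explicit formula:

```r
# 1/(1 + exp(-x)) is the logistic CDF, available in base R as plogis()
x <- c(-2, 0, 3)
all.equal(1 / (1 + exp(-x)), plogis(x))
```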
Suppose that a cancer is classified as benign if p > 0.5 and malignant if p < 0.5. Compute the number of errors of both types that will be made if this method is applied to the current data with the reduced model.
predBenign <- factor(predict(lgtFitR,type="response") > .5, labels=c("pred-Mal.","pred-Ben."))
xtab <- table(predBenign,wbca$Class)
kableN(xtab)
| | Malign | Benign |
|---|---|---|
| pred-Mal. | 227 | 9 |
| pred-Ben. | 11 | 434 |
pcts <- percents(xtab,denom=2,print.denom=FALSE)
totError1 <- percents( table(as.integer(predBenign)==as.integer(wbca$Class)))[1]
If we use .5 as the cutoff, of the 443 patients with benign tumours, 9 (2.0%) will be incorrectly predicted to be malignant. Of the 238 patients with malignant tumours, 11 (4.6%) will be incorrectly predicted to be benign.
Suppose we change the cutoff to 0.9 so that p < 0.9 is classified as malignant and p > 0.9 as benign. Compute the number of errors in this case. Discuss the issues in determining the cutoff.
predBenign <- factor(predict(lgtFitR,type="response") > .9, labels=c("pred-Mal.","pred-Ben."))
xtab <- table(predBenign,wbca$Class)
kableN(xtab)
| | Malign | Benign |
|---|---|---|
| pred-Mal. | 237 | 16 |
| pred-Ben. | 1 | 427 |
pcts <- percents(xtab,denom=2,print.denom=FALSE)
totError2 <- percents( table(as.integer(predBenign)==as.integer(wbca$Class)))[1]
If we use .9 as the cutoff, of patients with benign tumours, 16 (3.6%) will be incorrectly predicted to be malignant. Of patients with malignant tumours, 1 (0.4%) will be incorrectly predicted to be benign.
As we increase the cutoff from .5 to .9, we decrease the number of malignant cases falsely classified as benign, but increase the number of benign cases falsely classified as malignant.
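The trade-off can be traced across a range of cutoffs (a sketch assuming `lgtFitR` and `wbca` from above; a case is classified benign when its fitted probability exceeds the cutoff):

```r
p <- predict(lgtFitR, type = "response")
for (cut in c(0.1, 0.3, 0.5, 0.7, 0.9)) {
  falseMal <- sum(p <= cut & wbca$Class == "Benign")  # benign called malignant
  falseBen <- sum(p > cut & wbca$Class == "Malign")   # malignant called benign
  cat(sprintf("cutoff %.1f: %3d benign->malignant, %3d malignant->benign\n",
              cut, falseMal, falseBen))
}
```

The right cutoff depends on the relative costs of the two errors: a missed malignancy is usually far more serious than an unnecessary follow-up, which argues for a high cutoff in this application.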
It is usually misleading to use the same data to fit a model and test its predictive ability. To investigate this, split the data into two parts — assign every third observation to a test set and the remaining two thirds of the data to a training set. Use the training set to determine the model and the test set to assess its predictive performance. Compare the outcome to the previously obtained results.
everyThird <- (1:nrow(wbca))%%3 == 0
wbcaPart1 <- wbca[!everyThird,]
wbcaPart2 <- wbca[everyThird,]
lgtFit2 <- glm(Class ~ ., family=binomial,data=wbcaPart1)
lgtFit2R <- step(lgtFit2,direction="both")
predictPart2 <- predict(lgtFit2R,type="response",newdata=wbcaPart2)
predBenign <- factor(predictPart2 > .5, labels=c("pred-Mal.","pred-Ben."))
xtab <- table(predBenign,wbcaPart2$Class)
kableN(xtab)
| | Malign | Benign |
|---|---|---|
| pred-Mal. | 70 | 2 |
| pred-Ben. | 5 | 150 |
pcts <- percents(xtab,denom=2,print.denom=FALSE)
totError3 <- percents( table(as.integer(predBenign)==as.integer(wbcaPart2$Class)))[1]
If we use .5 as the cutoff, of the 152 patients with benign tumours, 2 (1.3%) will be incorrectly predicted to be malignant. Of the 75 patients with malignant tumours, 5 (6.7%) will be incorrectly predicted to be benign.
predBenign <- factor(predictPart2 > .9, labels=c("pred-Mal.","pred-Ben."))
xtab <- table(predBenign,wbcaPart2$Class)
kableN(xtab)
| | Malign | Benign |
|---|---|---|
| pred-Mal. | 73 | 3 |
| pred-Ben. | 2 | 149 |
pcts <- percents(xtab,denom=2,print.denom=FALSE)
totError4 <- percents( table(as.integer(predBenign)==as.integer(wbcaPart2$Class)))[1]
If we use .9 as the cutoff, of patients with benign tumours, 3 (2.0%) will be incorrectly predicted to be malignant. Of patients with malignant tumours, 2 (2.7%) will be incorrectly predicted to be benign.
In the naive assessment of error rates we have total error percentages of 20/681 (2.9%) and 17/681 (2.5%) for cutoffs of .5 and .9 respectively. In the more honest split-sample assessment we have total error percentages of 7/227 (3.1%) and 5/227 (2.2%) for cutoffs of .5 and .9 respectively. In general we would expect the naive approach to under-estimate predictive error relative to a split-sample approach; here that pattern appears only at the .5 cutoff, while at .9 the test-set rate is actually slightly lower, a reminder that comparisons based on such small error counts are quite noisy.
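The total error rates quoted above can also be reproduced without the custom `percents` helper; a base-R sketch (the function name `errRate` is ours, not part of the original analysis):

```r
# Overall misclassification rate at a given cutoff:
# a case is predicted benign when its fitted probability exceeds `cut`.
errRate <- function(p, truth, cut) {
  pred <- ifelse(p > cut, "Benign", "Malign")
  mean(pred != as.character(truth))
}
# e.g. errRate(predictPart2, wbcaPart2$Class, 0.5)  # honest rate at .5
```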