The dataset wbca comes from a study of breast cancer in Wisconsin. There are 681 cases of potentially cancerous tumours, of which 238 are actually malignant. Whether a tumour is really malignant has traditionally been determined by an invasive surgical procedure. The purpose of this study was to determine whether a new procedure called fine needle aspiration, which draws only a small sample of tissue, could be effective in determining tumour status.
As a rule one should examine summaries and preliminary plots of the data before jumping to the more formal analysis.
data(wbca)
#?wbca
wbca <- within(wbca, Class <- factor(Class, labels=c("Malign","Benign")))
summary(wbca)
## Class Adhes BNucl Chrom
## Malign:238 Min. : 1.000 Min. : 1.000 Min. : 1.000
## Benign:443 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 1.000 Median : 1.000 Median : 3.000
## Mean : 2.816 Mean : 3.542 Mean : 3.433
## 3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000
## Max. :10.000 Max. :10.000 Max. :10.000
## Epith Mitos NNucl Thick
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 2.000
## Median : 2.000 Median : 1.000 Median : 1.000 Median : 4.000
## Mean : 3.231 Mean : 1.604 Mean : 2.859 Mean : 4.436
## 3rd Qu.: 4.000 3rd Qu.: 1.000 3rd Qu.: 4.000 3rd Qu.: 6.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
## UShap USize
## Min. : 1.000 Min. : 1.00
## 1st Qu.: 1.000 1st Qu.: 1.00
## Median : 1.000 Median : 1.00
## Mean : 3.204 Mean : 3.14
## 3rd Qu.: 5.000 3rd Qu.: 5.00
## Max. :10.000 Max. :10.00
clear(3,3)
wbcas <- within(wbca, levels(Class) <- c("M","B"))
prelim.plot(Class ~ ., data=wbcas,plot.resp=FALSE)
We note that for all predictor variables, larger values tend to indicate malignancy.
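One quick numerical check of this claim (a sketch assuming the `wbca` data frame with the recoded `Class` factor from above) is to compare the mean of each predictor within the two classes:

```r
# Mean of every predictor within each tumour class; if larger values
# indicate malignancy, the "Malign" row should dominate column-wise.
aggregate(. ~ Class, data = wbca, FUN = mean)
```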
Fit a binomial regression with Class as the response and the other nine variables as predictors. Report the residual deviance and associated degrees of freedom. Can this information be used to determine if this model fits the data? Explain.
lgtFit <- glm(Class ~ ., family=binomial,data=wbca)
summary(lgtFit)
##
## Call:
## glm(formula = Class ~ ., family = binomial, data = wbca)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.48282 -0.01179 0.04739 0.09678 3.06425
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.16678 1.41491 7.892 2.97e-15 ***
## Adhes -0.39681 0.13384 -2.965 0.00303 **
## BNucl -0.41478 0.10230 -4.055 5.02e-05 ***
## Chrom -0.56456 0.18728 -3.014 0.00257 **
## Epith -0.06440 0.16595 -0.388 0.69795
## Mitos -0.65713 0.36764 -1.787 0.07387 .
## NNucl -0.28659 0.12620 -2.271 0.02315 *
## Thick -0.62675 0.15890 -3.944 8.01e-05 ***
## UShap -0.28011 0.25235 -1.110 0.26699
## USize 0.05718 0.23271 0.246 0.80589
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 881.388 on 680 degrees of freedom
## Residual deviance: 89.464 on 671 degrees of freedom
## AIC: 109.46
##
## Number of Fisher Scoring iterations: 8
As we expect from the preliminary plots above, the estimated coefficients are all negative, with the single exception of USize, whose coefficient is small and far from significant. Larger predictor scores are thus associated with a lower estimated probability of a benign result.
The residual deviance is 89.5 on 671 degrees of freedom. This information cannot be used to assess the fit of the model: the data are ungrouped, with each row a single 0/1 response, and for ungrouped binary data the residual deviance does not have an approximate chi-squared distribution. Comparing the deviance to its degrees of freedom is therefore not a valid goodness-of-fit test here.
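To make the point concrete, here is the chi-squared comparison one might be tempted to run (a sketch assuming `lgtFit` from above); the resulting p-value is simply not trustworthy for 0/1 responses:

```r
# Nominal goodness-of-fit p-value comparing deviance to chi-squared.
# Valid for grouped binomial counts, NOT for ungrouped 0/1 data like these.
pchisq(deviance(lgtFit), df.residual(lgtFit), lower.tail = FALSE)
```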
Use AIC as the criterion to determine the best subset of variables. (Use the step function.)
lgtFitR <- step(lgtFit,direction="both")
summary(lgtFitR)
##
## Call:
## glm(formula = Class ~ Adhes + BNucl + Chrom + Mitos + NNucl +
## Thick + UShap, family = binomial, data = wbca)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.44161 -0.01119 0.04962 0.09741 3.08205
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.0333 1.3632 8.094 5.79e-16 ***
## Adhes -0.3984 0.1294 -3.080 0.00207 **
## BNucl -0.4192 0.1020 -4.111 3.93e-05 ***
## Chrom -0.5679 0.1840 -3.085 0.00203 **
## Mitos -0.6456 0.3634 -1.777 0.07561 .
## NNucl -0.2915 0.1236 -2.358 0.01837 *
## Thick -0.6216 0.1579 -3.937 8.27e-05 ***
## UShap -0.2541 0.1785 -1.423 0.15461
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 881.388 on 680 degrees of freedom
## Residual deviance: 89.662 on 673 degrees of freedom
## AIC: 105.66
##
## Number of Fisher Scoring iterations: 8
The reduction using AIC eliminated only two variables, Epith and USize. Most of the remaining variables are statistically significant, but not all: AIC selects models on penalised-fit (information-theoretic) grounds rather than by classical hypothesis testing, so marginally non-significant terms such as Mitos and UShap can survive the selection.
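For reference, the AIC values reported above can be reproduced directly from the residual deviances. For binary-response models the saturated log-likelihood is zero, so AIC reduces to the deviance plus twice the number of estimated coefficients (10 in the full model, 8 in the reduced one):

```r
# AIC = residual deviance + 2 * number of estimated coefficients
89.464 + 2 * 10  # full model
89.662 + 2 * 8   # reduced model
```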
Use the reduced model to predict the outcome for a new patient with predictor variables 1, 1, 3, 2, 1, 1, 4, 1, 1 (same order as above). Give a confidence interval for your prediction.
newrow <- wbca[1,]
newrow[1,-1] <- c(1, 1, 3, 2, 1, 1, 4, 1, 1)
newpred <- predict(lgtFitR,newdata=newrow,type="link",se.fit=TRUE)
lgtVals <- newpred$fit + 1.96*c(0,-1,1)*newpred$se.fit
probVals <- round(1/(1+exp(-lgtVals)),3)
The estimated probability of a benign tumour for a new patient as above is 0.992, with a 95% confidence interval of 0.976 to 0.997. The interval is computed on the logit (link) scale, where the normal approximation is more reliable, and then transformed back to the probability scale.
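Incidentally, the back-transformation used for `probVals` is the inverse logit, which base R provides directly as `plogis`; it could replace the explicit formula:

```r
# 1/(1 + exp(-x)) is the logistic CDF, available in base R as plogis()
x <- c(-2, 0, 3)
all.equal(1 / (1 + exp(-x)), plogis(x))
```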
Suppose that a cancer is classified as benign if p > 0.5 and malignant if p < 0.5. Compute the number of errors of both types that will be made if this method is applied to the current data with the reduced model.
predBenign <- factor(predict(lgtFitR,type="response") > .5, labels=c("pred-Mal.","pred-Ben."))
xtab <- table(predBenign,wbca$Class)
kableN(xtab)
| | Malign | Benign |
|---|---|---|
| pred-Mal. | 227 | 9 |
| pred-Ben. | 11 | 434 |
pcts <- percents(xtab,denom=2,print.denom=FALSE)
totError1 <- percents( table(as.integer(predBenign)==as.integer(wbca$Class)))[1]
If we use .5 as the cutoff, of the 443 patients with benign tumours, 9 (2.0%) will be incorrectly predicted to be malignant. Of the 238 patients with malignant tumours, 11 (4.6%) will be incorrectly predicted to be benign.
Suppose we change the cutoff to 0.9 so that p < 0.9 is classified as malignant and p > 0.9 as benign. Compute the number of errors in this case. Discuss the issues in determining the cutoff.
predBenign <- factor(predict(lgtFitR,type="response") > .9, labels=c("pred-Mal.","pred-Ben."))
xtab <- table(predBenign,wbca$Class)
kableN(xtab)
| | Malign | Benign |
|---|---|---|
| pred-Mal. | 237 | 16 |
| pred-Ben. | 1 | 427 |
pcts <- percents(xtab,denom=2,print.denom=FALSE)
totError2 <- percents( table(as.integer(predBenign)==as.integer(wbca$Class)))[1]
If we use .9 as the cutoff, of patients with benign tumours, 16 (3.6%) will be incorrectly predicted to be malignant. Of patients with malignant tumours, 1 (0.4%) will be incorrectly predicted to be benign.
As we increase the cutoff from .5 to .9, we decrease the number of malignant cases falsely classified as benign, but increase the number of benign cases falsely classified as malignant.
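The trade-off can be traced across a range of cutoffs (a sketch assuming `lgtFitR` and `wbca` from above; a case is classified benign when its fitted probability exceeds the cutoff):

```r
p <- predict(lgtFitR, type = "response")
for (cut in c(0.1, 0.3, 0.5, 0.7, 0.9)) {
  falseMal <- sum(p <= cut & wbca$Class == "Benign")  # benign called malignant
  falseBen <- sum(p > cut & wbca$Class == "Malign")   # malignant called benign
  cat(sprintf("cutoff %.1f: %3d benign->malignant, %3d malignant->benign\n",
              cut, falseMal, falseBen))
}
```

The right cutoff depends on the relative costs of the two errors: a missed malignancy is usually far more serious than an unnecessary follow-up, which argues for a high cutoff in this application.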
It is usually misleading to use the same data to fit a model and test its predictive ability. To investigate this, split the data into two parts — assign every third observation to a test set and the remaining two thirds of the data to a training set. Use the training set to determine the model and the test set to assess its predictive performance. Compare the outcome to the previously obtained results.
everyThird <- (1:nrow(wbca))%%3 == 0
wbcaPart1 <- wbca[!everyThird,]
wbcaPart2 <- wbca[everyThird,]
lgtFit2 <- glm(Class ~ ., family=binomial,data=wbcaPart1)
lgtFit2R <- step(lgtFit2,direction="both")
predictPart2 <- predict(lgtFit2R,type="response",newdata=wbcaPart2)
predBenign <- factor(predictPart2 > .5, labels=c("pred-Mal.","pred-Ben."))
xtab <- table(predBenign,wbcaPart2$Class)
kableN(xtab)
| | Malign | Benign |
|---|---|---|
| pred-Mal. | 70 | 2 |
| pred-Ben. | 5 | 150 |
pcts <- percents(xtab,denom=2,print.denom=FALSE)
totError3 <- percents( table(as.integer(predBenign)==as.integer(wbcaPart2$Class)))[1]
If we use .5 as the cutoff, of the 152 patients with benign tumours, 2 (1.3%) will be incorrectly predicted to be malignant. Of the 75 patients with malignant tumours, 5 (6.7%) will be incorrectly predicted to be benign.
predBenign <- factor(predictPart2 > .9, labels=c("pred-Mal.","pred-Ben."))
xtab <- table(predBenign,wbcaPart2$Class)
kableN(xtab)
| | Malign | Benign |
|---|---|---|
| pred-Mal. | 73 | 3 |
| pred-Ben. | 2 | 149 |
pcts <- percents(xtab,denom=2,print.denom=FALSE)
totError4 <- percents( table(as.integer(predBenign)==as.integer(wbcaPart2$Class)))[1]
If we use .9 as the cutoff, of patients with benign tumours, 3 (2.0%) will be incorrectly predicted to be malignant. Of patients with malignant tumours, 2 (2.7%) will be incorrectly predicted to be benign.
In the naive assessment of error rates we have total error percentages of 20/681 (2.9%) and 17/681 (2.5%) for cutoffs of .5 and .9 respectively. In the more honest split-sample assessment we have total error percentages of 7/227 (3.1%) and 5/227 (2.2%) for cutoffs of .5 and .9 respectively. In general we would expect the naive approach to under-estimate predictive error relative to a split-sample approach; here that pattern appears only at the .5 cutoff, while at .9 the test-set rate is actually slightly lower, a reminder that comparisons based on such small error counts are quite noisy.
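The total error rates quoted above can also be reproduced without the custom `percents` helper; a base-R sketch (the function name `errRate` is ours, not part of the original analysis):

```r
# Overall misclassification rate at a given cutoff:
# a case is predicted benign when its fitted probability exceeds `cut`.
errRate <- function(p, truth, cut) {
  pred <- ifelse(p > cut, "Benign", "Malign")
  mean(pred != as.character(truth))
}
# e.g. errRate(predictPart2, wbcaPart2$Class, 0.5)  # honest rate at .5
```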