Poisson regression models

Recall that a discrete variable \(Y\) is said to have a Poisson distribution, if it satisfies

\[ Pr(Y=y) = \frac{e^{-\mu}\mu^{-y}}{y !}, y=0,1,2,\dots \]

Note that \(E(Y) = Var(Y) = \mu\)

The Poisson distribution can be derived as a limiting distribution of \(Y \sim Binomial(n,p)\) when \(n \to \infty\) and \(p \to 0\) in such a way that \(np \to \mu\).

  • e.g. random independent events in space and/or time
    • meteor hits, rare disease occurrences

History of the Poisson distribution

(from Cameron & Travidi, Chapter 1)

  • derived as a limiting case of the binomial (Poisson, 1837)
  • early applications:
    • annual number of deaths in the Prussian army from being kicked by mules (Bortkiewicz, 1898)
    • epidemic model - negative binomial distribution used to model overdispersion in disease counts due to heterogeneous contagion (Greenwood and Yule, 1920)

Modern applications

  • Health economics research, linking health service utilization (e.g. number of visits to doctors, number of days in hospital) to non-health variables such as income, price, insured status

  • Insurance: modeling the frequency of accidents and the number of insurance claims

  • Politics - frequency with which U.S. presidents were able to appoint U.S. Supreme Court Justices (King, 1987a).

  • Criminal justice - based on longitudinal data on 411 men for 20 years to study the number of recorded criminal offenses as a function of observable traits of criminals (Nagin and Land, 1993)

  • Academic productivity: based on a sample of about 900 doctoral candidates, Long (1997) analyzes the relation between the number of doctoral publications in the final three years of Ph.D. studies and gender, marital status, number of young children, prestige of Ph.D. department, and number of articles by mentor in the preceding three years

  • Manufacturing: the number of defects per area in a manufacturing process is studied by Lambert (1992) using data from a soldering experiment at AT&T Bell Laboratories.

Rare events example

In this example we examine the yearly incidence of shark attacks for the state of Florida from 1946 to 1999, as described below by Jeffrey Simonoff in his book Analyzing Categorical Data.

The Florida Museum of Natural History maintains the International Shark Attack File, a record of unprovoked shark attacks (and resultant fatalities) worldwide. Patterns in Florida are particularly interesting since the peninsular nature of the state means that almost all Florida residents and visitors are within 90 minutes of the Atlantic Ocean or Gulf of Mexico. Further, since population figures for Florida are available, it is possible to account for changes in coastal population by examining patterns in attack rates, rather than simply attack counts.

Here are plots of population size, numbers of attacks, and rates of attacks (per 100,000 person years) against calendar year. The data is available at the web-page noted above under the name \(floridashark\).

To analyze the data we first consider using a Poisson model as shark attacks are reasonably rare events. We can do this by applying a generalized linear model assuming the Poisson model using a log link to connect the Poisson mean \(\mu\) with rate change over time. Dividing fitted means by population size yields a model for rates. To do this we introduce an offset term for the log of the population size on the right hand side of the model equation.

Here are the model fitting steps and a plot of the resultant fit. Note that the time trend follows an exponential pattern rather than linear, due to the use of the log link.

poisRegFit <- glm( Attacks ~ offset(log(Pop100Thou)) + I(Year-1946), family=poisson, data=sharks)
plot(sharks$Year,sharks$Rate,xlab="Year", ylab="Rate")
predRate <- predict(poisRegFit,type="response")/(sharks$Pop100Thou)
lines(sharks$Year,predRate,col=2)

Here is the default summary output for the generalized model plus a transformed version. Note that the both summaries indicate that the rate of shark attacks is rising by approximately 3% a year according to the model.

## 
## Call:
## glm(formula = Attacks ~ offset(log(Pop100Thou)) + I(Year - 1946), 
##     family = poisson, data = sharks)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1470  -1.2001  -0.3177   0.7281   3.4856  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -3.614831   0.178844 -20.212  < 2e-16 ***
## I(Year - 1946)  0.031174   0.004361   7.148  8.8e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 176.93  on 53  degrees of freedom
## Residual deviance: 119.11  on 52  degrees of freedom
## AIC: 288.76
## 
## Number of Fisher Scoring iterations: 5
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.0269215 0.1788439 -20.212212 0 0.018763 0.0378397
I(Year - 1946) 1.0316649 0.0043612 7.148051 0 1.023034 1.0406839
fits <- predict(poisRegFit,type="response")
residSquared <-  (sharks$Attacks -  fits)^2
resids <- sharks$Attacks - fits

Overdispersion

Here are the summaries for a quasiposson fit.

## 
## Call:
## glm(formula = Attacks ~ offset(log(Pop100Thou)) + Year, family = quasipoisson, 
##     data = sharks)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1470  -1.2001  -0.3177   0.7281   3.4856  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.279344  13.021371  -4.936 8.60e-06 ***
## Year          0.031174   0.006559   4.753 1.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.26181)
## 
##     Null deviance: 176.93  on 53  degrees of freedom
## Residual deviance: 119.11  on 52  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.000000 13.0213706 -4.93645 8.60e-06 0.000000 0.000000
Year 1.031665 0.0065589 4.75291 1.62e-05 1.018819 1.045388

Fitting non-linear relationships with splines

As mentioned in last Wednesday’s lectures, spline functions provide a foundation for fitting a wide variety of non-linear patterns while avoiding the numerical problems associated with use of polynomials.

  • In general splines are piecewise polynomical functions, joined at points on the x-axis called knots
  • Section 2.4.3 in Harrell’s “Regression Modelling Strategies” provides a quick self-contained introduction

Linear splines

\[f ( x ) = \beta _ { 0 } + \beta _ { 1 } x + \beta _ { 2 } ( x - a ) _ { + } + \beta _ { 3 } ( x - b ) _ { + } + \beta _ { 4 } ( x - c ) _ { + } \]

where

\[\begin{aligned} ( u ) _ { + } = u , u & > 0 \\ 0 , & u \leq 0 \end{aligned}\]

Natural cubic splines

\[f ( x ) = \beta _ { 0 } + \beta _ { 1 } x _ { 1 } + \beta _ { 2 } x _ { 2 } + \ldots + \beta _ { k - 1 } x _ { k - 1 }\]

\[\text { where } x _ { 1 } = x \text { and for } j = 1 , \ldots , k - 2\]

\[\begin{aligned} x _ { j + 1 } & = \left( x - t _ { j } \right) _ { + } ^ { 3 } - \left( x - t _ { k - 1 } \right) _ { + } ^ { 3 } \left( t _ { k } - t _ { j } \right) / \left( t _ { k } - t _ { k - 1 } \right) \\ & + \left( x - t _ { k } \right) _ { + } ^ { 3 } \left( t _ { k - 1 } - t _ { j } \right) / \left( t _ { k } - t _ { k - 1 } \right) \end{aligned}\]

## 
## Call:
## glm(formula = Attacks ~ offset(log(Pop100Thou)) + ns(Year, 5), 
##     family = quasipoisson, data = sharks)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0629  -0.8824  -0.4366   0.5879   4.1116  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.9291     1.3184  -3.739 0.000493 ***
## ns(Year, 5)1   2.0074     1.2466   1.610 0.113874    
## ns(Year, 5)2   2.1586     1.4457   1.493 0.141955    
## ns(Year, 5)3   1.6099     0.8765   1.837 0.072437 .  
## ns(Year, 5)4   5.0974     2.8227   1.806 0.077217 .  
## ns(Year, 5)5   1.7772     0.4949   3.591 0.000772 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.217138)
## 
##     Null deviance: 176.93  on 53  degrees of freedom
## Residual deviance: 104.09  on 48  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

Analyzing Aggregated Counts

Data was collected by Doll and Hill in a ground-breaking follow-up study of smoking habits and lung cancer incidence in a cohort of British doctors. The data was aggregated by smoking status and 10 year age category. During the course of the study the physicians would age across category boundaries, so a person years approach was taken to determine denominators for rate calculations.

British Doctor’s Study
non-Smokers
Smokers
deaths pyears rate/10^4 pyears deaths pyears rate/10^4 pyears RR
35-44 2 18790 1.064396 32 52407 6.106054 5.74
45-54 12 10673 11.243324 104 43248 24.047355 2.14
55-64 28 5710 49.036778 206 28612 71.997763 1.47
65-74 28 2585 108.317215 186 12663 146.884625 1.36
75-84 31 1462 212.038304 102 5317 191.837502 0.90

Analyzing rates with Poisson regression, a.k.a. log-linear regression

llfit <- glm(deaths ~ offset(log(pyears)) + smokec + agec, family=poisson, data=df3)
kableN(tidy(llfit,expo=TRUE,conf.int=TRUE))
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.0003636 0.1917612 -41.297843 0.0000000 0.0002454 0.0005213
smokecyes 1.4255185 0.1073740 3.301875 0.0009604 1.1608784 1.7692033
agec45-54 4.4105836 0.1951028 7.606281 0.0000000 3.0455744 6.5603262
agec55-64 13.8391996 0.1837267 14.301161 0.0000000 9.7999085 20.1835186
agec65-74 28.5167828 0.1847986 18.130508 0.0000000 20.1458266 41.6671774
agec75-84 40.4512058 0.1922190 19.249382 0.0000000 28.1114400 59.8685311
anova(llfit,test="Chi")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: deaths
## 
## Terms added sequentially (first to last)
## 
## 
##        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       9     935.07              
## smokec  1    29.09         8     905.98 6.905e-08 ***
## agec    4   893.84         4      12.13 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residual deviance is statisticially significant, indicating presence of an interaction. Rather than using up all the degrees of freedom, we note a decreasing trend in the relative risks by age group that suggests using a one degree of freedom interaction component

df3 <- within(df3, xact1df <- as.integer(agec)* as.integer(smokec))
llfit <- glm(deaths ~ offset(log(pyears)) + agec + smokec + xact1df,
    family=poisson, data=df3)
summary(llfit)
## 
## Call:
## glm(formula = deaths ~ offset(log(pyears)) + agec + smokec + 
##     xact1df, family = poisson, data = df3)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7  
## -0.87734   0.20744   0.61999  -0.43806   0.00608   0.27117  -0.06891  
##        8         9        10  
## -0.21863   0.17544  -0.00335  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.27687    0.23916 -34.609  < 2e-16 ***
## agec45-54    2.04327    0.26781   7.630 2.35e-14 ***
## agec55-64    3.76578    0.41228   9.134  < 2e-16 ***
## agec65-74    5.06816    0.58228   8.704  < 2e-16 ***
## agec75-84    5.96584    0.74648   7.992 1.33e-15 ***
## smokecyes    1.44494    0.37289   3.875 0.000107 ***
## xact1df     -0.30873    0.09727  -3.174 0.001503 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 935.067  on 9  degrees of freedom
## Residual deviance:   1.546  on 3  degrees of freedom
## AIC: 70.614
## 
## Number of Fisher Scoring iterations: 4

The residual deviance indicates a satisfactory fit. Comparing observed and fitted age-specific relative risks we see:

Observed Fitted
35-44 5.74 3.11
45-54 2.14 2.29
55-64 1.47 1.68
65-74 1.36 1.23
75-84 0.90 0.91

Parametric models for Overdispersion

Negative binomial models

Recall a negative binomial distribution describes the probability of x failures before the \(r^{th}\) success in a series of independent Bernoulli trials with success probability \(p\).

Probability distribtion function for the negative binomial:

\[P ( X = x ) = \left( \begin{array} { c } { x + r - 1 } \\ { x } \end{array} \right) p ^ { r } ( 1 - p ) ^ { x }\]

  • Negative binomial can be derived as a mixture model.
  • Mixture model is constructed from a fixed parameter model by letting a parameter vary randomly.
    • Normal mixture model for outliers, \(\epsilon \sim N(0,k\sigma^2)\), where \(k = 1\) with probability \(1-\delta\) and \(k=3\) with probability \(\delta\).
    • e.g. Poisson stopped binomial, \(Y \sim Binomial(N,p)\) where \(N \sim Poisson(\mu)\)
    • result is just another Poisson !!
    • negative binomial arises from Poisson by letting \(\mu\) follow a \(\Gamma\) (Gamma) distribution
    • varying parameterizations of \(\Gamma\) give rise to different forms of negative binomial

Negative binomial, type 2

\[\begin{array} { c } { f ( y | \mu , \alpha ) = \frac { \Gamma \left( y + \alpha ^ { - 1 } \right) } { \Gamma ( y + 1 ) \Gamma \left( \alpha ^ { - 1 } \right) } \left( \frac { \alpha ^ { - 1 } } { \alpha ^ { - 1 } + \mu } \right) ^ { a } \left( \frac { \mu } { \alpha ^ { - 1 } + \mu } \right) ^ { y } } \\ { \alpha \geq 0 , y = 0,1,2 , \ldots } \end{array}\]
  • formalism above not too helpful, but yields \(E(Y) = \mu\) and \(Var(Y) = \mu + \alpha \mu^2\)

Negative binomial, type 1

Changing the parameterization of the \(\Gamma\) distribution leads to a negative binomial distribution for \(Y\), with \(E(Y)=\mu\) and \(Var(Y) = (1+\alpha)\mu\).

  • Note that letting \(\phi = (1+\alpha)\) yields \(Var(Y) = \phi \mu\)
    • i.e. is overdispersion parameter, since \(Var(Y) = \mu\) same as \(Poisson\) when \(\phi =1\)

Example

library(MASS)
negbinRegFit <- glm.nb( Attacks ~ offset(log(Pop100Thou)) + ns(Year,4), data=sharks)
summary(negbinRegFit)
## 
## Call:
## glm.nb(formula = Attacks ~ offset(log(Pop100Thou)) + ns(Year, 
##     4), data = sharks, init.theta = 7.346669212, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6564  -0.7973  -0.2495   0.4540   2.8194  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -4.6442     0.7442  -6.241 4.35e-10 ***
## ns(Year, 4)1   1.5776     0.7131   2.212  0.02694 *  
## ns(Year, 4)2   1.3127     0.5790   2.267  0.02337 *  
## ns(Year, 4)3   4.6425     1.6444   2.823  0.00475 ** 
## ns(Year, 4)4   1.5198     0.3578   4.248 2.16e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(7.3467) family taken to be 1)
## 
##     Null deviance: 95.768  on 53  degrees of freedom
## Residual deviance: 60.325  on 49  degrees of freedom
## AIC: 268.41
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  7.35 
##           Std. Err.:  3.48 
## 
##  2 x log-likelihood:  -256.412

Note that the overdispersion parameter \(Theta = \frac{1}{\alpha}\) is estimated by maximum likelihood. Large values of Theta indicate homogeneous Poisson behaviour.