Poisson regression models

Recall that a discrete variable \(Y\) is said to have a Poisson distribution, if it satisfies

\[ Pr(Y=y) = \frac{e^{-\mu}\mu^{-y}}{y !}, y=0,1,2,\dots \]

Note that \(E(Y) = Var(Y) = \mu\)

The Poisson distribution can be derived as a limiting distribution of \(Y \sim Binomial(n,p)\) when \(n \to \infty\) and \(p \to 0\) in such a way that \(np \to \mu\).

e.g. random independent events in space and/or time
- meteor hits, rare disease occurrences

History of the Poisson distribution

(from Cameron & Travidi, Chapter 1)

derived as a limiting case of the binomial (Poisson, 1837)
early applications:
- annual number of deaths in the Prussian army from being kicked by mules (Bortkiewicz, 1898)
- epidemic model - negative binomial distribution used to model overdispersion in disease counts due to heterogeneous contagion (Greenwood and Yule, 1920)

Modern applications

Health economics research, linking health service utilization (e.g. number of visits to doctors, number of days in hospital) to non-health variables such as income, price, insured status
Insurance: modeling the frequency of accidents and the number of insurance claims
Politics - frequency with which U.S. presidents were able to appoint U.S. Supreme Court Justices (King, 1987a).
Criminal justice - based on longitudinal data on 411 men for 20 years to study the number of recorded criminal offenses as a function of observable traits of criminals (Nagin and Land, 1993)
Academic productivity: based on a sample of about 900 doctoral candidates, Long (1997) analyzes the relation between the number of doctoral publications in the ﬁnal three years of Ph.D. studies and gender, marital status, number of young children, prestige of Ph.D. department, and number of articles by mentor in the preceding three years
Manufacturing: the number of defects per area in a manufacturing process is studied by Lambert (1992) using data from a soldering experiment at AT&T Bell Laboratories.

Rare events example

In this example we examine the yearly incidence of shark attacks for the state of Florida from 1946 to 1999, as described below by Jeffrey Simonoff in his book Analyzing Categorical Data.

The Florida Museum of Natural History maintains the International Shark Attack File, a record of unprovoked shark attacks (and resultant fatalities) worldwide. Patterns in Florida are particularly interesting since the peninsular nature of the state means that almost all Florida residents and visitors are within 90 minutes of the Atlantic Ocean or Gulf of Mexico. Further, since population figures for Florida are available, it is possible to account for changes in coastal population by examining patterns in attack rates, rather than simply attack counts.

Here are plots of population size, numbers of attacks, and rates of attacks (per 100,000 person years) against calendar year. The data is available at the web-page noted above under the name \(floridashark\).

To analyze the data we first consider using a Poisson model as shark attacks are reasonably rare events. We can do this by applying a generalized linear model assuming the Poisson model using a log link to connect the Poisson mean \(\mu\) with rate change over time. Dividing fitted means by population size yields a model for rates. To do this we introduce an offset term for the log of the population size on the right hand side of the model equation.

Here are the model fitting steps and a plot of the resultant fit. Note that the time trend follows an exponential pattern rather than linear, due to the use of the log link.

poisRegFit <- glm( Attacks ~ offset(log(Pop100Thou)) + I(Year-1946), family=poisson, data=sharks)
plot(sharks$Year,sharks$Rate,xlab="Year", ylab="Rate")
predRate <- predict(poisRegFit,type="response")/(sharks$Pop100Thou)
lines(sharks$Year,predRate,col=2)

Here is the default summary output for the generalized model plus a transformed version. Note that the both summaries indicate that the rate of shark attacks is rising by approximately 3% a year according to the model.

## 
## Call:
## glm(formula = Attacks ~ offset(log(Pop100Thou)) + I(Year - 1946), 
##     family = poisson, data = sharks)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1470  -1.2001  -0.3177   0.7281   3.4856  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -3.614831   0.178844 -20.212  < 2e-16 ***
## I(Year - 1946)  0.031174   0.004361   7.148  8.8e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 176.93  on 53  degrees of freedom
## Residual deviance: 119.11  on 52  degrees of freedom
## AIC: 288.76
## 
## Number of Fisher Scoring iterations: 5

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.0269215	0.1788439	-20.212212	0	0.018763	0.0378397
I(Year - 1946)	1.0316649	0.0043612	7.148051	0	1.023034	1.0406839

fits <- predict(poisRegFit,type="response")
residSquared <-  (sharks$Attacks -  fits)^2
resids <- sharks$Attacks - fits

Overdispersion

Here are the summaries for a quasiposson fit.

## 
## Call:
## glm(formula = Attacks ~ offset(log(Pop100Thou)) + Year, family = quasipoisson, 
##     data = sharks)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1470  -1.2001  -0.3177   0.7281   3.4856  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.279344  13.021371  -4.936 8.60e-06 ***
## Year          0.031174   0.006559   4.753 1.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.26181)
## 
##     Null deviance: 176.93  on 53  degrees of freedom
## Residual deviance: 119.11  on 52  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.000000	13.0213706	-4.93645	8.60e-06	0.000000	0.000000
Year	1.031665	0.0065589	4.75291	1.62e-05	1.018819	1.045388

Fitting non-linear relationships with splines

As mentioned in last Wednesday’s lectures, spline functions provide a foundation for fitting a wide variety of non-linear patterns while avoiding the numerical problems associated with use of polynomials.

In general splines are piecewise polynomical functions, joined at points on the x-axis called knots
Section 2.4.3 in Harrell’s “Regression Modelling Strategies” provides a quick self-contained introduction

Linear splines

\[f ( x ) = \beta _ { 0 } + \beta _ { 1 } x + \beta _ { 2 } ( x - a ) _ { + } + \beta _ { 3 } ( x - b ) _ { + } + \beta _ { 4 } ( x - c ) _ { + } \]

where

\[\begin{aligned} ( u ) _ { + } = u , u & > 0 \\ 0 , & u \leq 0 \end{aligned}\]

Natural cubic splines

\[f ( x ) = \beta _ { 0 } + \beta _ { 1 } x _ { 1 } + \beta _ { 2 } x _ { 2 } + \ldots + \beta _ { k - 1 } x _ { k - 1 }\]

\[\text { where } x _ { 1 } = x \text { and for } j = 1 , \ldots , k - 2\]

\[\begin{aligned} x _ { j + 1 } & = \left( x - t _ { j } \right) _ { + } ^ { 3 } - \left( x - t _ { k - 1 } \right) _ { + } ^ { 3 } \left( t _ { k } - t _ { j } \right) / \left( t _ { k } - t _ { k - 1 } \right) \\ & + \left( x - t _ { k } \right) _ { + } ^ { 3 } \left( t _ { k - 1 } - t _ { j } \right) / \left( t _ { k } - t _ { k - 1 } \right) \end{aligned}\]

## 
## Call:
## glm(formula = Attacks ~ offset(log(Pop100Thou)) + ns(Year, 5), 
##     family = quasipoisson, data = sharks)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0629  -0.8824  -0.4366   0.5879   4.1116  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.9291     1.3184  -3.739 0.000493 ***
## ns(Year, 5)1   2.0074     1.2466   1.610 0.113874    
## ns(Year, 5)2   2.1586     1.4457   1.493 0.141955    
## ns(Year, 5)3   1.6099     0.8765   1.837 0.072437 .  
## ns(Year, 5)4   5.0974     2.8227   1.806 0.077217 .  
## ns(Year, 5)5   1.7772     0.4949   3.591 0.000772 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.217138)
## 
##     Null deviance: 176.93  on 53  degrees of freedom
## Residual deviance: 104.09  on 48  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

Analyzing Aggregated Counts

Data was collected by Doll and Hill in a ground-breaking follow-up study of smoking habits and lung cancer incidence in a cohort of British doctors. The data was aggregated by smoking status and 10 year age category. During the course of the study the physicians would age across category boundaries, so a person years approach was taken to determine denominators for rate calculations.

	British Doctor’s Study
	non-Smokers			Smokers
	deaths	pyears	rate/10^4 pyears	deaths	pyears	rate/10^4 pyears	RR
35-44	2	18790	1.064396	32	52407	6.106054	5.74
45-54	12	10673	11.243324	104	43248	24.047355	2.14
55-64	28	5710	49.036778	206	28612	71.997763	1.47
65-74	28	2585	108.317215	186	12663	146.884625	1.36
75-84	31	1462	212.038304	102	5317	191.837502	0.90

Analyzing rates with Poisson regression, a.k.a. log-linear regression

llfit <- glm(deaths ~ offset(log(pyears)) + smokec + agec, family=poisson, data=df3)
kableN(tidy(llfit,expo=TRUE,conf.int=TRUE))

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.0003636	0.1917612	-41.297843	0.0000000	0.0002454	0.0005213
smokecyes	1.4255185	0.1073740	3.301875	0.0009604	1.1608784	1.7692033
agec45-54	4.4105836	0.1951028	7.606281	0.0000000	3.0455744	6.5603262
agec55-64	13.8391996	0.1837267	14.301161	0.0000000	9.7999085	20.1835186
agec65-74	28.5167828	0.1847986	18.130508	0.0000000	20.1458266	41.6671774
agec75-84	40.4512058	0.1922190	19.249382	0.0000000	28.1114400	59.8685311

anova(llfit,test="Chi")

## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: deaths
## 
## Terms added sequentially (first to last)
## 
## 
##        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       9     935.07              
## smokec  1    29.09         8     905.98 6.905e-08 ***
## agec    4   893.84         4      12.13 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residual deviance is statisticially significant, indicating presence of an interaction. Rather than using up all the degrees of freedom, we note a decreasing trend in the relative risks by age group that suggests using a one degree of freedom interaction component

df3 <- within(df3, xact1df <- as.integer(agec)* as.integer(smokec))
llfit <- glm(deaths ~ offset(log(pyears)) + agec + smokec + xact1df,
    family=poisson, data=df3)
summary(llfit)

## 
## Call:
## glm(formula = deaths ~ offset(log(pyears)) + agec + smokec + 
##     xact1df, family = poisson, data = df3)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7  
## -0.87734   0.20744   0.61999  -0.43806   0.00608   0.27117  -0.06891  
##        8         9        10  
## -0.21863   0.17544  -0.00335  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.27687    0.23916 -34.609  < 2e-16 ***
## agec45-54    2.04327    0.26781   7.630 2.35e-14 ***
## agec55-64    3.76578    0.41228   9.134  < 2e-16 ***
## agec65-74    5.06816    0.58228   8.704  < 2e-16 ***
## agec75-84    5.96584    0.74648   7.992 1.33e-15 ***
## smokecyes    1.44494    0.37289   3.875 0.000107 ***
## xact1df     -0.30873    0.09727  -3.174 0.001503 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 935.067  on 9  degrees of freedom
## Residual deviance:   1.546  on 3  degrees of freedom
## AIC: 70.614
## 
## Number of Fisher Scoring iterations: 4

The residual deviance indicates a satisfactory fit. Comparing observed and fitted age-specific relative risks we see:

	Observed	Fitted
35-44	5.74	3.11
45-54	2.14	2.29
55-64	1.47	1.68
65-74	1.36	1.23
75-84	0.90	0.91

Parametric models for Overdispersion

Negative binomial models

Recall a negative binomial distribution describes the probability of x failures before the \(r^{th}\) success in a series of independent Bernoulli trials with success probability \(p\).

Probability distribtion function for the negative binomial:

\[P ( X = x ) = \left( \begin{array} { c } { x + r - 1 } \\ { x } \end{array} \right) p ^ { r } ( 1 - p ) ^ { x }\]

Negative binomial can be derived as a mixture model.
Mixture model is constructed from a fixed parameter model by letting a parameter vary randomly.
- Normal mixture model for outliers, \(\epsilon \sim N(0,k\sigma^2)\), where \(k = 1\) with probability \(1-\delta\) and \(k=3\) with probability \(\delta\).
- e.g. Poisson stopped binomial, \(Y \sim Binomial(N,p)\) where \(N \sim Poisson(\mu)\)
- result is just another Poisson !!
- negative binomial arises from Poisson by letting \(\mu\) follow a \(\Gamma\) (Gamma) distribution
- varying parameterizations of \(\Gamma\) give rise to different forms of negative binomial

Negative binomial, type 2

\[\begin{array} { c } { f ( y | \mu , \alpha ) = \frac { \Gamma \left( y + \alpha ^ { - 1 } \right) } { \Gamma ( y + 1 ) \Gamma \left( \alpha ^ { - 1 } \right) } \left( \frac { \alpha ^ { - 1 } } { \alpha ^ { - 1 } + \mu } \right) ^ { a } \left( \frac { \mu } { \alpha ^ { - 1 } + \mu } \right) ^ { y } } \\ { \alpha \geq 0 , y = 0,1,2 , \ldots } \end{array}\]

formalism above not too helpful, but yields \(E(Y) = \mu\) and \(Var(Y) = \mu + \alpha \mu^2\)

Negative binomial, type 1

Changing the parameterization of the \(\Gamma\) distribution leads to a negative binomial distribution for \(Y\), with \(E(Y)=\mu\) and \(Var(Y) = (1+\alpha)\mu\).

Note that letting \(\phi = (1+\alpha)\) yields \(Var(Y) = \phi \mu\)
- i.e. is overdispersion parameter, since \(Var(Y) = \mu\) same as \(Poisson\) when \(\phi =1\)

Example

library(MASS)
negbinRegFit <- glm.nb( Attacks ~ offset(log(Pop100Thou)) + ns(Year,4), data=sharks)
summary(negbinRegFit)

## 
## Call:
## glm.nb(formula = Attacks ~ offset(log(Pop100Thou)) + ns(Year, 
##     4), data = sharks, init.theta = 7.346669212, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6564  -0.7973  -0.2495   0.4540   2.8194  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -4.6442     0.7442  -6.241 4.35e-10 ***
## ns(Year, 4)1   1.5776     0.7131   2.212  0.02694 *  
## ns(Year, 4)2   1.3127     0.5790   2.267  0.02337 *  
## ns(Year, 4)3   4.6425     1.6444   2.823  0.00475 ** 
## ns(Year, 4)4   1.5198     0.3578   4.248 2.16e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(7.3467) family taken to be 1)
## 
##     Null deviance: 95.768  on 53  degrees of freedom
## Residual deviance: 60.325  on 49  degrees of freedom
## AIC: 268.41
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  7.35 
##           Std. Err.:  3.48 
## 
##  2 x log-likelihood:  -256.412

Note that the overdispersion parameter \(Theta = \frac{1}{\alpha}\) is estimated by maximum likelihood. Large values of Theta indicate homogeneous Poisson behaviour.