Recall that a discrete variable Y is said to have a Poisson distribution, if it satisfies
Pr(Y=y)=e−μμ−yy!,y=0,1,2,…
Note that E(Y)=Var(Y)=μ
The Poisson distribution can be derived as a limiting distribution of Y∼Binomial(n,p) when n−>∞ and p−>0 in such a way that np−>μ.
(from Cameron & Travidi, Chapter 1) - derived as a limiting case of the binomial (Poisson, 1837) - early applications: - annual number of deaths in the Prussian army from being kicked by mules (Bortkiewicz, 1898) - epidemic model - negative binomial distribution used to model overdispersion in disease counts due to heterogeneous contagion (Greenwood and Yule, 1920)
Health economics research, linking health service utilization (e.g. number of visits to doctors, number of days in hospital) to non-health variables such as income, price, insured status
Insurance: modeling the frequency of accidents and the number of insurance claims
Politics - frequency with which U.S. presidents were able to appoint U.S. Supreme Court Justices (King, 1987a).
Criminal justice - based on longitudinal data on 411 men for 20 years to study the number of recorded criminal offenses as a function of observable traits of criminals (Nagin and Land, 1993)
Academic productivity: based on a sample of about 900 doctoral candidates, Long (1997) analyzes the relation between the number of doctoral publications in the final three years of Ph.D. studies and gender, marital status, number of young children, prestige of Ph.D. department, and number of articles by mentor in the preceding three years
Manufacturing: the number of defects per area in a manufacturing process is studied by Lambert (1992) using data from a soldering experiment at AT&T Bell Laboratories.
In this example we examine the yearly incidence of shark attacks for the state of Florida from 1946 to 1999, as described below by Jeffrey Simonoff in his book Analyzing Categorical Data.
The Florida Museum of Natural History maintains the International Shark Attack File, a record of unprovoked shark attacks (and resultant fatalities) worldwide. Patterns in Florida are particularly interesting since the peninsular nature of the state means that almost all Florida residents and visitors are within 90 minutes of the Atlantic Ocean or Gulf of Mexico. Further, since population figures for Florida are available, it is possible to account for changes in coastal population by examining patterns in attack rates, rather than simply attack counts.
Here are plots of population size, numbers of attacks, and rates of attacks (per 100,000 person years) against calendar year. The data is available at the web-page noted above under the name floridashark.
To analyze the data we first consider using a Poisson model as shark attacks are reasonably rare events. We can do this by applying a generalized linear model assuming the Poisson model using a log link to connect the Poisson mean μ with rate change over time. Dividing fitted means by population size yields a model for rates. To do this we introduce an offset term for the log of the population size on the right hand side of the model equation.
Here are the model fitting steps and a plot of the resultant fit. Note that the time trend follows an exponential pattern rather than linear, due to the use of the log link.
poisRegFit <- glm( Attacks ~ offset(log(Pop100Thou)) + I(Year-1946), family=poisson, data=sharks)
plot(sharks$Year,sharks$Rate,xlab="Year", ylab="Rate")
predRate <- predict(poisRegFit,type="response")/(sharks$Pop100Thou)
lines(sharks$Year,predRate,col=2)
Here is the default summary output for the generalized model plus a transformed version. Note that the both summaries indicate that the rate of shark attacks is rising by approximately 3% a year according to the model.
##
## Call:
## glm(formula = Attacks ~ offset(log(Pop100Thou)) + I(Year - 1946),
## family = poisson, data = sharks)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1470 -1.2001 -0.3177 0.7281 3.4856
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.614831 0.178844 -20.212 < 2e-16 ***
## I(Year - 1946) 0.031174 0.004361 7.148 8.8e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 176.93 on 53 degrees of freedom
## Residual deviance: 119.11 on 52 degrees of freedom
## AIC: 288.76
##
## Number of Fisher Scoring iterations: 5
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 0.0269215 | 0.1788439 | -20.212212 | 0 | 0.018763 | 0.0378397 |
I(Year - 1946) | 1.0316649 | 0.0043612 | 7.148051 | 0 | 1.023034 | 1.0406839 |
fits <- predict(poisRegFit,type="response")
residSquared <- (sharks$Attacks - fits)^2
Here are the summaries for a quasiposson fit.
##
## Call:
## glm(formula = Attacks ~ offset(log(Pop100Thou)) + Year, family = quasipoisson,
## data = sharks)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1470 -1.2001 -0.3177 0.7281 3.4856
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.279344 13.021371 -4.936 8.60e-06 ***
## Year 0.031174 0.006559 4.753 1.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 2.26181)
##
## Null deviance: 176.93 on 53 degrees of freedom
## Residual deviance: 119.11 on 52 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 0.000000 | 13.0213706 | -4.93645 | 8.60e-06 | 0.000000 | 0.000000 |
Year | 1.031665 | 0.0065589 | 4.75291 | 1.62e-05 | 1.018819 | 1.045388 |
Data was collected by Doll and Hill in a ground-breaking follow-up study of smoking habits and lung cancer incidence in a cohort of British doctors. The data was aggregated by smoking status and 10 year age category. During the course of the study the physicians would age across category boundaries, so a person years approach was taken to determine denominators for rate calculations.
deaths | pyears | rate/10^4 pyears | deaths | pyears | rate/10^4 pyears | RR | |
---|---|---|---|---|---|---|---|
35-44 | 2 | 18790 | 1.064396 | 32 | 52407 | 6.106054 | 5.74 |
45-54 | 12 | 10673 | 11.243324 | 104 | 43248 | 24.047355 | 2.14 |
55-64 | 28 | 5710 | 49.036778 | 206 | 28612 | 71.997763 | 1.47 |
65-74 | 28 | 2585 | 108.317215 | 186 | 12663 | 146.884625 | 1.36 |
75-84 | 31 | 1462 | 212.038304 | 102 | 5317 | 191.837502 | 0.90 |
llfit <- glm(deaths ~ offset(log(pyears)) + smokec + agec, family=poisson, data=df3)
kableN(tidy(llfit,expo=TRUE,conf.int=TRUE))
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 0.0003636 | 0.1917612 | -41.297843 | 0.0000000 | 0.0002454 | 0.0005213 |
smokecyes | 1.4255185 | 0.1073740 | 3.301875 | 0.0009604 | 1.1608784 | 1.7692033 |
agec45-54 | 4.4105836 | 0.1951028 | 7.606281 | 0.0000000 | 3.0455744 | 6.5603262 |
agec55-64 | 13.8391996 | 0.1837267 | 14.301161 | 0.0000000 | 9.7999085 | 20.1835186 |
agec65-74 | 28.5167828 | 0.1847986 | 18.130508 | 0.0000000 | 20.1458266 | 41.6671774 |
agec75-84 | 40.4512058 | 0.1922190 | 19.249382 | 0.0000000 | 28.1114400 | 59.8685311 |
anova(llfit,test="Chi")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: deaths
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 9 935.07
## smokec 1 29.09 8 905.98 6.905e-08 ***
## agec 4 893.84 4 12.13 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The residual deviance is statisticially significant, indicating presence of an interaction. Rather than using up all the degrees of freedom, we note a decreasing trend in the relative risks by age group that suggests using a one degree of freedom interaction component
df3 <- within(df3, xact1df <- as.integer(agec)* as.integer(smokec))
llfit <- glm(deaths ~ offset(log(pyears)) + agec + smokec + xact1df,
family=poisson, data=df3)
summary(llfit)
##
## Call:
## glm(formula = deaths ~ offset(log(pyears)) + agec + smokec +
## xact1df, family = poisson, data = df3)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7
## -0.87734 0.20744 0.61999 -0.43806 0.00608 0.27117 -0.06891
## 8 9 10
## -0.21863 0.17544 -0.00335
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.27687 0.23916 -34.609 < 2e-16 ***
## agec45-54 2.04327 0.26781 7.630 2.35e-14 ***
## agec55-64 3.76578 0.41228 9.134 < 2e-16 ***
## agec65-74 5.06816 0.58228 8.704 < 2e-16 ***
## agec75-84 5.96584 0.74648 7.992 1.33e-15 ***
## smokecyes 1.44494 0.37289 3.875 0.000107 ***
## xact1df -0.30873 0.09727 -3.174 0.001503 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 935.067 on 9 degrees of freedom
## Residual deviance: 1.546 on 3 degrees of freedom
## AIC: 70.614
##
## Number of Fisher Scoring iterations: 4
The residual deviance indicates a satisfactory fit. Comparing observed and fitted age-specific relative risks we see:
Observed | Fitted | |
---|---|---|
35-44 | 5.74 | 3.11 |
45-54 | 2.14 | 2.29 |
55-64 | 1.47 | 1.68 |
65-74 | 1.36 | 1.23 |
75-84 | 0.90 | 0.91 |
Changing the parameterization of the Γ distribution leads to a negative binomial distribution for Y, with E(Y)=μ and Var(Y)=(1+α)μ.
Example
library(MASS)
negbinRegFit <- glm.nb( Attacks ~ offset(log(Pop100Thou)) + Year, data=sharks)
summary(negbinRegFit)
##
## Call:
## glm.nb(formula = Attacks ~ offset(log(Pop100Thou)) + Year, data = sharks,
## init.theta = 5.429471356, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6803 -0.8321 -0.2452 0.4347 1.9864
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -58.22912 12.35427 -4.713 2.44e-06 ***
## Year 0.02812 0.00624 4.506 6.60e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(5.4295) family taken to be 1)
##
## Null deviance: 84.819 on 53 degrees of freedom
## Residual deviance: 61.274 on 52 degrees of freedom
## AIC: 270.14
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 5.43
## Std. Err.: 2.25
##
## 2 x log-likelihood: -264.137