Processing math: 100%
  • Poisson regression Part 1
    • History of the Poisson distribution
    • Modern applications
    • Rare events example
    • Overdispersion
    • Analyzing rates with Poisson regression, a.k.a. log-linear regression
    • Parametric models for Overdispersion
    • Negative Binomial Models

Poisson regression Part 1

Recall that a discrete variable Y is said to have a Poisson distribution, if it satisfies

Pr(Y=y)=eμμyy!,y=0,1,2,

Note that E(Y)=Var(Y)=μ

The Poisson distribution can be derived as a limiting distribution of YBinomial(n,p) when n> and p>0 in such a way that np>μ.

  • e.g. random independent events in space and/or time
    • meteor hits, rare disease occurrences

History of the Poisson distribution

(from Cameron & Travidi, Chapter 1) - derived as a limiting case of the binomial (Poisson, 1837) - early applications: - annual number of deaths in the Prussian army from being kicked by mules (Bortkiewicz, 1898) - epidemic model - negative binomial distribution used to model overdispersion in disease counts due to heterogeneous contagion (Greenwood and Yule, 1920)

Modern applications

  • Health economics research, linking health service utilization (e.g. number of visits to doctors, number of days in hospital) to non-health variables such as income, price, insured status

  • Insurance: modeling the frequency of accidents and the number of insurance claims

  • Politics - frequency with which U.S. presidents were able to appoint U.S. Supreme Court Justices (King, 1987a).

  • Criminal justice - based on longitudinal data on 411 men for 20 years to study the number of recorded criminal offenses as a function of observable traits of criminals (Nagin and Land, 1993)

  • Academic productivity: based on a sample of about 900 doctoral candidates, Long (1997) analyzes the relation between the number of doctoral publications in the final three years of Ph.D. studies and gender, marital status, number of young children, prestige of Ph.D. department, and number of articles by mentor in the preceding three years

  • Manufacturing: the number of defects per area in a manufacturing process is studied by Lambert (1992) using data from a soldering experiment at AT&T Bell Laboratories.

Rare events example

In this example we examine the yearly incidence of shark attacks for the state of Florida from 1946 to 1999, as described below by Jeffrey Simonoff in his book Analyzing Categorical Data.

The Florida Museum of Natural History maintains the International Shark Attack File, a record of unprovoked shark attacks (and resultant fatalities) worldwide. Patterns in Florida are particularly interesting since the peninsular nature of the state means that almost all Florida residents and visitors are within 90 minutes of the Atlantic Ocean or Gulf of Mexico. Further, since population figures for Florida are available, it is possible to account for changes in coastal population by examining patterns in attack rates, rather than simply attack counts.

Here are plots of population size, numbers of attacks, and rates of attacks (per 100,000 person years) against calendar year. The data is available at the web-page noted above under the name floridashark.

To analyze the data we first consider using a Poisson model as shark attacks are reasonably rare events. We can do this by applying a generalized linear model assuming the Poisson model using a log link to connect the Poisson mean μ with rate change over time. Dividing fitted means by population size yields a model for rates. To do this we introduce an offset term for the log of the population size on the right hand side of the model equation.

Here are the model fitting steps and a plot of the resultant fit. Note that the time trend follows an exponential pattern rather than linear, due to the use of the log link.

poisRegFit <- glm( Attacks ~ offset(log(Pop100Thou)) + I(Year-1946), family=poisson, data=sharks)
plot(sharks$Year,sharks$Rate,xlab="Year", ylab="Rate")
predRate <- predict(poisRegFit,type="response")/(sharks$Pop100Thou)
lines(sharks$Year,predRate,col=2)

Here is the default summary output for the generalized model plus a transformed version. Note that the both summaries indicate that the rate of shark attacks is rising by approximately 3% a year according to the model.

## 
## Call:
## glm(formula = Attacks ~ offset(log(Pop100Thou)) + I(Year - 1946), 
##     family = poisson, data = sharks)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1470  -1.2001  -0.3177   0.7281   3.4856  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -3.614831   0.178844 -20.212  < 2e-16 ***
## I(Year - 1946)  0.031174   0.004361   7.148  8.8e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 176.93  on 53  degrees of freedom
## Residual deviance: 119.11  on 52  degrees of freedom
## AIC: 288.76
## 
## Number of Fisher Scoring iterations: 5
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.0269215 0.1788439 -20.212212 0 0.018763 0.0378397
I(Year - 1946) 1.0316649 0.0043612 7.148051 0 1.023034 1.0406839
fits <- predict(poisRegFit,type="response")
residSquared <-  (sharks$Attacks -  fits)^2

Overdispersion

Here are the summaries for a quasiposson fit.

## 
## Call:
## glm(formula = Attacks ~ offset(log(Pop100Thou)) + Year, family = quasipoisson, 
##     data = sharks)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1470  -1.2001  -0.3177   0.7281   3.4856  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -64.279344  13.021371  -4.936 8.60e-06 ***
## Year          0.031174   0.006559   4.753 1.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.26181)
## 
##     Null deviance: 176.93  on 53  degrees of freedom
## Residual deviance: 119.11  on 52  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.000000 13.0213706 -4.93645 8.60e-06 0.000000 0.000000
Year 1.031665 0.0065589 4.75291 1.62e-05 1.018819 1.045388

Analyzing Aggregated Counts

Data was collected by Doll and Hill in a ground-breaking follow-up study of smoking habits and lung cancer incidence in a cohort of British doctors. The data was aggregated by smoking status and 10 year age category. During the course of the study the physicians would age across category boundaries, so a person years approach was taken to determine denominators for rate calculations.

British Doctor’s Study
non-Smokers
Smokers
deaths pyears rate/10^4 pyears deaths pyears rate/10^4 pyears RR
35-44 2 18790 1.064396 32 52407 6.106054 5.74
45-54 12 10673 11.243324 104 43248 24.047355 2.14
55-64 28 5710 49.036778 206 28612 71.997763 1.47
65-74 28 2585 108.317215 186 12663 146.884625 1.36
75-84 31 1462 212.038304 102 5317 191.837502 0.90

Analyzing rates with Poisson regression, a.k.a. log-linear regression

llfit <- glm(deaths ~ offset(log(pyears)) + smokec + agec, family=poisson, data=df3)
kableN(tidy(llfit,expo=TRUE,conf.int=TRUE))
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.0003636 0.1917612 -41.297843 0.0000000 0.0002454 0.0005213
smokecyes 1.4255185 0.1073740 3.301875 0.0009604 1.1608784 1.7692033
agec45-54 4.4105836 0.1951028 7.606281 0.0000000 3.0455744 6.5603262
agec55-64 13.8391996 0.1837267 14.301161 0.0000000 9.7999085 20.1835186
agec65-74 28.5167828 0.1847986 18.130508 0.0000000 20.1458266 41.6671774
agec75-84 40.4512058 0.1922190 19.249382 0.0000000 28.1114400 59.8685311
anova(llfit,test="Chi")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: deaths
## 
## Terms added sequentially (first to last)
## 
## 
##        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       9     935.07              
## smokec  1    29.09         8     905.98 6.905e-08 ***
## agec    4   893.84         4      12.13 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The residual deviance is statisticially significant, indicating presence of an interaction. Rather than using up all the degrees of freedom, we note a decreasing trend in the relative risks by age group that suggests using a one degree of freedom interaction component

df3 <- within(df3, xact1df <- as.integer(agec)* as.integer(smokec))
llfit <- glm(deaths ~ offset(log(pyears)) + agec + smokec + xact1df,
    family=poisson, data=df3)
summary(llfit)
## 
## Call:
## glm(formula = deaths ~ offset(log(pyears)) + agec + smokec + 
##     xact1df, family = poisson, data = df3)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7  
## -0.87734   0.20744   0.61999  -0.43806   0.00608   0.27117  -0.06891  
##        8         9        10  
## -0.21863   0.17544  -0.00335  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.27687    0.23916 -34.609  < 2e-16 ***
## agec45-54    2.04327    0.26781   7.630 2.35e-14 ***
## agec55-64    3.76578    0.41228   9.134  < 2e-16 ***
## agec65-74    5.06816    0.58228   8.704  < 2e-16 ***
## agec75-84    5.96584    0.74648   7.992 1.33e-15 ***
## smokecyes    1.44494    0.37289   3.875 0.000107 ***
## xact1df     -0.30873    0.09727  -3.174 0.001503 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 935.067  on 9  degrees of freedom
## Residual deviance:   1.546  on 3  degrees of freedom
## AIC: 70.614
## 
## Number of Fisher Scoring iterations: 4

The residual deviance indicates a satisfactory fit. Comparing observed and fitted age-specific relative risks we see:

Observed Fitted
35-44 5.74 3.11
45-54 2.14 2.29
55-64 1.47 1.68
65-74 1.36 1.23
75-84 0.90 0.91

Parametric models for Overdispersion

Negative binomial models

  • Negative binomial can be derived as a mixture model.
  • Mixture model is constructed from a fixed parameter model by letting a parameter vary randomly.
    • e.g. Poisson stopped binomial, Y Binomial(N,p) where N Poisson(μ)
    • result is just another Poisson !!
    • negative binomial arises from Poisson by letting μ follow a Γ (Gamma) distribution
    • varying parameterizations of Γ give rise to different forms of negative binomial

Negative binomial, type 2

f(y|μ,α)=Γ(y+α1)Γ(y+1)Γ(α1)(α1α1+μ)a(μα1+μ)yα0,y=0,1,2,
  • formalism above not too helpful, but yields E(Y)=μ and Var(Y)=μ+αμ2

Negative binomial, type 1

Changing the parameterization of the Γ distribution leads to a negative binomial distribution for Y, with E(Y)=μ and Var(Y)=(1+α)μ.

  • Note that letting ϕ=(1+α) yields Var(Y)=ϕμ
    • i.e. is overdispersion parameter, since Var(Y)=μ same as Poisson when ϕ=1

Example

library(MASS)
negbinRegFit <- glm.nb( Attacks ~ offset(log(Pop100Thou)) + Year, data=sharks)
summary(negbinRegFit)
## 
## Call:
## glm.nb(formula = Attacks ~ offset(log(Pop100Thou)) + Year, data = sharks, 
##     init.theta = 5.429471356, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6803  -0.8321  -0.2452   0.4347   1.9864  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -58.22912   12.35427  -4.713 2.44e-06 ***
## Year          0.02812    0.00624   4.506 6.60e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(5.4295) family taken to be 1)
## 
##     Null deviance: 84.819  on 53  degrees of freedom
## Residual deviance: 61.274  on 52  degrees of freedom
## AIC: 270.14
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  5.43 
##           Std. Err.:  2.25 
## 
##  2 x log-likelihood:  -264.137

Negative Binomial Models