An example of count data analysis
Gilbert Berdine MD ^{a}, Shengping Yang PhD^{b}
Correspondence to Gilbert Berdine MD
Email: Gilbert.Berdine@ttuhsc.edu
SWRCCC 2015;3(11):5559
doi:10.12746/swrccc2015.0310.149
Previously, we have introduced Poisson and Negative binomial regressions for modeling count data. Here we will use a real example to demonstrate how to use SAS software performing such analyses.
A new oral antibiotic drug Gorilacillin was developed and has had excellent effects in several clinical trials. Gorilacillin has two side effects, including rash and elevated liver function tests (ELFT). To evaluate whether different patient groups have different risks of having such side effects, a Gorilacillin side effect study was conducted. A total of 5,275 participants were recruited from five participating countries. All the patients were followed for up to one month, and the number of patients who developed rash or ELFT was recorded. The goal of the study was to investigate whether there was a significant difference in developing side effects for patients in different age groups.
Data from the study were collected and stored in an Excel table (see below for a partial view of the table).
Country 
Age 
# Patients 
# Rash 
# ELFT 
Great Britain 
(04) 
65 
1 
0 
Great Britain 
(59) 
18 
0 
0 
Great Britain 
(1014) 
229 
1 
1 
Great Britain 
(1519) 
59 
1 
1 
Great Britain 
(2024) 
65 
0 
0 
Great Britain 
(2529) 
49 
2 
0 
… 




It is typical to use a Poisson or negative binomial regression for analyzing such data since the outcome is a side effect count, and the probability for developing side effects is low – rare events. Meanwhile, because rash and ELFT are the two main side effects of Gorilacillin, an event is declared if a patient develops either rash or ELFT. Note that the numbers of patients in different age groups/countries are different; thus it makes sense to model the rate of side effect per patient, as a function of age group and country, to adjust for the differences in number of patients.
A scatter plot can usually help visualize potential relationships. As we can see from Figure 1, there does not seem to have a strong relationship between side effect rate and age group. Also the side effect rates are quite similar among countries.
To apply a Poisson regression, we have,
To adjust for the numbers of patients in different age groups/countries, we use the rate of side effect (dividing the expected number of events by the number of patients) as the outcome variable. Equivalently, the above equation can also be written as,
,
where the additional term on the righthand side, log(n), is called an offset. Corresponding to the above two models, there are two equivalent SAS statements:
proc genmod data=data;
class country age;
model incident/n = age country / dist= poisson link=log;
lsmeans age / ilink diff cl;
run;
Or equivalently,
proc genmod data=data;
class country age;
model incident = age country / offset=logn dist=poisson link=log;
lsmeans age / ilink diff cl;
run;
Note that, in the second equation, logn=log(n), where n is the number of patients in a specific group. The lsmeans statement can be used to obtain the side effect rate estimates for the 10 age groups, averaged over countries. The ilink option specifies the inverse link function to be used for calculating the rate estimates, and the cl option produces the confidence intervals. In addition, the diff option provides all pairwise comparisons of side effect rates among age groups.
Analysis Of Maximum Likelihood Parameter Estimates 


Parameter 

DF 
Estimate 
Standard Error 
Wald 95% Confidence Limits 
Wald ChiSquare 
Pr > ChiSq 

Intercept 

1 
3.8135 
0.3599 
4.5189 
3.1081 
112.28 
<.0001 
Age 
(04) 
1 
0.5275 
0.4444 
1.3984 
0.3435 
1.41 
0.2352 
Age 
(1014) 
1 
0.8328 
0.6796 
2.1647 
0.4992 
1.50 
0.2204 
Age 
(1519) 
1 
0.0258 
0.3575 
0.6749 
0.7266 
0.01 
0.9424 
Age 
(2024) 
1 
0.3447 
0.4277 
1.1830 
0.4936 
0.65 
0.4203 
Age 
(2529) 
1 
0.6658 
0.4733 
0.2618 
1.5934 
1.98 
0.1595 
Age 
(3034) 
1 
0.4745 
0.5735 
1.5986 
0.6495 
0.68 
0.4080 
Age 
(3539) 
1 
0.3067 
0.6396 
1.5603 
0.9469 
0.23 
0.6316 
Age 
(4044) 
1 
0.4021 
0.5016 
1.3853 
0.5810 
0.64 
0.4227 
Age 
(4549) 
1 
0.9098 
0.5660 
2.0192 
0.1995 
2.58 
0.1080 
Age 
(59) 
0 
0.0000 
0.0000 
0.0000 
0.0000 
. 
. 
country 
Great Britain 
1 
0.1291 
0.4409 
0.9931 
0.7350 
0.09 
0.7697 
country 
India 
1 
0.2636 
0.3141 
0.8792 
0.3520 
0.70 
0.4013 
country 
Japan 
1 
0.1973 
0.3572 
0.8974 
0.5027 
0.31 
0.5806 
country 
Turkey 
1 
0.1805 
0.4629 
1.0877 
0.7268 
0.15 
0.6967 
country 
United States 
0 
0.0000 
0.0000 
0.0000 
0.0000 
. 
. 
Scale 

0 
1.0000 
0.0000 
1.0000 
1.0000 


The above SAS output table shows that age group was not significantly associated with Gorilacillin side effect rate and there was no significant difference among the 5 countries.
Age Least Squares Means 


Age 
Estimate 
Standard Error 
z Value 
Pr > z 
Alpha 
Lower 
Upper 
Mean 
Standard Error 
Lower Mean 
Upper Mean 
(04) 
4.4951 
0.3556 
12.64 
<.0001 
0.05 
5.1921 
3.7981 
0.01116 
0.003970 
0.005560 
0.02241 
(59) 
3.9676 
0.2778 
14.28 
<.0001 
0.05 
4.5121 
3.4231 
0.01892 
0.005256 
0.01098 
0.03261 
(1014) 
4.8004 
0.6035 
7.95 
<.0001 
0.05 
5.9832 
3.6175 
0.008227 
0.004965 
0.002521 
0.02685 
(1519) 
3.9418 
0.2602 
15.15 
<.0001 
0.05 
4.4517 
3.4319 
0.01941 
0.005051 
0.01166 
0.03233 
(2024) 
4.3123 
0.3369 
12.80 
<.0001 
0.05 
4.9726 
3.6520 
0.01340 
0.004516 
0.006925 
0.02594 
(2529) 
3.3018 
0.3809 
8.67 
<.0001 
0.05 
4.0483 
2.5553 
0.03682 
0.014020 
0.01745 
0.07767 
(3034) 
4.4421 
0.5144 
8.64 
<.0001 
0.05 
5.4503 
3.4340 
0.01177 
0.006055 
0.004295 
0.03226 
(3539) 
4.2743 
0.5877 
7.27 
<.0001 
0.05 
5.4261 
3.1224 
0.01392 
0.008182 
0.004400 
0.04405 
(4044) 
4.3698 
0.4182 
10.45 
<.0001 
0.05 
5.1894 
3.5501 
0.01265 
0.005292 
0.005575 
0.02872 
(4549) 
4.8775 
0.5061 
9.64 
<.0001 
0.05 
5.8693 
3.8856 
0.007616 
0.003854 
0.002825 
0.02054 
From the lsmeans estimates, we see that the estimated side effect rate for patents 04 years old was 1.1% (the Mean column; table above) with a confidence interval (0.6%, 2.2%; the Lower Mean and Upper Mean columns), and for the 59 years old it was 1.9% with a confidence interval (1.1%, 3.3%), etc.
The diff option does provide all pairwise comparisons should such comparisons be of interest (table below shows part of the comparisons).
Differences of Age Least Squares Means 


Age 
_Age 
Estimate 
Standard Error 
z Value 
Pr > z 
Alpha 
Lower 
Upper 
(04) 
(1014) 
0.3053 
0.7116 
0.43 
0.6679 
0.05 
1.0894 
1.7000 
(04) 
(1519) 
0.5533 
0.4237 
1.31 
0.1916 
0.05 
1.3837 
0.2771 
(04) 
(2024) 
0.1828 
0.4870 
0.38 
0.7074 
0.05 
1.1372 
0.7716 
(04) 
(2529) 
1.1933 
0.5264 
2.27 
0.0234 
0.05 
2.2250 
0.1616 
(04) 
(3034) 
0.05294 
0.6021 
0.09 
0.9299 
0.05 
1.2330 
1.1271 
(04) 
(3539) 
0.2208 
0.6694 
0.33 
0.7415 
0.05 
1.5329 
1.0912 
(04) 
(4044) 
0.1253 
0.5363 
0.23 
0.8152 
0.05 
1.1764 
0.9257 
(04) 
(4549) 
0.3824 
0.6129 
0.62 
0.5327 
0.05 
0.8189 
1.5837 
(04) 
(59) 
0.5275 
0.4444 
1.19 
0.2352 
0.05 
1.3984 
0.3435 
(1014) 
(1519) 
0.8586 
0.6681 
1.29 
0.1988 
0.05 
2.1681 
0.4509 
… 








… 
… 
… 
… 
… 
… 
… 
… 
Now, recall that we previously explained that a negative binomial regression model might be more appropriate should data overdispersion exist. To test overdispersion, an easy way is to apply a negative binomial regression with scale=0 and noscale options in the model statement. These options test whether overdispersion of the form μ+kμ^2 exists by testing whether the dispersions parameter equals to 0.
proc genmod data=data;
class country age;
model incident/n = age country / dist= nb link=log scale=0 noscale;
run;
Lagrange Multiplier Statistics 


Parameter 
ChiSquare 
Pr > ChiSq 

Dispersion 
8309.4881 
<.0001* 

* Onesided pvalue 
From the above test of overdispersion result, we can see that the p value is less than 0.0001, and thus it is appropriate to use a negative binomial regression.
Analysis Of Maximum Likelihood Parameter Estimates 


Parameter 

DF 
Estimate 
Standard Error 
Wald 95% Confidence Limits 
Wald ChiSquare 
Pr > ChiSq 

Intercept 

1 
3.6458 
0.3249 
4.2826 
3.0091 
125.94 
<.0001 
Age 
(04) 
1 
0.5332 
0.4053 
1.3276 
0.2612 
1.73 
0.1883 
Age 
(1014) 
1 
0.6927 
0.5926 
1.8542 
0.4688 
1.37 
0.2424 
Age 
(1519) 
1 
0.1166 
0.3365 
0.7761 
0.5429 
0.12 
0.7289 
Age 
(2024) 
1 
0.4504 
0.3971 
1.2287 
0.3279 
1.29 
0.2567 
Age 
(2529) 
1 
0.6895 
0.4348 
0.1627 
1.5417 
2.51 
0.1128 
Age 
(3034) 
1 
0.4385 
0.5165 
1.4509 
0.5738 
0.72 
0.3959 
Age 
(3539) 
1 
0.4003 
0.4448 
0.4715 
1.2721 
0.81 
0.3681 
Age 
(4044) 
1 
0.3026 
0.4460 
1.1768 
0.5715 
0.46 
0.4974 
Age 
(4549) 
1 
0.3803 
0.4186 
1.2009 
0.4402 
0.83 
0.3636 
Age 
(59) 
0 
0.0000 
0.0000 
0.0000 
0.0000 
. 
. 
country 
Great Britain 
1 
0.1721 
0.4080 
0.9717 
0.6275 
0.18 
0.6731 
country 
India 
1 
0.2972 
0.2858 
0.8573 
0.2629 
1.08 
0.2983 
country 
Japan 
1 
0.0361 
0.3020 
0.5558 
0.6280 
0.01 
0.9048 
country 
Turkey 
1 
0.1599 
0.4131 
0.9695 
0.6498 
0.15 
0.6987 
country 
United States 
0 
0.0000 
0.0000 
0.0000 
0.0000 
. 
. 
Dispersion 

1 
1.0492 
0.0141 
1.0219 
1.0773 


The result from the negative binomial regression (table above) is similar to that from the Poisson regression. We did not detect any difference in side effect rate between the reference and other age groups. Looking at the raw data in the scatter plot (Figure 1), one might think that Americans of age 2529 were at risk for adverse effects of the drug, but the statistical analysis shows the result to be within the 95% confidence limit for purely random effect compared to the reference group. The American 2529 data point appears, at first glance, to be an outlier with some nonrandom effect, but, in fact, it is a purely random walk from the other data points.
The statistical analysis is consistent with the reality of the situation. Gorilacillin does not exist and the data was simulated by sampling rare occurrence events from an online game in which the game developers assure us that the events are, indeed, random. The game has generated all sorts of “theories” about how to elicit these rare events more often, but the statistical analysis shows the “theories” to be no more substantial than Americans age 2529. This example illustrates how rare events can seem to generate “outliers” that are merely results of small samples and rare occurrence rates.
Many times, rare events are hard to observe, and it might take quite some time before one event is observed. If feasible, one alternative strategy of studying an association between a rare event and potential risk factors is to collect data retrospectively. For example, identify the list of patients who had the event, match them with those who did not have the event, then collect all the necessary data and perform data analysis.
...................................................................................................................................................................................................................................................................................................................................
Published electronically: 7/15/2015
Conflict of Interest Disclosures: none
Refbacks
 There are currently no refbacks.
ISSN: 23259205