Plus a Python tutorial on training a ZIP model on data sets having excess zeroes
In this article, we’ll learn how to build a regression model for counts-based data sets in which the dependent variable contains an excess of zero-valued observations.
Counts data sets are ones where the dependent variable is a count of events, such as:
- Number of vehicles crossing an intersection per hour.
- Number of ER visits happening each month.
- Number of motor vehicle insurance claims filed per year.
- Number of defects found in a mass-produced printed circuit board.
Many real-world phenomena produce counts that are almost always zero. For example:
- Number of times a machine fails each month.
- Number of exoplanets discovered each year.
- Number of billionaires living in each city in the world.
Such data are hard to deal with using traditional models for counts data such as the Poisson, the Binomial or the Negative Binomial regression models.
This is because such data sets contain more zero-valued counts than one would expect to observe under the traditional model’s probability distribution.
For example, if you assume that a phenomenon obeys a Poisson(5) process, i.e. a Poisson distribution with a mean rate of λ = 5 events per unit time, you would expect to see zero counts only about 0.67% of the time.
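If you’d like to verify this number yourself, a quick way to do so (using scipy, which isn’t needed anywhere else in this tutorial) is:
from scipy.stats import poisson
# Probability of observing a zero count under a Poisson distribution with mean rate λ = 5
print(poisson.pmf(0, mu=5))
# ≈ 0.0067, i.e. roughly 0.67% of the time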
If you observe zero counts far more often than that, the data set contains an excess of zeroes.
If you use a standard Poisson, Binomial or Negative Binomial (NB) regression model on such a data set, it can fit badly and will generate poor quality predictions, no matter how much you tweak its parameters.
So what is a modeler to do when faced with such data with excess zeros?
The Zero Inflated Poisson Regression model
Fortunately, there is a way to modify a standard counts model such as Poisson or Negative Binomial to account for the presence of the extra zeroes. In fact, there happen to be at least two ways to do this. One technique is known as the Hurdle model and the second technique is known as the Zero-Inflated model.
In this article, we’ll look at the zero-inflated regression model in some detail. Specifically, we’ll focus on the Zero Inflated Poisson regression model, often referred to as the ZIP model.
The structure of a ZIP model
Let’s briefly look at the structure of a regular Poisson model before we see how its structure is modified to handle excess zero counts.
Imagine a data set containing n samples and p regression variables per sample. Therefore, the regression variables X can be represented by a matrix of size (n x p) and each row x_i in the X matrix is a vector of size (1 x p) corresponding to the dependent variable value y_i:
If we assume that y is a Poisson distributed random variable, we can build a Poisson regression model for this data set. The Poisson model is made up of two parts:
- A Poisson Probability Mass Function (PMF) denoted as P(y_i=k) used to calculate the probability of observing k events in any unit interval given a mean event rate of λ events / unit time.
- A link function that is used to express the mean rate λ as a function of the regression variables X.
This is illustrated in the figure below:
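In equation form, the two parts of the Poisson model can be written as follows (using the log link, which is the standard choice for Poisson regression):
P(y_i = k) = e^(−λ_i) λ_i^k / k!,   for k = 0, 1, 2, …
λ_i = exp(x_i β)
Here β is the (p x 1) vector of regression coefficients that the training algorithm estimates.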
Normally, we assume that there is some underlying process that is producing the observed counts as per the Poisson PMF: P(y_i=k).
The intuition behind the Zero Inflated Poisson model is that there is a second underlying process that is determining whether a count is zero or non-zero. Once a count is determined to be non-zero, the regular Poisson process takes over to determine its actual non-zero value based on the Poisson process’s PMF.
Thus, a ZIP regression model consists of three parts:
- A PMF P(y_i=0) which is used to calculate the probability of observing a zero count.
- A second PMF P(y_i=k) which is used to calculate the probability of observing k events, given that k > 0.
- A link function that is used to express the mean rate λ as a function of the regression variables X.
This is illustrated in the following figure:
As before, y_i is the random variable that denotes the observed count corresponding to the regression variables row x_i=[x_i1, x_i2, x_i3,…,x_ip].
ϕ_i is a measure of the proportion of excess zeroes corresponding to the ith row (y_i, x_i) in the data set.
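In equation form, the standard ZIP parameterization corresponding to these three parts is:
P(y_i = 0) = ϕ_i + (1 − ϕ_i) e^(−λ_i)
P(y_i = k) = (1 − ϕ_i) e^(−λ_i) λ_i^k / k!,   for k = 1, 2, 3, …
λ_i = exp(x_i β)
Notice that setting ϕ_i = 0 in both expressions gives back the plain Poisson PMF.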
Getting to know ϕ_i
A simple way to understand ϕ_i is as follows:
Imagine that you take 1000 observations of y_i, each one with the same combination of regression variable values x_i=[x_i1, x_i2, x_i3,…,x_ip]. Since y_i is a random variable that follows the Poisson distribution, you may see a different value of y_i in each one of the 1000 observations.
Suppose that out of the 1000 y_i values you observe, 874 are zero. You determine that the regular Poisson distribution you have assumed for y_i can explain only up to 7 of these 874 zero values. The remaining 867 zero values are excess zero observations, so for the ith row in your data set, ϕ_i = 867/1000 = 0.867.
When the data set does not have any excess zeroes in the dependent variable, the value of ϕ works out to be zero and the PMF of the ZIP model reduces to the PMF of the standard Poisson model (you can easily verify this by setting ϕ to 0 in the ZIP model’s PMF).
How to estimate ϕ?
So how can we estimate the value of ϕ_i?
A simple and crude way of estimating ϕ_i is to set it to the ratio of the number of excess zeroes to the total number of observations (867/1000 = 0.867 in the example above).
Perhaps a more realistic way of calculating ϕ_i is to estimate it as a function of the regression variables X. This is usually done by transforming the y variable into a binary 0/1 random variable y’ (y_prime), which takes the value 0 if the underlying y is 0, and 1 in all other cases. We then train a Logistic regression model on the data set [X, y’], and it yields a vector of fitted probabilities µ_fitted=[µ_1, µ_2, µ_3,…,µ_n] (because that’s what a Logistic regression model does).
Once we have the µ_fitted vector, we simply set the ϕ vector equal to it.
Thus [ϕ_1=µ_1, ϕ_2=µ_2, ϕ_3=µ_3,…,ϕ_n=µ_n].
The above process of estimating ϕ is illustrated below:
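Here is a minimal sketch of this two-step recipe using statsmodels’ Logit class. The data frame and column names below are synthetic and purely illustrative; they are not the camping data set we will use later:
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Illustrative data: one regression variable 'x1' and a counts column 'y' with extra zeroes mixed in
rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.normal(size=200)})
df['y'] = rng.poisson(lam=2, size=200) * rng.binomial(1, 0.6, size=200)

y_prime = (df['y'] > 0).astype(int)               # y' = 0 when the count is zero, 1 otherwise
X = sm.add_constant(df[['x1']])                   # regression variables plus an intercept

logit_results = sm.Logit(y_prime, X).fit(disp=0)  # fit the Logistic regression on [X, y']
mu_fitted = logit_results.predict(X)              # fitted probabilities µ_1, µ_2, …, µ_n
phi = mu_fitted                                   # set ϕ_i = µ_i, as described above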
Once the ϕ vector is estimated, we plug it into the probability functions of the ZIP model and use what is known as the Maximum Likelihood Estimation (MLE) technique to train the ZIP model on the data set with excess zeroes.
Please see my article on Poisson Regression Model for an explanation of how MLE works.
The following figure illustrates the training sequence of the ZIP model:
Thankfully, there are many statistics packages that automate this entire procedure of estimating ϕ and using the estimated ϕ to train the ZIP model using the MLE technique on your data set.
In the rest of this article, we’ll use the Python statsmodels library to build and train a ZIP model in a single line of code.
How to train the ZIP model using Python
In our Python tutorial on the ZIP model, we’ll use a data set of camping trips taken by 250 groups of people:
The data set is available here. Here are a couple of salient features of this data set:
- The campers may or may not have done some fishing during their trip.
- If a group did some fishing, they would have caught zero or more fish.
- We want to estimate not only how many fish were caught (if there was fishing done by a camping group), but also the probability that the camping group caught any fish at all.
Thus, there are two distinct data generation processes involved:
- A process that determines whether or not a camping group caught any fish at all: The ZIP model will internally use the Logistic Regression model that was explained earlier to model this binary process.
- A second process that determines how many fish were caught by a camping group, given that there was at least one fish caught by the group: The ZIP model will use a regular Poisson model for modeling this second process.
Variables in the data set
The camping trips data set contains the following variables:
FISH_COUNT: The number of fish that were caught. This will be our dependent variable y.
LIVE_BAIT: A binary variable indicating whether live bait was used.
CAMPER: Whether the fishing group used a camper van.
PERSONS: Total number of people in the fishing group. Note that in some groups, none of them may have fished.
CHILDREN: The number of children in the camping group.
Here is a frequency distribution of the dependent FISH_COUNT variable:
As we can see, there may be excess zeroes in this data set. We’ll train a ZIP model on this data set to test this theory and hopefully achieve a better fit than the regular Poisson model.
Regression Goal
Our regression goals on this data set are as follows:
Predict the number of fish caught (FISH_COUNT) by a camping group based on the values of LIVE_BAIT, CAMPER, PERSONS and CHILDREN variables.
Regression Strategy
Our regression strategy will be as follows:
- FISH_COUNT will be the dependent variable y, and [LIVE_BAIT, CAMPER, PERSONS and CHILDREN] will be the explanatory variables X.
- We’ll use the Python statsmodels library to train the ZIP regression model on the (y, X) data set.
- We’ll make some predictions using the ZIP model on a test data set that the model has not seen during its training.
Let’s begin by importing all the required packages:
import pandas as pd
from patsy import dmatrices
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
Next, we’ll load the fish data set into memory. Here is the link to the data set:
df = pd.read_csv('fish.csv', header=0)
Let’s print the top few rows of the data set:
print(df.head(10))
Let’s also print out the frequency distribution of FISH_COUNT values:
df.groupby('FISH_COUNT').count()
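Since groupby(…).count() tallies every column, a slightly more direct way to see just how often each FISH_COUNT value occurs (an optional alternative) is:
print(df['FISH_COUNT'].value_counts().sort_index())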
Create the training and test data sets. Note that for now, we are not doing a stratified random split:
mask = np.random.rand(len(df)) < 0.8
df_train = df[mask]
df_test = df[~mask]
print('Training data set length='+str(len(df_train)))
print('Testing data set length='+str(len(df_test)))
>> Training data set length=196
>> Testing data set length=54
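Because np.random.rand drives the split, the exact training and testing lengths will vary a little from run to run. If you want a reproducible split, you can seed NumPy before building the mask (the seed value 42 below is arbitrary):
np.random.seed(42)   # any fixed seed makes the 80/20 split reproducible
mask = np.random.rand(len(df)) < 0.8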
Setup the regression expression in Patsy notation. We are telling Patsy that FISH_COUNT is our dependent variable y and it depends on the regression variables LIVE_BAIT, CAMPER, PERSONS and CHILDREN:
expr = 'FISH_COUNT ~ LIVE_BAIT + CAMPER + CHILDREN + PERSONS'
Let’s use Patsy to carve out the X and y matrices for the training and testing data sets.
y_train, X_train = dmatrices(expr, df_train, return_type='dataframe')
y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')
Using statsmodels’ ZeroInflatedPoisson class, let’s build and train a ZIP regression model on the training data set.
But before we do so, let me explain how to use two parameters that the class constructor takes:
- inflation: The ZeroInflatedPoisson model class will internally use a Logistic Regression model to estimate the parameter ϕ. Hence we set the model parameter inflation to ’logit’. We can also experiment with setting it to other Binomial link functions such as ‘probit’.
- exog_infl: We also want to ask the ZIP model to estimate ϕ as a function of the same set of regression variables as the parent model, namely: LIVE_BAIT, CAMPER, PERSONS and CHILDREN. Hence we set the parameter exog_infl to X_train. If you want to use only a subset of X_train, you can do so, or you can set exog_infl to an entirely different set of regression variables.
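For example, if you wanted only PERSONS and CHILDREN (plus the intercept) to drive the zero-inflation part, you could pass just those columns of the Patsy design matrix. This is only a sketch of that option; the particular subset chosen here is an illustration:
# 'Intercept' is the name Patsy gives to the constant column in X_train
X_infl = X_train[['Intercept', 'PERSONS', 'CHILDREN']]
zip_model_subset = sm.ZeroInflatedPoisson(endog=y_train, exog=X_train, exog_infl=X_infl, inflation='logit')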
The below line builds and trains the ZIP model on our training data set in a single line of code.
zip_training_results = sm.ZeroInflatedPoisson(endog=y_train, exog=X_train, exog_infl=X_train, inflation='logit').fit()
Print the training summary:
print(zip_training_results.summary())
Here is the training summary (I have highlighted the important elements in the output):
Interpreting the training output
The blue box contains information about variables that the nested Logistic Regression model has used to estimate the probability ϕ of whether or not any fish were caught by a camping group.
Notice that the Logistic regression model did not find Intercept, LIVE_BAIT and CAMPER variables useful for estimating ϕ. Their regression coefficients were found to be NOT statistically significant at the 95% confidence level, as indicated by the respective p-values: inflate_Intercept=0.425, inflate_LIVE_BAIT=0.680 and inflate_CAMPER=0.240, which are all greater than 0.05 (i.e. the 5% error threshold).
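If you prefer to read these numbers off the results object rather than the printed summary, the fitted results expose the coefficients and p-values directly:
print(zip_training_results.params)   # fitted coefficients, including the inflate_* ones
print(zip_training_results.pvalues)  # the corresponding p-values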
Observation 1
The only two variables that the Logistic Regression model determined as useful for estimating the probability of whether or not any fish were caught were CHILDREN and PERSONS.
Observation 2
The regression coefficient of PERSONS is negative (inflate_PERSONS = -1.2193), which means that as the number of people in the camping group increases, the probability of no fish being caught by that group decreases. This is in line with our intuition.
The red box contains information about variables that the parent Poisson model used to estimate FISH_COUNT on the condition that FISH_COUNT > 0.
Observation 3
We can see that the coefficients for all 5 regression variables are statistically significant at a 99% confidence level, as evidenced by their p-values, which are all less than 0.01. In fact, the p-values are less than 0.001 for all 5 variables, which is why they show up as 0.000.
Observation 4
The coefficient for CHILDREN is negative (CHILDREN -1.0810), meaning that as the number of children in the camping group goes up, the number of fish caught by that group goes down!
Observation 5
The Maximized Log-Likelihood of this model is -566.43. This value is useful for comparing the goodness-of-fit of the model with that of other models (See fun exercise at the end of the article).
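For instance, you could fit a plain Poisson model on the same training data and compare the two maximized log-likelihoods (a quick sketch; all else being equal, the higher, i.e. less negative, value indicates the better fit):
poisson_training_results = sm.Poisson(endog=y_train, exog=X_train).fit()
print('Poisson LL='+str(poisson_training_results.llf))
print('ZIP LL='+str(zip_training_results.llf))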
Observation 6
Finally, note that the training algorithm of the ZIP model was not able to converge on the training data set as indicated by the following:
If it had converged, perhaps it would have resulted in a better fit.
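One thing worth trying in such cases (it may or may not resolve the warning on this particular data set) is to give the optimizer more iterations, or a different optimization method, via the fit() call:
zip_training_results = sm.ZeroInflatedPoisson(endog=y_train, exog=X_train, exog_infl=X_train, inflation='logit').fit(method='bfgs', maxiter=1000)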
Prediction
We’ll get the ZIP model’s predictions on the test data set and calculate the root mean square error w.r.t. the actual values:
# Get the ZIP model's predictions for the test data set
zip_predictions = zip_training_results.predict(X_test, exog_infl=X_test)
# Round the predicted rates to whole counts
predicted_counts = np.round(zip_predictions)
# The actual observed counts of the dependent variable FISH_COUNT
actual_counts = y_test['FISH_COUNT']
print('ZIP RMSE='+str(np.sqrt(np.sum(np.power(np.subtract(predicted_counts,actual_counts),2)))))
>> ZIP RMSE=55.65069631190611
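One caveat about this number: the expression above takes the square root of the sum of the squared errors. The more conventional definition of RMSE averages the squared errors before taking the root, which you can get by swapping np.sum for np.mean:
rmse = np.sqrt(np.mean(np.power(np.subtract(predicted_counts, actual_counts), 2)))
print('ZIP RMSE (mean of squared errors)='+str(rmse))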
Let’s plot the predicted versus actual fish counts:
fig = plt.figure()
fig.suptitle('Predicted versus actual counts using the ZIP model')
predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted')
actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual')
plt.legend(handles=[predicted, actual])
plt.show()
We see the following plot:
This completes our look at the Zero-Inflated Poisson regression model.
Thanks for reading! I write about topics in data science, with a specific focus on regression and time series analysis.
If you liked this article, please follow me at Sachin Date to receive tips, how-tos and programming advice on topics devoted to regression and time series analysis.