.. _gamma_poisson: Count Data and Ordinal Data with the Gamma-Poisson Distribution =============================================================== Typically, we model count data, or integer valued data, with the gamma-Poisson distribution Recall that the Poisson distribution is a distribution over integer values parameterized by :math:`\lambda`. One interpretation behind :math:`\lambda` is that it parameterizes the rate at which events occur with a fixed interval, assuming these events occur independently. The gamma distribution is conjugate to the Poisson distribution, so the gamma-Poisson distribution allows us to learn both the distribution over counts and the rate parameter :math:`\lambda`. Let's set up our environment and consider some examples of count data .. code:: python import pandas as pd import seaborn as sns import numpy as np import matplotlib.pyplot as plt sns.set_context('talk') import csv import urllib2 import StringIO %matplotlib inline `Children Ever Born `__ is a dataset of birthrates in Fiji from the *World Fertility Survey* with the following columns: - ``dur``: marriage duration - ``res``: residence, - ``educ``: level of education, - ``mean``: mean number of born, - ``var``: variance of children born - ``y``: number of women Ordinal columns ``dur``, ``res``, and ``educ`` are shown as text in the following dataset .. code:: python ceb = pd.read_csv('http://data.princeton.edu/wws509/datasets/ceb.dat', sep='\s+') ceb.head() .. raw:: html
dur res educ mean var n y
1 0-4 Suva none 0.50 1.14 8 4.00
2 0-4 Suva lower 1.14 0.73 21 23.94
3 0-4 Suva upper 0.90 0.67 42 37.80
4 0-4 Suva sec+ 0.73 0.48 51 37.23
5 0-4 urban none 1.17 1.06 12 14.04
With the these columns encoded, we can now represent them as integers ``dur`` and ``educ`` are ordinal columns. Additionally, number of women, ``n``, is integer valued. .. code:: python ceb_int = pd.read_csv('http://data.princeton.edu/wws509/datasets/ceb.raw', sep='\s+', names = ['index'] + list(ceb.columns[:-1]), index_col=0) ceb_int.head() .. raw:: html
dur res educ mean var n
index
1 1 1 1 0.50 1.14 8
2 1 1 2 1.14 0.73 21
3 1 1 3 0.90 0.67 42
4 1 1 4 0.73 0.48 51
5 1 2 1 1.17 1.06 12
We can map these orderings of ``dur`` and ``educ`` to produce a crosstab heatmap of ``n``, numbe of women .. code:: python plt.figure(figsize=(9,6)) ct = pd.crosstab(ceb_int['dur'], ceb_int['educ'], values=ceb_int['n'], aggfunc= np.sum).sort_index(ascending = False) sns.heatmap(ct, annot = True) plt.yticks(ceb_int['dur'].drop_duplicates().values - .5, ceb['dur'].drop_duplicates().values) plt.xticks(ceb_int['educ'].drop_duplicates().values - .5, ceb['educ'].drop_duplicates().values) plt.ylabel('duration of marriage (years)') plt.xlabel('level of education') plt.title('heatmap of marriage duration by level of education') .. parsed-literal:: .. image:: gamma_poisson_files/gamma_poisson_7_1.png Since ``dur`` and ``education`` are ordinal valued, the columns assume a small number of integer values Additionally, the `caffeine dataset `__ below measures caffeine intake and performance on a 10 question quiz. The variables are: - ``coffee``: coffee intake (1 = 0 cups, 2 = 2 cups, 3 = 4 cups) - ``perf``: quiz score - ``numprob``: problems attempted - ``accur``: accuracy .. code:: python response = urllib2.urlopen('http://stanford.edu/class/psych252/_downloads/caffeine.csv') html = response.read() caf = pd.read_csv(StringIO.StringIO(html[:-16])) caf.head() .. raw:: html
coffee perf accur numprob
0 1 53 0.449877 7
1 1 9 0.499534 6
2 1 31 0.498590 6
3 1 38 0.454312 7
4 2 40 0.421212 8
Based on the characteristics of each column, ``coffee`` and ``numprob`` easily fit into the category of count data appropriate to a gamma-Poisson distribution .. code:: python caf.describe() .. raw:: html
coffee perf accur numprob
count 60.000000 60.000000 60.000000 60.000000
mean 2.000000 42.366667 0.510854 7.950000
std 0.823387 18.350603 0.107704 1.185005
min 1.000000 5.000000 0.240238 6.000000
25% 1.000000 31.000000 0.425859 7.000000
50% 2.000000 40.000000 0.509806 8.000000
75% 3.000000 53.500000 0.594445 9.000000
max 3.000000 89.000000 0.748692 10.000000
Note that while integer valued data with high values is sometimes modeled with a gamma-Poisson ditribution, remember that the gamma-Poisson distribution has equal mean and variance :math:`\lambda`: .. math:: E(X) = Var(X) = \lambda If you want to be more flexible with this assumption, you may want to consider using a normal inverse-chisquare or a normal inverse-Wishart distribution depending on your data To import the gamma-poisson likelihood, call: .. code:: python from microscopes.models import gp as gamma_poisson