.. currentmodule:: microscopes

Categorical Data and the Dirichlet Discrete Distribution
========================================================
Let's consider some examples of data with categorical variables.

.. code:: python

    import pandas as pd
    import seaborn as sns
    import numpy as np
    import matplotlib.pyplot as plt

    # IPython/Jupyter magic: render plots inline (omit in a plain script)
    %matplotlib inline
    sns.set_context('talk')
    sns.set_style('darkgrid')

First, the passenger list of the Titanic:

.. code:: python

    titanic = sns.load_dataset("titanic")

.. code:: python

    titanic.head(n=10)

.. csv-table::
    :header: "", survived, pclass, sex, age, sibsp, parch, fare, embarked, class, who, adult_male, deck, embark_town, alive, alone

    0, 0, 3, male, 22, 1, 0, 7.2500, S, Third, man, True, NaN, Southampton, no, False
    1, 1, 1, female, 38, 1, 0, 71.2833, C, First, woman, False, C, Cherbourg, yes, False
    2, 1, 3, female, 26, 0, 0, 7.9250, S, Third, woman, False, NaN, Southampton, yes, True
    3, 1, 1, female, 35, 1, 0, 53.1000, S, First, woman, False, C, Southampton, yes, False
    4, 0, 3, male, 35, 0, 0, 8.0500, S, Third, man, True, NaN, Southampton, no, True
    5, 0, 3, male, NaN, 0, 0, 8.4583, Q, Third, man, True, NaN, Queenstown, no, True
    6, 0, 1, male, 54, 0, 0, 51.8625, S, First, man, True, E, Southampton, no, True
    7, 0, 3, male, 2, 3, 1, 21.0750, S, Third, child, False, NaN, Southampton, no, False
    8, 1, 3, female, 27, 0, 2, 11.1333, S, Third, woman, False, NaN, Southampton, yes, False
    9, 1, 2, female, 14, 1, 0, 30.0708, C, Second, child, False, NaN, Cherbourg, yes, False

One of the categorical variables in this dataset is ``embark_town``.

Let's plot the number of passengers departing from each town:

.. code:: python

    # size() counts every row per town; counting 'age' would skip
    # passengers whose age is missing
    ax = titanic.groupby('embark_town').size().plot(kind='bar')
    plt.xticks(rotation=0)
    plt.xlabel('Departure Town')
    plt.ylabel('Passengers')
    plt.title('Number of Passengers by Town of Departure')

.. image:: dirichlet-discrete_files/dirichlet-discrete_6_1.png

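
When counting rows per category, keep in mind that ``count()`` on a column skips missing values, whereas ``size()`` counts every row. This matters here because ``age`` contains ``NaN`` entries. The sketch below illustrates the difference on a small made-up frame; ``toy`` and its values are hypothetical and are not taken from the Titanic data.

.. code:: python

    import numpy as np
    import pandas as pd

    # Hypothetical toy data, for illustration only
    toy = pd.DataFrame({
        'embark_town': ['Southampton', 'Southampton', 'Cherbourg', 'Queenstown'],
        'age': [22.0, np.nan, 38.0, np.nan],
    })

    # Number of non-missing ages per town (rows with NaN age are skipped)
    non_missing_ages = toy.groupby('embark_town')['age'].count()

    # Number of rows per town, regardless of missing values
    rows_per_town = toy.groupby('embark_town').size()

In this toy frame, Southampton has two rows but only one recorded age, so the two counts disagree.
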
Let's look at another example: the cars93 dataset.

.. code:: python

    cars = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/MASS/Cars93.csv', index_col=0)

.. code:: python

    cars.head()

.. csv-table::
    :header: "", Manufacturer, Model, Type, Min.Price, Price, Max.Price, MPG.city, MPG.highway, AirBags, DriveTrain, ..., Passengers, Length, Wheelbase, Width, Turn.circle, Rear.seat.room, Luggage.room, Weight, Origin, Make

    1, Acura, Integra, Small, 12.9, 15.9, 18.8, 25, 31, None, Front, ..., 5, 177, 102, 68, 37, 26.5, 11, 2705, non-USA, Acura Integra
    2, Acura, Legend, Midsize, 29.2, 33.9, 38.7, 18, 25, Driver & Passenger, Front, ..., 5, 195, 115, 71, 38, 30.0, 15, 3560, non-USA, Acura Legend
    3, Audi, 90, Compact, 25.9, 29.1, 32.3, 20, 26, Driver only, Front, ..., 5, 180, 102, 67, 37, 28.0, 14, 3375, non-USA, Audi 90
    4, Audi, 100, Midsize, 30.8, 37.7, 44.6, 19, 26, Driver & Passenger, Front, ..., 6, 193, 106, 70, 37, 31.0, 17, 3405, non-USA, Audi 100
    5, BMW, 535i, Midsize, 23.7, 30.0, 36.2, 22, 30, Driver only, Rear, ..., 4, 186, 109, 69, 39, 27.0, 13, 3640, non-USA, BMW 535i

5 rows × 27 columns

.. code:: python

    # .loc selects by index label; .ix has been removed from recent pandas versions
    cars.loc[1]

.. parsed-literal::

    Manufacturer                  Acura
    Model                       Integra
    Type                          Small
    Min.Price                      12.9
    Price                          15.9
    Max.Price                      18.8
    MPG.city                         25
    MPG.highway                      31
    AirBags                        None
    DriveTrain                    Front
    Cylinders                         4
    EngineSize                      1.8
    Horsepower                      140
    RPM                            6300
    Rev.per.mile                   2890
    Man.trans.avail                 Yes
    Fuel.tank.capacity             13.2
    Passengers                        5
    Length                          177
    Wheelbase                       102
    Width                            68
    Turn.circle                      37
    Rear.seat.room                 26.5
    Luggage.room                     11
    Weight                         2705
    Origin                      non-USA
    Make                  Acura Integra
    Name: 1, dtype: object

This dataset has multiple categorical variables.

Based on the description of the cars93 dataset, we'll consider
``Manufacturer`` and ``DriveTrain`` to be categorical variables.

Let's plot ``Manufacturer`` and ``DriveTrain``:

.. code:: python

    cars.groupby('Manufacturer')['Model'].count().plot(kind='bar')
    plt.ylabel('Cars')
    plt.title('Number of Cars by Manufacturer')

.. image:: dirichlet-discrete_files/dirichlet-discrete_12_1.png

.. code:: python

    cars.groupby('DriveTrain')['Model'].count().plot(kind='bar')
    plt.ylabel('Cars')
    plt.title('Number of Cars by Drive Train')

.. image:: dirichlet-discrete_files/dirichlet-discrete_13_1.png

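If our categorical data has labels, we need to convert them to integer
ids:

.. code:: python

    def col_2_ids(df, col):
        """Replace the label column ``col`` with an integer column ``<col>_ids``."""
        # Sorted unique labels; each label's position becomes its integer id
        ids = df[col].drop_duplicates().sort_values().reset_index(drop=True)
        ids.index.name = '%s_ids' % col
        ids = ids.reset_index()
        # Attach the ids to every row, then drop the original label column
        df = pd.merge(df, ids, how='left', on=col)
        del df[col]
        return df
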
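
As a cross-check on the label-to-id conversion, pandas also provides ``pd.factorize``, which with ``sort=True`` assigns integer codes in sorted label order. The sketch below is an alternative illustration rather than the method used in this page; ``toy`` is a hypothetical frame and is not part of cars93.

.. code:: python

    import pandas as pd

    # Hypothetical toy data, for illustration only
    toy = pd.DataFrame({'DriveTrain': ['Front', 'Rear', 'Front', '4WD']})

    # sort=True assigns codes in sorted label order:
    # '4WD' -> 0, 'Front' -> 1, 'Rear' -> 2
    codes, labels = pd.factorize(toy['DriveTrain'], sort=True)
    toy['DriveTrain_ids'] = codes

Here ``labels`` keeps the original strings, so an id can be mapped back to its label with ``labels[i]``.
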
.. code:: python

    cat_columns = ['Manufacturer', 'DriveTrain']

    for c in cat_columns:
        print(c)
        cars = col_2_ids(cars, c)

.. parsed-literal::

    Manufacturer
    DriveTrain

.. code:: python

    cars[['%s_ids' % c for c in cat_columns]].head()

.. csv-table::
    :header: "", Manufacturer_ids, DriveTrain_ids

    0, 0, 1
    1, 0, 1
    2, 1, 1
    3, 1, 1
    4, 2, 2

Just as we model binary data with the beta Bernoulli distribution, we
can model categorical data with the Dirichlet discrete distribution.

The beta Bernoulli distribution allows us to learn the underlying
probability, :math:`\theta`, of the binary random variable, :math:`x`:

.. math:: P(x=1) = \theta

.. math:: P(x=0) = 1-\theta

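
To make the learning step concrete, here is a minimal sketch of the conjugate update for the beta Bernoulli model, written with NumPy only. The observations ``x`` and the uniform ``Beta(1, 1)`` prior are assumptions chosen for illustration; neither comes from the datasets above or from the ``microscopes`` API.

.. code:: python

    import numpy as np

    # Hypothetical binary observations, for illustration only
    x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

    # Beta(a, b) prior on theta; a = b = 1 is the uniform prior (an assumed choice)
    a, b = 1.0, 1.0

    # Conjugacy: the posterior is Beta(a + number of ones, b + number of zeros)
    a_post = a + x.sum()
    b_post = b + len(x) - x.sum()

    # Posterior mean of theta, which is also the predictive P(x_new = 1)
    theta_mean = a_post / (a_post + b_post)

With six ones and two zeros, the posterior is ``Beta(7, 3)`` and the posterior mean of :math:`\theta` is 0.7.
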
The Dirichlet discrete distribution extends the beta Bernoulli
distribution to the case in which :math:`x` can assume more than two
states:

.. math:: \forall i \in \{0, 1, \ldots, n\} \hspace{2mm} P(x = i) = \theta_i

.. math:: \sum_{i=0}^n \theta_i = 1

Again, the Dirichlet discrete distribution takes advantage of the fact that the
Dirichlet distribution and the discrete distribution are conjugate. Note
that the discrete distribution is sometimes called the categorical
distribution or, more loosely, the multinomial distribution.

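
Conjugacy means that a Dirichlet prior over the state probabilities remains a Dirichlet after observing data: each prior parameter is simply incremented by the number of times its state was observed. The sketch below shows this update with NumPy only. The observations ``x`` and the symmetric prior ``alpha`` are assumptions chosen for illustration and do not use the ``microscopes`` API.

.. code:: python

    import numpy as np

    # Hypothetical observations of a variable with three states (0, 1, 2)
    x = np.array([0, 2, 1, 2, 2, 0, 2])
    n_states = 3

    # Symmetric Dirichlet prior; alpha = 1 for every state is an assumed choice
    alpha = np.ones(n_states)

    # Number of times each state was observed
    counts = np.bincount(x, minlength=n_states)

    # Conjugacy: the posterior is Dirichlet(alpha + counts)
    alpha_post = alpha + counts

    # Posterior mean of each theta_i, which is also the predictive P(x_new = i)
    theta_mean = alpha_post / alpha_post.sum()

For these observations the counts are ``[2, 1, 4]``, so the posterior is ``Dirichlet(3, 2, 5)`` with posterior means ``[0.3, 0.2, 0.5]``.
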
To import the Dirichlet discrete distribution, call:

.. code:: python

    from microscopes.models import dd as dirichlet_discrete

Then, depending on the specific model we want, we import its definition with
``from microscopes.model_name.definition import model_definition``:

.. code:: python

    from microscopes.irm.definition import model_definition as irm_definition
    from microscopes.mixture.definition import model_definition as mm_definition

See ``Defining Your Model`` to learn more about model definitions.