Categorical Data and the Dirichlet Discrete Distribution

Let’s consider some examples of data with categorical variables.

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_context('talk')
sns.set_style('darkgrid')


First, the passenger list of the Titanic

titanic = sns.load_dataset("titanic")

titanic.head(n=10)

survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35 0 0 8.0500 S Third man True NaN Southampton no True
5 0 3 male NaN 0 0 8.4583 Q Third man True NaN Queenstown no True
6 0 1 male 54 0 0 51.8625 S First man True E Southampton no True
7 0 3 male 2 3 1 21.0750 S Third child False NaN Southampton no False
8 1 3 female 27 0 2 11.1333 S Third woman False NaN Southampton yes False
9 1 2 female 14 1 0 30.0708 C Second child False NaN Cherbourg yes False

One of the categorical variables in this dataset is embark_town

Let’s plot the number of passengers departing from each town

# size() counts every passenger in each group; count() on 'age' would skip rows where age is missing
ax = titanic.groupby('embark_town').size().plot(kind='bar')
plt.xticks(rotation=0)
plt.xlabel('Departure Town')
plt.ylabel('Passengers')
plt.title('Number of Passengers by Town of Departure')



Let’s look at another example: the cars93 dataset

cars = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/MASS/Cars93.csv', index_col=0)

cars.head()

Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway AirBags DriveTrain ... Passengers Length Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
1 Acura Integra Small 12.9 15.9 18.8 25 31 None Front ... 5 177 102 68 37 26.5 11 2705 non-USA Acura Integra
2 Acura Legend Midsize 29.2 33.9 38.7 18 25 Driver & Passenger Front ... 5 195 115 71 38 30.0 15 3560 non-USA Acura Legend
3 Audi 90 Compact 25.9 29.1 32.3 20 26 Driver only Front ... 5 180 102 67 37 28.0 14 3375 non-USA Audi 90
4 Audi 100 Midsize 30.8 37.7 44.6 19 26 Driver & Passenger Front ... 6 193 106 70 37 31.0 17 3405 non-USA Audi 100
5 BMW 535i Midsize 23.7 30.0 36.2 22 30 Driver only Rear ... 4 186 109 69 39 27.0 13 3640 non-USA BMW 535i

5 rows × 27 columns

cars.loc[1]

Manufacturer                  Acura
Model                       Integra
Type                          Small
Min.Price                      12.9
Price                          15.9
Max.Price                      18.8
MPG.city                         25
MPG.highway                      31
AirBags                        None
DriveTrain                    Front
Cylinders                         4
EngineSize                      1.8
Horsepower                      140
RPM                            6300
Rev.per.mile                   2890
Man.trans.avail                 Yes
Fuel.tank.capacity             13.2
Passengers                        5
Length                          177
Wheelbase                       102
Width                            68
Turn.circle                      37
Rear.seat.room                 26.5
Luggage.room                     11
Weight                         2705
Origin                      non-USA
Make                  Acura Integra
Name: 1, dtype: object


This dataset has multiple categorical variables

Based on the description of the Cars93 dataset, we’ll treat Manufacturer and DriveTrain as categorical variables

Let’s plot Manufacturer and DriveTrain

cars.groupby('Manufacturer')['Model'].count().plot(kind='bar')
plt.ylabel('Cars')
plt.title('Number of Cars by Manufacturer')


cars.groupby('DriveTrain')['Model'].count().plot(kind='bar')
plt.ylabel('Cars')
plt.title('Number of Cars by Drive Train')



If our categorical data has labels, we need to convert them to integer IDs

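def col_2_ids(df, col):
    # Build a lookup table that maps each distinct label (in sorted order) to an integer id
    ids = df[col].drop_duplicates().sort_values().reset_index(drop=True)
    ids.index.name = '%s_ids' % col
    ids = ids.reset_index()
    # Attach the id column by joining on the label column, then drop the labels
    df = pd.merge(df, ids, how='left', on=col)
    del df[col]
    return df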

cat_columns = ['Manufacturer', 'DriveTrain']

for c in cat_columns:
    print(c)
    cars = col_2_ids(cars, c)

Manufacturer
DriveTrain

cars[['%s_ids' % c for c in cat_columns]].head()

Manufacturer_ids DriveTrain_ids
0 0 1
1 0 1
2 1 1
3 1 1
4 2 2
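As an alternative to the merge-based helper above, pandas can produce the same kind of integer codes through its categorical dtype. The snippet below is a minimal sketch on a small made-up frame; the `toy` data and the `labels_to_ids` helper are illustrative names, not part of the original notebook. Categories are sorted, so the codes follow the sorted label order, as they do in col_2_ids.

```python
import pandas as pd

def labels_to_ids(df, col):
    """Return a copy of df with `col` replaced by an integer `<col>_ids` column."""
    out = df.copy()
    # astype('category') sorts the distinct labels; .cat.codes numbers them 0..k-1
    out['%s_ids' % col] = out[col].astype('category').cat.codes
    return out.drop(columns=[col])

# Hypothetical toy data standing in for the cars table
toy = pd.DataFrame({'DriveTrain': ['Front', 'Rear', '4WD', 'Front']})
toy_ids = labels_to_ids(toy, 'DriveTrain')
print(toy_ids['DriveTrain_ids'].tolist())
```

One difference to keep in mind: with this approach a missing label receives the code -1, so missing values should be handled before the ids are passed to a model.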

Just as we model binary data with the beta Bernoulli distribution, we can model categorical data with the Dirichlet discrete distribution

The beta Bernoulli distribution allows us to learn the underlying probability, $\theta$, of the binary random variable, $x$

$P(x=1) =\theta$
$P(x=0) = 1-\theta$
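
To make "learning" $\theta$ concrete, here is the standard conjugate update, written under the assumption of a $\text{Beta}(a, b)$ prior. The symbols $a$, $b$, $h$ (the number of observed ones), and $t$ (the number of observed zeros) are introduced here for illustration only.

$\theta \sim \text{Beta}(a, b)$
$\theta \mid x_1, \ldots, x_{h+t} \sim \text{Beta}(a + h, b + t)$
$E[\theta \mid x_1, \ldots, x_{h+t}] = \frac{a + h}{a + b + h + t}$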

The Dirichlet discrete distribution extends the beta Bernoulli distribution to the case in which $x$ can assume more than two states

$\forall i \in \{0, 1, \ldots, n\} \hspace{2mm} P(x = i) = \theta_i$
$\sum_{i=0}^n \theta_i = 1$

Again, the Dirichlet discrete distribution takes advantage of the fact that the Dirichlet distribution and the discrete distribution are conjugate. Note that the discrete distribution is sometimes called the categorical distribution or, when only a single trial is drawn, the multinomial distribution.
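
Conjugacy means that the posterior over the $\theta_i$ is again a Dirichlet distribution whose parameters are the prior parameters plus the observed counts. The sketch below illustrates this with plain NumPy; it does not use the microscopes API, and the helper `dirichlet_posterior`, the prior, and the observations are all made up for the example.

```python
import numpy as np

def dirichlet_posterior(alpha, observations):
    """Posterior Dirichlet parameters after observing integer category ids."""
    alpha = np.asarray(alpha, dtype=float)
    # Count how often each of the len(alpha) categories was observed
    counts = np.bincount(observations, minlength=len(alpha))
    # Conjugacy: the posterior is Dirichlet(alpha + counts)
    return alpha + counts

# Symmetric prior over three categories (an illustrative choice)
alpha_prior = [1.0, 1.0, 1.0]
# Hypothetical observed ids, e.g. values from a DriveTrain_ids column
obs = np.array([1, 1, 2, 1, 0, 1])

alpha_post = dirichlet_posterior(alpha_prior, obs)
# Posterior mean estimate of each theta_i
theta_mean = alpha_post / alpha_post.sum()
print(alpha_post, theta_mean)
```

The posterior mean for category $i$ is its prior parameter plus its count, divided by the sum of all posterior parameters, so a larger prior parameter pulls the estimate more strongly toward the prior.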

To import the Dirichlet discrete distribution, call:

from microscopes.models import dd as dirichlet_discrete


Then, depending on the model we want, we import its model definition, where model_name stands for the model’s package (for example, irm or mixture):

from microscopes.model_name.definition import model_definition

from microscopes.irm.definition import model_definition as irm_definition
from microscopes.mixture.definition import model_definition as mm_definition


See Defining Your Model to learn more about model definitions.