.. currentmodule:: microscopes

Categorical Data and the Dirichlet Discrete Distribution
========================================================

Let's consider some examples of data with categorical variables.

.. code:: python

    import pandas as pd
    import seaborn as sns
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    sns.set_context('talk')
    sns.set_style('darkgrid')

First, the passenger list of the Titanic:

.. code:: python

    titanic = sns.load_dataset("titanic")

.. code:: python

    titanic.head(n=10)

.. csv-table::
    :header: "", "survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embarked", "class", "who", "adult_male", "deck", "embark_town", "alive", "alone"

    0, 0, 3, male, 22, 1, 0, 7.2500, S, Third, man, True, NaN, Southampton, no, False
    1, 1, 1, female, 38, 1, 0, 71.2833, C, First, woman, False, C, Cherbourg, yes, False
    2, 1, 3, female, 26, 0, 0, 7.9250, S, Third, woman, False, NaN, Southampton, yes, True
    3, 1, 1, female, 35, 1, 0, 53.1000, S, First, woman, False, C, Southampton, yes, False
    4, 0, 3, male, 35, 0, 0, 8.0500, S, Third, man, True, NaN, Southampton, no, True
    5, 0, 3, male, NaN, 0, 0, 8.4583, Q, Third, man, True, NaN, Queenstown, no, True
    6, 0, 1, male, 54, 0, 0, 51.8625, S, First, man, True, E, Southampton, no, True
    7, 0, 3, male, 2, 3, 1, 21.0750, S, Third, child, False, NaN, Southampton, no, False
    8, 1, 3, female, 27, 0, 2, 11.1333, S, Third, woman, False, NaN, Southampton, yes, False
    9, 1, 2, female, 14, 1, 0, 30.0708, C, Second, child, False, NaN, Cherbourg, yes, False

One of the categorical variables in this dataset is ``embark_town``.

Let's plot the number of passengers departing from each town.

.. code:: python

    # count() tallies non-missing ages, so passengers without a recorded age are left out;
    # use .size() instead to count every passenger in each town
    ax = titanic.groupby(['embark_town'])['age'].count().plot(kind='bar')
    plt.xticks(rotation=0)
    plt.xlabel('Departure Town')
    plt.ylabel('Passengers')
    plt.title('Number of Passengers by Town of Departure')

.. image:: dirichlet-discrete_files/dirichlet-discrete_6_1.png

Let's look at another example: the cars93 dataset.

.. code:: python

    cars = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/MASS/Cars93.csv', index_col=0)

.. code:: python

    cars.head()

.. csv-table::
    :header: "", "Manufacturer", "Model", "Type", "Min.Price", "Price", "Max.Price", "MPG.city", "MPG.highway", "AirBags", "DriveTrain", "...", "Passengers", "Length", "Wheelbase", "Width", "Turn.circle", "Rear.seat.room", "Luggage.room", "Weight", "Origin", "Make"

    1, Acura, Integra, Small, 12.9, 15.9, 18.8, 25, 31, None, Front, ..., 5, 177, 102, 68, 37, 26.5, 11, 2705, non-USA, Acura Integra
    2, Acura, Legend, Midsize, 29.2, 33.9, 38.7, 18, 25, Driver & Passenger, Front, ..., 5, 195, 115, 71, 38, 30.0, 15, 3560, non-USA, Acura Legend
    3, Audi, 90, Compact, 25.9, 29.1, 32.3, 20, 26, Driver only, Front, ..., 5, 180, 102, 67, 37, 28.0, 14, 3375, non-USA, Audi 90
    4, Audi, 100, Midsize, 30.8, 37.7, 44.6, 19, 26, Driver & Passenger, Front, ..., 6, 193, 106, 70, 37, 31.0, 17, 3405, non-USA, Audi 100
    5, BMW, 535i, Midsize, 23.7, 30.0, 36.2, 22, 30, Driver only, Rear, ..., 4, 186, 109, 69, 39, 27.0, 13, 3640, non-USA, BMW 535i

5 rows × 27 columns

.. code:: python

    # .ix has been removed from pandas; .loc selects the row labelled 1
    cars.loc[1]

.. parsed-literal::

    Manufacturer          Acura
    Model                 Integra
    Type                  Small
    Min.Price             12.9
    Price                 15.9
    Max.Price             18.8
    MPG.city              25
    MPG.highway           31
    AirBags               None
    DriveTrain            Front
    Cylinders             4
    EngineSize            1.8
    Horsepower            140
    RPM                   6300
    Rev.per.mile          2890
    Man.trans.avail       Yes
    Fuel.tank.capacity    13.2
    Passengers            5
    Length                177
    Wheelbase             102
    Width                 68
    Turn.circle           37
    Rear.seat.room        26.5
    Luggage.room          11
    Weight                2705
    Origin                non-USA
    Make                  Acura Integra
    Name: 1, dtype: object

This dataset has multiple categorical variables.

Based on the description of the cars93 dataset, we'll consider ``Manufacturer`` and ``DriveTrain`` to be categorical variables.

Let's plot ``Manufacturer`` and ``DriveTrain``.

.. code:: python

    cars.groupby('Manufacturer')['Model'].count().plot(kind='bar')
    plt.ylabel('Cars')
    plt.title('Number of Cars by Manufacturer')

.. image:: dirichlet-discrete_files/dirichlet-discrete_12_1.png

.. code:: python

    cars.groupby('DriveTrain')['Model'].count().plot(kind='bar')
    plt.ylabel('Cars')
    plt.title('Number of Cars by Drive Train')

.. image:: dirichlet-discrete_files/dirichlet-discrete_13_1.png

If our categorical data has labels, we need to convert them to integer ids.

.. code:: python

    def col_2_ids(df, col):
        # Sorted unique labels; their position in this list becomes the integer id
        ids = df[col].drop_duplicates().sort_values().reset_index(drop=True)
        ids.index.name = '%s_ids' % col
        ids = ids.reset_index()
        # Attach the id column by joining on the shared label column, then drop the labels
        df = pd.merge(df, ids, how='left')
        del df[col]
        return df

.. code:: python

    cat_columns = ['Manufacturer', 'DriveTrain']

    for c in cat_columns:
        print(c)
        cars = col_2_ids(cars, c)

.. parsed-literal::

    Manufacturer
    DriveTrain

.. code:: python

    cars[['%s_ids' % c for c in cat_columns]].head()

.. csv-table::
    :header: "", "Manufacturer_ids", "DriveTrain_ids"

    0, 0, 1
    1, 0, 1
    2, 1, 1
    3, 1, 1
    4, 2, 2

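As a point of comparison, the same label-to-id conversion can be sketched with pandas' categorical dtype. This is only an illustrative alternative, not part of the original notebook: the helper name ``add_id_columns`` is hypothetical, and the small frame below reuses only the first five ``Manufacturer`` and ``DriveTrain`` values shown earlier. Because only those five rows are encoded, ``Front`` maps to 0 here rather than to 1 as in the full dataset.

.. code:: python

    import pandas as pd

    # First five rows of the two categorical columns, copied from the cars.head() output
    sample = pd.DataFrame({
        'Manufacturer': ['Acura', 'Acura', 'Audi', 'Audi', 'BMW'],
        'DriveTrain': ['Front', 'Front', 'Front', 'Front', 'Rear'],
    })

    def add_id_columns(df, columns):
        """Hypothetical helper: replace each label column with a '<col>_ids' integer column."""
        out = df.copy()
        for col in columns:
            # Categories are sorted, so codes follow the sorted order of the labels
            out['%s_ids' % col] = out[col].astype('category').cat.codes
            del out[col]
        return out

    encoded = add_id_columns(sample, ['Manufacturer', 'DriveTrain'])
    print(encoded)

Note that ``cat.codes`` assigns -1 to missing values, so columns with missing labels need to be handled before the ids are used.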
Just as we model binary data with the beta Bernoulli distribution, we can model categorical data with the Dirichlet discrete distribution.

The beta Bernoulli distribution allows us to learn the underlying probability, :math:`\theta`, of the binary random variable, :math:`x`:

.. math:: P(x=1) = \theta

.. math:: P(x=0) = 1-\theta

The Dirichlet discrete distribution extends the beta Bernoulli distribution to the case in which :math:`x` can assume more than two states:

.. math:: \forall i \in \{0, 1, \ldots, n\} \hspace{2mm} P(x = i) = \theta_i

.. math:: \sum_{i=0}^n \theta_i = 1

Again, the Dirichlet discrete distribution takes advantage of the fact that the Dirichlet distribution and the discrete distribution are conjugate: a Dirichlet prior on :math:`\theta` combined with discrete observations yields a Dirichlet posterior.

Note that the discrete distribution is sometimes called the categorical distribution or the multinomial distribution (strictly, it is a multinomial distribution with a single trial).

To import the Dirichlet discrete distribution, call

.. code:: python

    from microscopes.models import dd as dirichlet_discrete

Then, depending on the specific model we want, we import its definition with ``from microscopes.model_name.definition import model_definition``:

.. code:: python

    from microscopes.irm.definition import model_definition as irm_definition
    from microscopes.mixture.definition import model_definition as mm_definition

See ``Defining Your Model`` to learn more about model definitions.
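To make the conjugacy between the Dirichlet and discrete distributions concrete, here is a minimal NumPy sketch of the posterior update. It does not use the ``microscopes`` API. The symmetric prior with every :math:`\alpha_i = 1` is an assumption chosen for illustration, and the counts are the ``embark_town`` tallies from the ten Titanic rows displayed earlier. With prior parameters :math:`\alpha_i` and observed counts :math:`c_i`, the posterior is Dirichlet with parameters :math:`\alpha_i + c_i`, and its mean is :math:`(\alpha_i + c_i) / \sum_j (\alpha_j + c_j)`.

.. code:: python

    import numpy as np

    # embark_town counts in the ten rows printed by titanic.head(n=10)
    towns = ['Cherbourg', 'Queenstown', 'Southampton']
    counts = np.array([2, 1, 7])

    # Assumed symmetric Dirichlet prior (one pseudo-count per town)
    alpha = np.ones(len(towns))

    # Conjugate update: add the observed counts to the prior parameters
    posterior_alpha = alpha + counts

    # Posterior mean estimate of theta_i for each town
    posterior_mean = posterior_alpha / posterior_alpha.sum()

    for town, p in zip(towns, posterior_mean):
        print('%s: %.3f' % (town, p))

The prior pseudo-counts pull the estimates slightly toward uniform, which is why a town with a single observation still receives a noticeably non-zero probability.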