.. _gauss2d:
Inferring Gaussians with the Dirichlet Process Mixture Model
============================================================
--------------
Let's set up our environment
.. code:: python
%matplotlib inline
import matplotlib.pylab as plt
import numpy as np
import time
import seaborn as sns
import pandas as pd
sns.set_style('darkgrid')
sns.set_context('talk')
sns.set_palette("Set2", 30)
Now let's import our functions from datamicroscopes
.. code:: python
from microscopes.common.rng import rng
from microscopes.common.recarray.dataview import numpy_dataview
from microscopes.models import niw as normal_inverse_wishart
from microscopes.mixture.definition import model_definition
from microscopes.mixture import model, runner, query
from microscopes.common.query import zmatrix_heuristic_block_ordering, zmatrix_reorder
From here, we'll generate four isotropic 2D gaussian clusters in each
quadrant, varying the scale parameter
.. code:: python
nsamples_per_cluster = 100
means = np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]], dtype=np.float)
scales = np.array([0.08, 0.09, 0.1, 0.2])
Y_clusters = [
np.random.multivariate_normal(
mean=mu,
cov=var * np.eye(2),
size=nsamples_per_cluster)
for mu, var in zip(means, scales)]
df = pd.DataFrame()
for i, Yc in enumerate(Y_clusters):
cl = pd.DataFrame(Yc, columns = ['x','y'])
cl['cluster'] = i
df = df.append(cl)
Y = np.vstack(Y_clusters)
Y = np.random.permutation(Y)
.. code:: python
df.head()
.. raw:: html
|
x |
y |
cluster |
0 |
1.557005 |
1.266202 |
0 |
1 |
1.465262 |
0.842641 |
0 |
2 |
0.619352 |
1.309368 |
0 |
3 |
1.130965 |
0.700129 |
0 |
4 |
1.447409 |
1.112726 |
0 |
Let's have a look at the generated data
.. code:: python
sns.lmplot('x', 'y', hue="cluster", data=df, fit_reg=False)
plt.title('Simulated Gaussians: 4 Clusters')
.. parsed-literal::
.. image:: gauss2d_files/gauss2d_8_1.png
Now let's learn this clustering non-parametrically!
There are 5 steps necessary to set up your model:
1. Decide on the number of chains we want -- it is important to run
multiple chains from different starting points!
2. Define our DP-GMM model
3. Munge the data into numpy recarray format then wrap the data for our
model
4. Randomize start points
5. Create runners for each chain
.. code:: python
nchains = 8
# The random state object
prng = rng()
# Define a DP-GMM where the Gaussian is 2D
defn = model_definition(Y.shape[0], [normal_inverse_wishart(2)])
# Munge the data into numpy recarray format
Y_rec = np.array([(list(y),) for y in Y], dtype=[('', np.float32, 2)])
# Create a wrapper around the numpy recarray which
# data-microscopes understands
view = numpy_dataview(Y_rec)
# Initialize nchains start points randomly in the state space
latents = [model.initialize(defn, view, prng) for _ in xrange(nchains)]
# Create a runner for each chain
runners = [runner.runner(defn, view, latent, kernel_config=['assign']) for latent in latents]
We will visualize our data to examine the cluster assignment
.. code:: python
def plot_assignment(assignment, data=Y):
cl = pd.DataFrame(data, columns = ['x','y'])
cl['cluster'] = assignment
n_clusters = cl['cluster'].nunique()
sns.lmplot('x', 'y', hue="cluster", data=cl, fit_reg=False, legend=(n_clusters<10))
plt.title('Simulated Gaussians: %d Learned Clusters' % n_clusters)
Let's peek at the starting state for one of our chains
.. code:: python
plot_assignment(latents[0].assignments())
.. image:: gauss2d_files/gauss2d_14_0.png
Let's watch one of the chains evolve for a few steps
.. code:: python
first_runner = runners[0]
for i in xrange(5):
first_runner.run(r=prng, niters=1)
plot_assignment(first_runner.get_latent().assignments())
.. image:: gauss2d_files/gauss2d_16_0.png
.. image:: gauss2d_files/gauss2d_16_1.png
.. image:: gauss2d_files/gauss2d_16_2.png
.. image:: gauss2d_files/gauss2d_16_3.png
.. image:: gauss2d_files/gauss2d_16_4.png
Now let's burn all our runners in for 100 iterations.
We'll do this sequentially since the model is simple, but check
microscopes.parallel.runner for parallel implementions (with support for
either multiprocessing or multyvac)
.. code:: python
for runner in runners:
runner.run(r=prng, niters=100)
Let's now peek again at the first state
.. code:: python
plot_assignment(first_runner.get_latent().assignments())
.. image:: gauss2d_files/gauss2d_20_0.png
Let's build a z-matrix to compare our result with the rest of the chains
We'll be sure to sort our z-matrix before plotting. Sorting the
datapoints allows us to organize the clusters into a block matrix.
.. code:: python
infers = [r.get_latent() for r in runners]
zmat = query.zmatrix(infers)
ordering = zmatrix_heuristic_block_ordering(zmat)
zmat = zmatrix_reorder(zmat, ordering)
.. code:: python
sns.heatmap(zmat, linewidths=0, xticklabels=False, yticklabels=False)
plt.xlabel('entities (sorted)')
plt.ylabel('entities (sorted)')
plt.title('Z-matrix of Cluster Assignments')
.. parsed-literal::
.. image:: gauss2d_files/gauss2d_23_1.png