.. _ncluster: Finding the number of clusters with the Dirichlet Process ============================================================ -------------- The Iris Flower Dataset __ is a standard machine learning data set dating back to the 1930s. It contains measurements from 150 flowers. .. image:: iris_unlabeled.png If we wanted to learn these underlying species' measurements, we would use these real valued measurements and make assumptions about the structure of the data. We could assume that conditioned on assignment to species, the measurement data follwed a multivariate normal .. math:: P(\mathbf{x}|species=s)\sim\mathcal{N}(\mu_{s},\Sigma_{s}) In this case, the data comes from three species of flowers: - Iris Setosa - Iris Versicolor - Iris Virginica Plotting the data shows that indiviudal species exhibit a typical range of measurements .. image:: normal-inverse-wishart_files/normal-inverse-wishart_5_0.png The dimensionality of a clustering model is equal to the number of clusters in the model. Bayesian clustering algorithms often rely of the Dirichlet Distribution to encode prior information about these cluster assignments. The Dirichlet distribution __ (DD) can be considered a distribution of distributions. Each sample from the DD is a categorial distribution __ over :math:K categories. It is parameterized :math:G_0, a distribution over :math:K categories and :math:\alpha, a scale factor. The expected value of the DD is :math:G_0. The variance of the DD is a function of the scale factor. When :math:\alpha is large, samples from :math:DD(\alpha\cdot G_0) will be very close to :math:G_0. When :math:\alpha is small, samples will vary more widely. Most clustering algorithms require the number of clusters to be given in advance. If we don't know the number of species in the data, how would we be able to classify these flowers? The Dirichlet Process __ is a stochastic process used in Bayesian nonparametrics to cluster data without specifying the number of clusters in advance. The Dirichlet Process Prior is used to mathematically descirbe prior information about the number of clusters in the data. The Dirichlet Process Prior is nonparametric because its dimensionality is infinite. This characteristic is advantageous in clustering because it allows us to recognize new clusters when we observe new data. The Dirichlet Process can be considered a way to *generalize* the Dirichlet distribution. While the Dirichlet distribution is parameterized by a discrete distribution :math:G_0 and generates samples that are similar discrete distributions, the Dirichlet process is parameterized by a generic distribution :math:H_0 and generates samples which are distributions similar to :math:H_0. The Dirichlet process also has a parameter :math:\alpha that determines how similar how widely samples will vary from :math:H_0. When learning a clustering using a Dirichlet Process Prior, observations are probabilistically assigned to clusters based on the number of observations in that cluster :math:n_k. .. math:: P(\text{cluster=k})=\frac{n_k}{\alpha+n-1} Probability of new clusters are paramteterized by :math:\alpha .. math:: P(\text{cluster=new})=\frac{\alpha}{\alpha+n-1} In other words, these culsters exhbit a rich get richer property. The expected number of clusters in a dataset clustered with the Dirichlet Process is :math:O(\alpha\log(N)) The expected number of clusters with :math:m observations (Arratia et al., 2003) __ is :math:\frac{\alpha}{m} We can construct a sample :math:H (recall that :math:H is a probability distribution) from a Dirichlet process :math:\text{DP}(\alpha H_0) by drawing a countably infinite number of samples :math:\theta_k from :math:H_0) and setting: .. math:: H=\sum_{k=1}^\infty \pi_k \cdot\delta(x-\theta_k) where the :math:\pi_k are carefully chosen weights (more later) that sum to 1. (:math:\delta is the Dirac delta function __.) :math:H, a sample from :math:DP(\alpha H_0), is a *probability distribution* that looks similar to :math:H_0 (also a distribution). In particular, :math:H is a *discrete* distribution that takes the value :math:\theta_k with probability :math:\pi_k. This sampled distribution :math:H is a discrete distribution *even if* :math:H_0 *has continuous support*; the support __ of :math:H is a countably infinite subset of the support :math:H_0. The weights (:math:\pi_k values) of a Dirichlet process sample related the Dirichlet *process* back to the Dirichlet *distribution*. Gregor Heinrich __ writes: The defining property of the DP is that its samples have weights :math:\pi_k and locations :math:\theta_k distributed in such a way that when partitioning :math:S(H) into finitely many arbitrary disjoint subsets :math:S_1, \ldots, S_j :math:J<\infty, the sums of the weights :math:\pi_k in each of these :math:J subsets are distributed according to a Dirichlet distribution that is parameterized by :math:\alpha and a discrete base distribution (like :math:G_0) whose weights are equal to the integrals of the base distribution :math:H_0 over the subsets :math:S_n. As an example, Heinrich imagines a DP with a standard normal base measure :math:H_0\sim \mathcal{N}(0,1). Let :math:H be a sample from :math:DP(H_0) and partition the real line (the support of a normal distribution) as :math:S_1=(-\infty, -1], :math:S_2=(-1, 1], and :math:S_3=(1, \infty] then .. math:: H(S_1),H(S_2), H(S_3) \sim \text{Dir}\left(\alpha\,\text{erf}(-1), \alpha\,(\text{erf}(1) - \text{erf}(-1)), \alpha\,(1-\text{erf}(1))\right) where :math:H(S_n) be the sum of the :math:\pi_k values whose :math:\theta_k lie in :math:S_n. These :math:S_n subsets are chosen for convenience, however similar results would hold for *any* choice of :math:S_n. For any sample from a Dirichlet *process*, we can construct a sample from a Dirichlet *distribution* by partitioning the support of the sample into a finite number of bins. There are several equivalent ways to choose the :math:\pi_k so that this property is satisfied: the Chinese restaurant process, the stick-breaking process, and the PĆ³lya urn scheme. To generate :math:\left\{\pi_k\right\} according to a stick-breaking process we define :math:\beta_k to be a sample from :math:\text{Beta}(1,\alpha). :math:\pi_1 is equal to :math:\beta_1. Successive values are defined recursively as .. math:: \pi_k=\beta_k \prod_{j=1}^{k-1}(1-\beta_j). Thus, if we want to draw a sample from a Dirichlet process, we could, in theory, sample an infinite number of :math:\theta_k values from the base distribution :math:H_0, an infinite number of :math:\beta_k values from the Beta distribution. Of course, sampling an infinite number of values is easier in theory than in practice. However, by noting that the :math:\pi_k values are *positive* values summing to 1, we note that, in expectation, they must get increasingly small as :math:k\rightarrow\infty. Thus, we can reasonably approximate a sample :math:H\sim DP(\alpha H_0) by drawing *enough* samples such that :math:\sum_{k=1}^K \pi_k\approx 1. We use this method below to draw approximate samples from several Dirichlet processes with a standard normal (:math:\mathcal{N}(0,1)) base distribution but varying :math:\alpha values. Recall that a single sample from a Dirichlet process is a probability distribution over a countably infinite subset of the support of the base measure. The blue line is the PDF for a standard normal. The black lines represent the :math:\theta_k and :math:\pi_k values; :math:\theta_k is indicated by the position of the black line on the :math:x-axis; :math:\pi_k is proportional to the height of each line. We generate enough :math:\pi_k values are generated so their sum is greater than 0.99. When :math:\alpha is small, very few :math:\theta_k's will have corresponding :math:\pi_k values larger than :math:0.01. However, as :math:\alpha grows large, the sample becomes a more accurate (though still discrete) approximation of :math:\mathcal{N}(0,1). .. image:: 2015-07-28-dirichlet-distribution-dirichlet-process_files/2015-07-28-dirichlet-distribution-dirichlet-process_5_0.png .. image:: 2015-07-28-dirichlet-distribution-dirichlet-process_files/2015-07-28-dirichlet-distribution-dirichlet-process_5_1.png .. image:: 2015-07-28-dirichlet-distribution-dirichlet-process_files/2015-07-28-dirichlet-distribution-dirichlet-process_5_2.png .. image:: 2015-07-28-dirichlet-distribution-dirichlet-process_files/2015-07-28-dirichlet-distribution-dirichlet-process_5_3.png Often we want to draw samples from a *distribution sampled from a Dirichlet Process* instead of from the Dirichlet process itself. Much of the literature on the topic unhelpful refers to this as sampling from a Dirichlet process. Fortunately, we don't have to draw an infinite number of samples from the base distribution and stick breaking process to do this. Instead, we can draw these samples *as they are needed*. Suppose, for example, we know a finite number of the :math:\theta_k and :math:\pi_k values for a sample :math:H\sim \text{Dir}(\alpha H_0). For example, we know .. math:: \pi_1=0.5,\; \pi_3=0.3,\; \theta_1=0.1,\; \theta_2=-0.5. To sample from :math:H, we can generate a uniform random :math:u number between 0 and 1. If :math:u is less than 0.5, our sample is :math:0.1. If :math:0.5<=u<0.8, our sample is :math:-0.5. If :math:u>=0.8, our sample (from :math:H will be a new sample :math:\theta_3 from :math:H_0. At the same time, we should also sample and store :math:\pi_3. When we draw our next sample, we will again draw :math:u\sim\text{Uniform}(0,1) but will compare against :math:\pi_1, \pi_2, AND :math:\pi_3. The class below will take a base distribution :math:H_0 and :math:\alpha as arguments to its constructor. The class instance can then be called to generate samples from :math:H\sim \text{DP}(\alpha H_0). .. code:: python from numpy.random import choice class DirichletProcessSample(): def __init__(self, base_measure, alpha): self.base_measure = base_measure self.alpha = alpha self.cache = [] self.weights = [] self.total_stick_used = 0. def __call__(self): remaining = 1.0 - self.total_stick_used i = DirichletProcessSample.roll_die(self.weights + [remaining]) if i is not None and i < len(self.weights) : return self.cache[i] else: stick_piece = beta(1, self.alpha).rvs() * remaining self.total_stick_used += stick_piece self.weights.append(stick_piece) new_value = self.base_measure() self.cache.append(new_value) return new_value @staticmethod def roll_die(weights): if weights: return choice(range(len(weights)), p=weights) else: return None This Dirichlet process class could be called *stochastic memoization*. This idea was first articulated in somewhat abstruse terms by Daniel Roy, et al __. Below are histograms of 10000 samples drawn from *samples* drawn from Dirichlet processes with standard normal base distribution and varying :math:\alpha values. .. code:: python import pandas as pd base_measure = lambda: norm().rvs() n_samples = 10000 samples = {} for alpha in [1, 10, 100, 1000]: dirichlet_norm = DirichletProcessSample(base_measure=base_measure, alpha=alpha) plt.figure(figsize=(9,6)) pd.Series([dirichlet_norm() for _ in range(n_samples)]).hist() plt.title('Alpha: %s' % alpha) plt.savefig('Alpha_Hist_%s.png') .. image:: 2015-07-28-dirichlet-distribution-dirichlet-process_files/2015-07-28-dirichlet-distribution-dirichlet-process_9_0.png .. image:: 2015-07-28-dirichlet-distribution-dirichlet-process_files/2015-07-28-dirichlet-distribution-dirichlet-process_9_1.png .. image:: 2015-07-28-dirichlet-distribution-dirichlet-process_files/2015-07-28-dirichlet-distribution-dirichlet-process_9_2.png .. image:: 2015-07-28-dirichlet-distribution-dirichlet-process_files/2015-07-28-dirichlet-distribution-dirichlet-process_9_3.png Note that these histograms look very similar to the corresponding plots of sampled distributions above. However, these histograms are plotting *points sampled from a distribution sampled from a Dirichlet process* while the plots above were showing approximate *distributions samples from the Dirichlet process*. Of course, as the number of samples from each :math:H grows large, we would expect the histogram to be a very good empirical approximation of :math:H. For more information on the Dirichlet Process, please see our developer Tim Hopper's additional notebooks __.