The Hierarchical Dirichlet Process (HDP) is typically used for topic modeling when the number of topics is unknown in advance, and it can be seen as a nonparametric extension of Latent Dirichlet Allocation (LDA).
In Bayesian topic modeling, individual words in each document are assigned to one of \(K\) topics, so each document has its own distribution over topics and each topic has its own distribution over words.
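Where LDA fixes \(K\) ahead of time, the HDP lets the data determine how many topics are used. As a sketch of the generative process, stated in terms of the concentration parameters \(\gamma\) and \(\alpha\) that appear as hyperparameters below: a global topic distribution \(G_0\) is drawn from a Dirichlet process over a base measure \(H\), each document \(d\) draws its own topic distribution \(G_d\) from a second Dirichlet process centered on \(G_0\), and each word is drawn from a topic sampled from its document's distribution:
\[
G_0 \sim \mathrm{DP}(\gamma, H), \qquad G_d \sim \mathrm{DP}(\alpha, G_0), \qquad \phi_{dn} \sim G_d, \qquad w_{dn} \sim \mathrm{Multinomial}(\phi_{dn})
\]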
Let’s explore the topics of the political blog, The Daily Kos using data from the UCI Machine Learning Repository.
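The KOS data come as two files in the UCI bag-of-words format: vocab.kos.txt, with one word per line, and docword.kos.txt, with a three-line header followed by docID wordID count triples. Here is a minimal sketch of loading them into docs (each document represented as a flat list of word indices) and vocab_size, the inputs used by the code below; the file names and this list-of-indices representation are assumptions of the sketch, not something the library dictates.

# a minimal sketch of loading the UCI KOS bag-of-words data
with open("vocab.kos.txt") as f:
    vocab = [line.strip() for line in f]
vocab_size = len(vocab)

with open("docword.kos.txt") as f:
    n_docs = int(f.readline())       # number of documents
    f.readline()                     # vocabulary size (same as len(vocab))
    f.readline()                     # number of nonzero (doc, word) entries
    docs = [[] for _ in range(n_docs)]
    for line in f:
        doc_id, word_id, count = (int(x) for x in line.split())
        docs[doc_id - 1].extend([word_id - 1] * count)   # ids in the file are one-based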
We must define our model before we initialize it. In this case, the definition needs the number of documents and the vocabulary size.
From there, we can initialize our model and set the hyperparameters:
# import paths follow the datamicroscopes package layout (microscopes.lda)
from microscopes.common.rng import rng
from microscopes.lda.definition import model_definition
from microscopes.lda.model import initialize
from microscopes.lda import runner

defn = model_definition(len(docs), vocab_size)        # number of documents, vocabulary size
prng = rng()
kos_state = initialize(defn, docs, prng,
                       vocab_hp=1,                          # topic-word Dirichlet hyperparameter
                       dish_hps={"alpha": 1, "gamma": 1})   # HDP concentration parameters
r = runner.runner(defn, docs, kos_state)
print "number of docs:", defn.n, "vocabulary size:", defn.v
number of docs: 3430 vocabulary size: 6906
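Before inspecting the topics we need to actually run inference. A minimal sketch, assuming the LDA runner exposes the same run(r, niters) method as the other datamicroscopes runners:

r.run(r=prng, niters=1000)   # advance the Gibbs sampler for 1000 iterations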
After running our model, we can visualize our topics using pyLDAvis, a Python implementation of the LDAvis tool created by Carson Sievert.
LDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.
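Here is a sketch of handing the fitted model to pyLDAvis: pyLDAvis.prepare takes the topic-word and document-topic matrices (the \(\Phi\) and \(\Theta\) discussed below) along with document lengths, the vocabulary, and corpus-wide term frequencies. In this sketch, phi and theta are placeholders for whatever accessors your version of microscopes-lda provides to pull those matrices out of kos_state.

import numpy as np
import pyLDAvis

# phi: topics x vocabulary word probabilities; theta: documents x topics weights
# (both assumed to have been extracted from the fitted kos_state)
doc_lengths = [len(doc) for doc in docs]
term_frequency = np.bincount([w for doc in docs for w in doc], minlength=vocab_size)
vis_data = pyLDAvis.prepare(topic_term_dists=phi,
                            doc_topic_dists=theta,
                            doc_lengths=doc_lengths,
                            vocab=vocab,
                            term_frequency=term_frequency)
pyLDAvis.display(vis_data)   # renders the interactive visualization in the notebook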
Here are the ten most relevant words for each topic:
dean edwards clark gephardt iowa primary lieberman democratic deans kerry
iraq war iraqi military troops soldiers saddam baghdad forces american
cheney guard bush mccain boat swift john kerry service president
senate race elections seat district republican house money gop carson
court delay marriage law ballot gay amendment rights ethics committee
kerry bush percent voters poll general polling polls results florida
people media party bloggers political time america politics internet message
november account electoral governor sunzoo contact faq password login dkosopedia
administration bush tax jobs commission health billion budget white cuts
ethic weber ndp exemplary bloc passion acts identifying realities responsibility
williams orleans freeper ditka excerpt picks anxiety edit trade edited
science scientists space environmental cell reagan emissions species cells researchers
zimbabwe blessed kingdom heaven interpretation anchor christ cargo clip owners
smallest demonstration located unusual zahn predebate auxiliary endangered merit googlebomb
*Topic Prediction*
We can also predict how the topics will be distributed within an arbitrary document.
Let’s create a document from the 100 most relevant words in the 7th topic.
local reading party channel radio guys america dnc school ive sense hour movie times conservative ndn tnr talking coverage political heard lot air tom liberal moment discussion youre campaign learn wont bloggers heres traffic thought live speech media network isnt makes means boston weve organizations kind community blogging schaller jul readers convention night love life write news family find left read years real piece guest side event politics internet youll blogs michael long line ads dlc blog audience black stuff flag article country message journalists book blogosphere time tonight put ill email film fun people online matter blades theyre web
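To score this document with the model, we map each word back to its vocabulary index. A small sketch (the string below abbreviates the full 100-word list above):

word_index = {w: i for i, w in enumerate(vocab)}
words = "local reading party channel radio guys america dnc school ...".split()   # full list abbreviated
topic7_doc = [word_index[w] for w in words if w in word_index]   # document as word indices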
Similarly, if we create a document from words in both the 1st and 7th topics, the model should predict that the document is generated mostly by those two topics. We’ll plot these two documents to compare their distributions over topics.
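A sketch of that comparison plot, assuming topic7_dist and topic1_7_dist are the predicted topic-weight vectors for the two synthetic documents (how they are obtained from the fitted state is not spelled out here):

import numpy as np
import matplotlib.pyplot as plt

# side-by-side bars comparing the two documents' predicted topic weights
idx = np.arange(len(topic7_dist))
plt.bar(idx - 0.2, topic7_dist, width=0.4, label="words from topic 7")
plt.bar(idx + 0.2, topic1_7_dist, width=0.4, label="words from topics 1 and 7")
plt.xlabel("topic")
plt.ylabel("predicted weight")
plt.legend()
plt.show()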
*Topic and Term Distributions*
Of course, we can also get the topic distribution for each document (commonly called \(\Theta\)).
We can also get the raw word distribution for each topic (commonly called \(\Phi\)). This is related to the word relevance. Here are the most common words in one of the topics.
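A sketch of working with those two matrices, assuming theta is a documents-by-topics array and phi is a topics-by-vocabulary array (again, the accessors that produce them from kos_state are not spelled out here):

import numpy as np

# theta[d] is document d's distribution over topics;
# phi[k] is topic k's distribution over the vocabulary
doc_id, topic_id = 0, 0
print "topic weights for document", doc_id, ":", theta[doc_id]
top_words = np.argsort(phi[topic_id])[::-1][:10]   # ten most probable words in the topic
print "most common words in topic", topic_id, ":", [vocab[w] for w in top_words]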
To use our HDP, install our library with conda:
$ conda install microscopes-lda