.. _topic:

Bayesian Nonparametric Topic Modeling with the Daily Kos
=========================================================

The Hierarchical Dirichlet Process (HDP) is typically used for topic
modeling when the number of topics is unknown, and can be seen as an
extension of Latent Dirichlet Allocation (LDA). In Bayesian topic
modeling, individual words in each document are assigned to one of
:math:`K` topics. This means each document has its own distribution
over topics, and each topic has its own distribution over words.

Let's explore the topics of the political blog The Daily Kos, using
data from the UCI Machine Learning Repository.

We must define our model before we initialize it. In this case, we
need the number of documents and the vocabulary size. From there, we
can initialize our model and set the hyperparameters.

.. code:: python

    # assumed datamicroscopes imports (module paths may vary by version)
    from microscopes.common.rng import rng
    from microscopes.lda.definition import model_definition
    from microscopes.lda.model import initialize
    from microscopes.lda import runner

    defn = model_definition(len(docs), vocab_size)
    prng = rng()
    kos_state = initialize(defn, docs, prng,
                           vocab_hp=1,
                           dish_hps={"alpha": 1, "gamma": 1})
    r = runner.runner(defn, docs, kos_state)
    print "number of docs:", defn.n, "vocabulary size:", defn.v

.. parsed-literal::

    number of docs: 3430 vocabulary size: 6906

After running our model, we can visualize our topics using pyLDAvis.
pyLDAvis is a Python implementation of the LDAvis tool created by
Carson Sievert. LDAvis is designed to help users interpret the topics
in a topic model that has been fit to a corpus of text data. The
package extracts information from a fitted LDA topic model to inform
an interactive web-based visualization.

.. interactive pyLDAvis visualization omitted
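To make the assignment-based view concrete, here is a minimal pure-Python sketch, separate from the datamicroscopes API, of how fixed word-topic assignments yield a per-document topic distribution (a row of :math:`\Theta`) and a per-topic word distribution (a row of :math:`\Phi`). The toy documents and assignments below are invented for illustration; a real sampler infers the assignments.

```python
from collections import Counter

# Toy corpus: each document is a list of (word, topic_assignment) pairs.
docs = [
    [("iraq", 0), ("war", 0), ("troops", 0), ("kerry", 1)],
    [("kerry", 1), ("bush", 1), ("poll", 1), ("war", 0)],
]
K = 2  # number of topics

def doc_topic_dist(doc, K):
    """Per-document distribution over topics (one row of Theta)."""
    counts = Counter(topic for _, topic in doc)
    return [counts[k] / float(len(doc)) for k in range(K)]

def topic_word_dist(docs, k):
    """Per-topic distribution over words (one row of Phi)."""
    counts = Counter(w for doc in docs for w, topic in doc if topic == k)
    total = sum(counts.values())
    return {w: c / float(total) for w, c in counts.items()}

print(doc_topic_dist(docs[0], K))  # -> [0.75, 0.25]
print(topic_word_dist(docs, 0))    # "war" gets 0.5: two of the four topic-0 words
```

With collapsed Gibbs sampling these counts are exactly what gets tracked (plus Dirichlet smoothing from the hyperparameters), which is why recovering :math:`\Theta` and :math:`\Phi` after a run is just normalized counting.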
Here are the ten most relevant words for each topic:

.. parsed-literal::

    dean edwards clark gephardt iowa primary lieberman democratic deans kerry
    iraq war iraqi military troops soldiers saddam baghdad forces american
    cheney guard bush mccain boat swift john kerry service president
    senate race elections seat district republican house money gop carson
    court delay marriage law ballot gay amendment rights ethics committee
    kerry bush percent voters poll general polling polls results florida
    people media party bloggers political time america politics internet message
    november account electoral governor sunzoo contact faq password login dkosopedia
    administration bush tax jobs commission health billion budget white cuts
    ethic weber ndp exemplary bloc passion acts identifying realities responsibility
    williams orleans freeper ditka excerpt picks anxiety edit trade edited
    science scientists space environmental cell reagan emissions species cells researchers
    zimbabwe blessed kingdom heaven interpretation anchor christ cargo clip owners
    smallest demonstration located unusual zahn predebate auxiliary endangered merit googlebomb

Topic Prediction
----------------

We can also predict how the topics will be distributed within an
arbitrary document. Let's create a document from the 100 most relevant
words in the 7th topic.
.. parsed-literal::

    local reading party channel radio guys america dnc school ive
    sense hour movie times conservative ndn tnr talking coverage political
    heard lot air tom liberal moment discussion youre campaign learn
    wont bloggers heres traffic thought live speech media network isnt
    makes means boston weve organizations kind community blogging schaller jul
    readers convention night love life write news family find left
    read years real piece guest side event politics internet youll
    blogs michael long line ads dlc blog audience black stuff
    flag article country message journalists book blogosphere time tonight put
    ill email film fun people online matter blades theyre web

Similarly, if we create a document from words in both the 1st and 7th
topics, our prediction is that the document is generated mostly by
those two topics. We'll plot these two documents to compare their
distributions over topics.

.. image:: hdp_files/hdp_28_1.png

Topic and Term Distributions
----------------------------

Of course, we can also get the topic distribution for each document
(commonly called :math:`\Theta`).

.. image:: hdp_files/hdp_31_1.png

We can also get the raw word distribution for each topic (commonly
called :math:`\Phi`). This is related to the *word relevance*. Here
are the most common words in one of the topics.

.. image:: hdp_files/hdp_33_1.png

To use our HDP, install our library in conda:
``$ conda install microscopes-lda``
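The *word relevance* used by LDAvis to rank terms within a topic has a simple definition: a weighted combination of a term's log probability under the topic and its log lift over its corpus-wide probability. Here is a minimal sketch of that formula; the probabilities passed in below are made-up numbers for illustration, not values from the Daily Kos model.

```python
import math

def relevance(phi_kw, p_w, lam=0.6):
    """LDAvis-style relevance of term w to topic k:
    lam * log(phi_kw) + (1 - lam) * log(phi_kw / p_w),
    where phi_kw is the term's probability under the topic and p_w its
    overall corpus probability. lam = 1 ranks by raw topic probability;
    lam = 0 ranks purely by lift."""
    return lam * math.log(phi_kw) + (1 - lam) * math.log(phi_kw / p_w)

# A term that is common in the topic but rare in the corpus outranks a
# term that is equally common everywhere (no lift).
print(relevance(0.04, 0.001))  # strong topic word: high lift
print(relevance(0.02, 0.02))   # generic word: lift term is log(1) = 0
```

Intermediate values of ``lam`` trade off between showing a topic's most frequent words and its most distinctive ones, which is why the relevance-ranked lists above surface topic-specific terms rather than corpus-wide stopwords.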