Getting Started with Topic Modeling and MALLET
To create an environment variable in Windows 7, click on your
Start Menu -> Control Panel -> System -> Advanced System Settings
(Figures 1, 2, 3). Click 'New' and type MALLET_HOME in the variable name box. It must be exactly like this – all caps, with an underscore – since that is the shortcut that the programmer built into the program and all of its subroutines. Then type the exact path (location) of where you unzipped MALLET in the variable value box, e.g., c:\mallet.
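Alternatively, the same variable can be set from a Command Prompt using Windows' built-in setx command; the path here is an example and should match wherever you actually unzipped MALLET:

```
setx MALLET_HOME "C:\mallet"
```

Note that setx only affects Command Prompt windows opened after you run it, not the current one.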
bin\mallet import-dir --help
To work with this corpus and find out what topics compose these individual documents, we need to transform the several individual text files into a single MALLET-format file. MALLET can import more than one file at a time: we can import the entire directory of text files using the import-dir command, which turns the directory into a MALLET file, keeps the original texts in the order in which they were listed, and strips out the stop words (words such as and, the, but, and if, which occur in such frequencies that they obstruct analysis) using the default English stop-words dictionary. The resulting file contains all of your data, in a format that MALLET can work with.
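For reference, a typical form of that import command on Windows looks like the following; the --input path is a placeholder for wherever your text files live, --keep-sequence preserves the original order, and --remove-stopwords applies the default English stop-words dictionary:

```
bin\mallet import-dir --input pathway\to\the\directory\with\the\files --output tutorial.mallet --keep-sequence --remove-stopwords
```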
bin\mallet train-topics --input tutorial.mallet
This command opens your tutorial.mallet file and runs the topic-model routine on it using only the default settings, iterating through the routine to find the best division of words into topics. MALLET includes an element of randomness, so the keyword lists will look different every time the program is run, even on the same set of data.
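If you need identical output across runs (for example, while writing up results), train-topics accepts a --random-seed option that fixes that randomness; the value 42 below is arbitrary:

```
bin\mallet train-topics --input tutorial.mallet --random-seed 42
```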
bin\mallet train-topics --input tutorial.mallet --num-topics 20 --output-state topic-state.gz --output-topic-keys tutorial_keys.txt --output-doc-topics tutorial_composition.txt
The second number in each paragraph is the Dirichlet parameter for the topic. This is related to an option which we did not set, and so its default value was used (this is why every topic in this file has the number 2.5).
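To make the layout of tutorial_keys.txt concrete: each line is tab-separated, holding the topic number, the Dirichlet parameter, and the top keywords for that topic. A quick sketch for pulling those columns apart (the column layout described here is an assumption you can verify against your own output):

```shell
# Print each topic's number, weight, and keywords.
# Assumed tab-separated columns in tutorial_keys.txt:
#   1: topic number   2: Dirichlet parameter   3: keywords
awk -F'\t' '{ printf "topic %s (weight %s): %s\n", $1, $2, $3 }' tutorial_keys.txt
```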
If, when you ran the topic model routine, you had included --optimize-interval 20, as in
bin\mallet train-topics --input tutorial.mallet --num-topics 20 --optimize-interval 20 --output-state topic-state.gz --output-topic-keys tutorial_keys.txt --output-doc-topics tutorial_composition.txt
then the second number in each line of tutorial_keys.txt would reflect the fitted weight of each topic rather than the default value. That is, the first number is the topic (topic 0), and the second number gives an indication of the weight of that topic. In general, including
--optimize-interval
leads to better topics.
What topics compose your documents? The answer is in the
tutorial_composition.txt
file. How do you know how many topics to search for? Is there a natural number of topics? What we have found is that one has to run train-topics with varying numbers of topics to see how the composition file breaks down. If we end up with the majority of our original texts all in a very limited number of topics, we take that as a signal that we need to increase the number of topics; the settings were too coarse. There are computational ways of searching for this, including using MALLET's
hlda command
, but for the reader of this tutorial, it is probably just quicker to cycle through a number of iterations.

For extensive background and bibliography on topic modeling:
- Scott Weingart's 'Guided Tour to Topic Modeling' is a good place to begin.
- Ted Underwood’s ‘Topic modeling made just simple enough’ is an important discussion on interpreting the meaning of topics.
- Lisa Rhody’s post on interpreting topics is also illuminating. ‘Some Assembly Required’ Lisa @ Work August 22, 2012.
- Clay Templeton, ‘Topic Modeling in the Humanities: An Overview | Maryland Institute for Technology in the Humanities’, n.d.
- David Blei, Andrew Ng, and Michael Jordan, ‘Latent dirichlet allocation,’ The Journal of Machine Learning Research 3 (2003).
- Finally, also consult David Mimno's bibliography of topic modeling articles. They're tagged by topic to make finding the right one for a particular application that much easier. Also take a look at his recent article on computational historiography in the ACM Journal on Computing and Cultural Heritage, which goes through a hundred years of Classics journals to learn something about the field. While the article should be read as a good example of topic modeling, his 'Methods' section is especially important, in that it discusses preparing text for this sort of analysis.
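To close the loop on the tutorial_composition.txt file discussed above, here is a quick sketch for pulling out each document's dominant topic. It assumes the older MALLET composition layout: a header line, then document index, filename, and topic/proportion pairs sorted by weight. Newer MALLET versions instead print one proportion per topic in topic order, so check your own output first:

```shell
# Show the single most prominent topic for each document.
# Assumed tab-separated columns after the header line:
#   1: doc index  2: filename  3,4: strongest topic and its proportion, ...
awk -F'\t' 'NR > 1 { printf "%s -> topic %s (%.0f%%)\n", $2, $3, $4 * 100 }' tutorial_composition.txt
```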