Getting Started with Topic Modeling and MALLET



To create an environment variable in Windows 7, click on your Start Menu -> Control Panel -> System -> Advanced System Settings (Figures 1, 2, 3). Click New and type MALLET_HOME in the variable name box. It must be entered exactly like this, all caps with an underscore, because that is the name that the program and all of its subroutines expect. Then type the exact path (location) of where you unzipped MALLET in the variable value box, e.g., c:\mallet.
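If you prefer the command line, you can set the same variable from a command prompt with the built-in setx utility (a minimal sketch, assuming you unzipped MALLET to c:\mallet; open a new command prompt afterwards, since the change does not affect the window you set it in):

setx MALLET_HOME "C:\mallet"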

You can see the full list of options for importing data by typing:

bin\mallet import-dir --help
To work with this corpus and find out what topics compose these individual documents, we need to transform them from several individual text files into a single MALLET-format file. MALLET can import more than one file at a time: we can import an entire directory of text files using the import-dir command. The command below imports the directory, turns it into a MALLET file, keeps the original texts in the order in which they were listed, and strips out stop words (words such as and, the, but, and if, which occur in such frequencies that they obstruct analysis) using the default English stop-words dictionary.
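Assuming your text files sit in a directory such as sample-data\web\en (the sample data shipped with MALLET; substitute the path to your own folder of .txt files), the import command looks something like this:

bin\mallet import-dir --input sample-data\web\en --output tutorial.mallet --keep-sequence --remove-stopwords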

This file now contains all of your data, in a format that MALLET can work with.
To train a topic model on it using the default settings, type:

bin\mallet train-topics --input tutorial.mallet

This command opens your tutorial.mallet file and runs the topic model routine on it using only the default settings. As it iterates through the routine, it tries to find the best division of words into topics. MALLET includes an element of randomness, so the keyword lists will look different every time the program is run, even on the same set of data.
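If you need repeatable results, for instance while you experiment with other settings, train-topics also accepts a --random-seed option that fixes the sampler's seed (run bin\mallet train-topics --help to confirm the options available in your version; the value 42 below is an arbitrary choice):

bin\mallet train-topics --input tutorial.mallet --random-seed 42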

The command below adds several options to the basic run:

bin\mallet train-topics --input tutorial.mallet --num-topics 20 --output-state topic-state.gz --output-topic-keys tutorial_keys.txt --output-doc-topics tutorial_composition.txt

This run asks for 20 topics and writes out, among other things, a list of the top key words for each topic (tutorial_keys.txt) and a breakdown of which topics compose each of your documents (tutorial_composition.txt). If you open tutorial_keys.txt, you will see one paragraph for each of the 20 topics. The second number in each paragraph is the Dirichlet parameter for the topic. It is related to an option which we did not set, so its default value was used (which is why every topic in this file has the number 2.5).
If, when you ran the topic model routine, you had included

--optimize-interval 20

as in:

bin\mallet train-topics --input tutorial.mallet --num-topics 20 --optimize-interval 20 --output-state topic-state.gz --output-topic-keys tutorial_keys.txt --output-doc-topics tutorial_composition.txt
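then the second number in each paragraph of tutorial_keys.txt would no longer be 2.5 but a weight learned from the data, and each line would take roughly this form (the weight 0.02995 and the bracketed key words are placeholders; every run will produce different values and words):

0   0.02995   <the twenty words most strongly associated with topic 0>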

That is, the first number is the topic (topic 0), and the second number gives an indication of the weight of that topic. In general, including --optimize-interval leads to better topics.


What topics compose your documents? The answer is in the tutorial_composition.txt file. Open it in a spreadsheet program: each row corresponds to one of your original text files and shows how much of that document is given over to each topic.

How do you know the number of topics to search for? Is there a natural number of topics? What we have found is that one has to run train-topics with varying numbers of topics to see how the composition file breaks down. If we end up with the majority of our original texts crowded into a very limited number of topics, we take that as a signal that we need to increase the number of topics; the settings were too coarse. There are computational ways of searching for an appropriate number, including MALLET's hlda command, but for the reader of this tutorial it is probably quicker to cycle through a number of runs by hand, as in the sketch below.
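One way to do that cycling from the Windows command prompt is a simple for loop that reruns train-topics with a few different values of --num-topics and writes each result to its own files (a sketch only; the topic counts 10, 20 and 50 and the output file names are arbitrary choices):

for %i in (10 20 50) do bin\mallet train-topics --input tutorial.mallet --num-topics %i --optimize-interval 20 --output-topic-keys tutorial_keys_%i.txt --output-doc-topics tutorial_composition_%i.txt

(If you put this in a batch file rather than typing it at the prompt, write %%i instead of %i.)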

