By Lindsay Thomas
The good news about using MALLET is that installing it on your computer is the hardest part. Here are some links to tutorials that include good installation instructions for both Windows and Mac machines:
If you’re using a Mac computer, make sure to unzip the mallet-2.0.8 file in your home (i.e., user) folder. If you’re using a Windows machine, move it into your C:\ drive. For Windows machines, MALLET will not work unless it is located directly in your C:\ drive (i.e., do not put it in another folder inside your C:\ drive).
To check if you’ve installed MALLET correctly:
Navigate to the mallet-2.0.8 folder.
Type bin\mallet (Windows) or bin/mallet (Mac).
If MALLET is installed correctly on your computer, you will now see a list of MALLET commands. If it isn’t installed correctly, you’ll see an error message. Check to make sure that you’ve typed the command correctly, that you’re actually located within the MALLET folder, and that you set up your environment variable correctly.
I’ve provided you with a zip file of 1000 text files drawn from the WhatEvery1Says project data. This folder includes 500 newspaper articles classified as being about the humanities and 500 newspaper articles classified as being about science (if you took my class last spring, you worked with this data in our Voyant lab). Unzip this file in the mallet-2.0.8 folder you just downloaded and installed. This means that your MALLET folder should now also include an unzipped folder titled we1s-data that includes the text files we will use in this tutorial.
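If you want to confirm that the data unzipped into the right place, a short Python script can count the files for you. This is only a sketch: it assumes you run it from inside the mallet-2.0.8 folder and that the articles are saved with a .txt extension, which this tutorial does not state. The function name count_text_files is made up for illustration.

```python
from pathlib import Path


def count_text_files(folder):
    """Return the number of .txt files directly inside `folder`."""
    # Assumption: the corpus files end in .txt; adjust the pattern if yours differ.
    return sum(1 for path in Path(folder).glob("*.txt") if path.is_file())


if __name__ == "__main__":
    data_folder = Path("we1s-data")
    if not data_folder.is_dir():
        print("Could not find we1s-data; are you in the mallet-2.0.8 folder?")
    else:
        print(f"Found {count_text_files(data_folder)} text files in {data_folder}")
```

If everything unzipped as described above, the count should match the 1000 files in the zip.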
To work with text data in MALLET, we first have to transform our corpus of text files into a MALLET format file. We do this using the import-dir command. We will discuss the components of this command during class on March 9.
Making sure you are in the mallet-2.0.8 folder, type the command below:
Windows:
bin\mallet import-dir --input we1s-data --output we1s.mallet --keep-sequence --remove-stopwords
Mac:
bin/mallet import-dir --input we1s-data --output we1s.mallet --keep-sequence --remove-stopwords
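To get a feel for what the --remove-stopwords and --keep-sequence options ask for, here is a simplified Python illustration. It is not MALLET's own code. The tiny stoplist and the tokenizing rule are assumptions made up for this example (MALLET's built-in English stoplist is much longer and its tokenizer is more careful). The idea is the same, though: very common words are dropped, and the words that remain stay in their original order rather than being collapsed into counts.

```python
import re

# Hypothetical mini stoplist for illustration only.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop stopwords while leaving the remaining tokens in their original order."""
    return [token for token in tokens if token not in stopwords]


sentence = "The study of the humanities is central to a liberal education"
print(remove_stopwords(tokenize(sentence)))
# prints ['study', 'humanities', 'central', 'liberal', 'education']
```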
Now that we have created a MALLET format file, we can run a topic model. We do this using the train-topics command. There are many different parameters we can use to customize our model and model output; these are listed in the MALLET Topic Modeling documentation. We will discuss the components of this command during class on March 9.
Making sure you are still in the mallet-2.0.8 folder, type the command below:
Windows:
bin\mallet train-topics --input we1s.mallet --num-topics 25 --optimize-interval 10 --output-state we1s-topic-state.gz --output-topic-keys we1s-keys.txt --output-doc-topics we1s-composition.txt --word-topic-counts-file we1s-topic-counts.txt --diagnostics-file we1s-diagnostics.xml
Mac:
bin/mallet train-topics --input we1s.mallet --num-topics 25 --optimize-interval 10 --output-state we1s-topic-state.gz --output-topic-keys we1s-keys.txt --output-doc-topics we1s-composition.txt --word-topic-counts-file we1s-topic-counts.txt --diagnostics-file we1s-diagnostics.xml
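Once the model finishes, the two plain-text files requested by --output-topic-keys and --output-doc-topics can be read with a short script. The sketch below rests on assumptions about the file layouts that you should check against your own output. It assumes each line of the keys file holds a topic number, a weight, and the top words, separated by tabs. It assumes each line of the composition file holds a document number, a document name, and then one proportion per topic in topic order. The function names parse_topic_keys and dominant_topics are made up for this example.

```python
from pathlib import Path


def parse_topic_keys(lines):
    """Parse topic keys lines into (topic_id, weight, top_words) tuples."""
    topics = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Assumed layout: topic number <tab> weight <tab> space-separated words
        topic_id, weight, words = line.rstrip("\n").split("\t", 2)
        topics.append((int(topic_id), float(weight), words.split()))
    return topics


def dominant_topics(lines):
    """Map each document name to the topic number with its largest proportion."""
    result = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and any header line
        # Assumed layout: doc number <tab> doc name <tab> one proportion per topic
        fields = line.rstrip("\n").split("\t")
        proportions = [float(value) for value in fields[2:] if value]
        result[fields[1]] = max(range(len(proportions)), key=proportions.__getitem__)
    return result


if __name__ == "__main__":
    keys_path = Path("we1s-keys.txt")
    if keys_path.exists():
        with keys_path.open(encoding="utf-8") as handle:
            for topic_id, weight, words in parse_topic_keys(handle):
                # Show the first five top words for each topic
                print(topic_id, " ".join(words[:5]))
```

Run from inside the mallet-2.0.8 folder, this prints each topic number followed by its first five top words. dominant_topics can be called the same way on the lines of we1s-composition.txt.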
We will look at each of these files together, but here is a list of the files we asked MALLET to produce for us (after the topic modeling process completes, you should find these in your mallet-2.0.8 folder):
we1s-topic-state.gz: A compressed file containing every word in your corpus and the topic it is assigned to.
we1s-keys.txt: A text file displaying the top words for each topic.
we1s-composition.txt: A text file indicating, as a proportion, how much each topic contributes to each document in your corpus.
we1s-topic-counts.txt: A text file indicating which topics each unique word in your corpus contributes to.
we1s-diagnostics.xml: An XML file containing a variety of diagnostics evaluating the overall fit and accuracy of your model. Find out more about what these metrics are in the MALLET documentation.
You can also use Voyant to do topic modeling. Voyant uses the same kind of topic modeling that MALLET uses (written by the same person). Here’s how to do topic modeling using Voyant: