Introduction Tutorial

In this tutorial we will examine topik with a practical example: Topic Modeling for Movie Reviews.

  • The Movie Review Dataset
  • Using the high-level interface run_topic_model
  • Creating your own custom topic modeling flow
  • Analyzing the results

The Movie Review Dataset

In this tutorial we are going to use the Sentiment Polarity Dataset Version 2.0 from Bo Pang and Lillian Lee. This dataset is distributed with NLTK with permission from the authors.

You can download this dataset on its own from NLTK, or download all of NLTK’s datasets, by running NLTK’s downloader from the Python interpreter.
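A minimal sketch of that download step, assuming NLTK is installed, a network connection is available, and the corpus is published under the NLTK identifier movie_reviews. The helper name download_corpus and the two constants are ours, used only for illustration:

```python
# Sketch only: assumes NLTK is installed and a network connection is available.
MOVIE_REVIEWS_ID = "movie_reviews"  # NLTK identifier of the polarity corpus
ALL_DATA_ID = "all"                 # NLTK identifier of the full data collection


def download_corpus(package_id=MOVIE_REVIEWS_ID):
    """Fetch one NLTK data package; return True if the download succeeded."""
    # Imported here so that defining this helper does not itself require NLTK.
    import nltk

    return nltk.download(package_id)
```

From the interpreter, download_corpus() fetches only the movie reviews, while download_corpus(ALL_DATA_ID) fetches every NLTK dataset, which is a much larger download.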

For more information on the datasets and download options, see the NLTK data documentation.

Instead of using the dataset for sentiment analysis, its original purpose, we’ll perform topic modeling on the movie reviews. To do that, we’ll merge the two folders, pos and neg, into a single folder named reviews.
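One way to merge the two folders is sketched below. It assumes the downloaded corpus directory contains the pos and neg subfolders with one .txt file per review. The function name merge_review_folders and the choice to prefix each copied file with its original folder name (so that same-named files cannot overwrite each other) are our own, not part of topik or NLTK:

```python
import shutil
from pathlib import Path


def merge_review_folders(corpus_dir, merged_name="reviews"):
    """Copy every review from corpus_dir/pos and corpus_dir/neg into one folder.

    Returns the number of files copied. The original folders are left untouched.
    """
    corpus_dir = Path(corpus_dir)
    merged_dir = corpus_dir / merged_name
    merged_dir.mkdir(exist_ok=True)

    copied = 0
    for label in ("pos", "neg"):
        for src in sorted((corpus_dir / label).glob("*.txt")):
            # Prefix with the source folder name to keep file names unique.
            dest = merged_dir / f"{label}_{src.name}"
            shutil.copy2(src, dest)
            copied += 1
    return copied
```

Calling merge_review_folders on the directory where NLTK unpacked the corpus creates the reviews folder next to pos and neg. The exact location depends on your NLTK data path.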

High-level interfaces

As mentioned on the introduction page, there are two high-level interfaces: the command-line interface and the function topik.run().

Custom topic modeling flow

Analyzing the results