Introductory Tutorial

This tutorial walks through topik with a practical example: topic modeling for movie reviews.

Preparing The Movie Review Dataset

We will use the Sentiment Polarity Dataset Version 2.0 from Bo Pang and Lillian Lee.

$ mkdir doc_example
$ cd doc_example
$ curl -o review_polarity.tar.gz
$ tar -zxf review_polarity.tar.gz

The dataset was originally built for sentiment analysis, but here we’ll perform topic modeling on the movie reviews instead. For that reason, we’ll merge the two folders, pos and neg, into a single folder named reviews:

$ mkdir reviews
$ mv txt_sentoken/pos/* txt_sentoken/neg/* reviews/
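
If you prefer to script the merge (for example, on a system without a POSIX shell), the same step can be sketched in Python. The helper name merge_folders is hypothetical and not part of topik; the sketch assumes the archive was unpacked to txt_sentoken/pos and txt_sentoken/neg as in the commands above, and it refuses to overwrite a file if two reviews happen to share a name.

```python
import shutil
from pathlib import Path


def merge_folders(sources, destination):
    """Move every regular file from each source folder into destination.

    Raises FileExistsError instead of silently overwriting a file that
    shares its name with one already present in destination.
    Returns the number of files moved.
    """
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    moved = 0
    for source in sources:
        for path in sorted(Path(source).iterdir()):
            if not path.is_file():
                continue  # skip subdirectories
            target = destination / path.name
            if target.exists():
                raise FileExistsError(f"{target} already exists")
            shutil.move(str(path), str(target))
            moved += 1
    return moved


# Example call, mirroring the shell commands above:
# merge_folders(["txt_sentoken/pos", "txt_sentoken/neg"], "reviews")
```

The collision check is a design choice: a plain mv would overwrite same-named files without warning, which would quietly shrink the corpus.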

High-level interface

For quick, one-off studies, the command line interface lets you specify minimal information and obtain topic model plot output. To see all available options, run topik --help:

$ topik --help

Usage: topik [OPTIONS]

Run topic modeling

    -d, --data TEXT        Path to input data for topic modeling  [required]
    -c, --field TEXT       the content field to extract text from, or for
                            folders, the field to store text as  [required]
    -f, --format TEXT      Data format provided: json_stream, folder_files,
                            large_json, elastic
    -m, --model TEXT       Statistical topic model: lda, plsa
    -o, --output TEXT      Topic modeling output path
    -t, --tokenizer TEXT   Tokenize method to use: simple, collocations,
                            entities, mix
    -n, --ntopics INTEGER  Number of topics to find
    --termite TEXT         Whether to output a termite plot as a result
    --ldavis TEXT          Whether to output an LDAvis-type plot as a result
    --help                 Show this message and exit.

To run this on our movie reviews data set:

$ topik -d reviews -c text

The shell command is a front end to run_pipeline(), which is also accessible from Python:

>>> from import run_pipeline
>>> run_pipeline("./reviews/", content_field="text")

Custom topic modeling flow

For interactive exploration and more efficient, more involved workflows, topik also provides a Python API that exposes each part of the topic modeling workflow. There are four phases to topic modeling with topik: data import, tokenization/vectorization, modeling, and visualization. Each phase is modular, with several options available at each step.
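
To make the tokenization/vectorization phase more concrete, here is a toy, pure-Python sketch of what those two steps do conceptually. It is not topik's implementation: the function names toy_tokenize and toy_vectorize, the stop-word list, and the sample sentences are all made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real tokenizers use much larger ones.
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "of", "to", "it"}


def toy_tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def toy_vectorize(tokenized_docs):
    """Assign each distinct token an integer id and count ids per document.

    Returns (vocabulary, vectors), where vocabulary maps token -> id and
    vectors is a list of {token_id: count} dicts, one per document.
    """
    vocabulary = {}
    vectors = []
    for tokens in tokenized_docs:
        counts = Counter()
        for token in tokens:
            token_id = vocabulary.setdefault(token, len(vocabulary))
            counts[token_id] += 1
        vectors.append(dict(counts))
    return vocabulary, vectors


docs = [
    "The plot was thin and the acting was flat.",
    "A thin plot, rescued by the acting.",
]
tokenized = [toy_tokenize(d) for d in docs]
vocabulary, vectors = toy_vectorize(tokenized)
print(tokenized)
print(vocabulary)
print(vectors)
```

The resulting bag-of-words vectors are the kind of input a topic model such as LDA typically consumes; topik's own tokenizers and vectorizers offer more options than this sketch.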

A complete example workflow looks like this:

>>> from topik import read_input, tokenize, vectorize, run_model, visualize
>>> raw_data = read_input("./reviews/")
>>> content_field = "text"
>>> raw_data = ((hash(item[content_field]), item[content_field]) for item in raw_data)
>>> tokenized_corpus = tokenize(raw_data)
>>> vectorized_corpus = vectorize(tokenized_corpus)
>>> ntopics = 10
>>> model = run_model(vectorized_corpus, ntopics=ntopics)
>>> plot = visualize(model)
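
One caveat about the workflow above: it uses Python's built-in hash() to build document ids. In Python 3, string hashes are salted per process unless PYTHONHASHSEED is set, so those ids will generally differ from one run to the next. If you need ids that stay the same across runs, a content digest is one alternative. The sketch below is an assumption-laden illustration: stable_id is a hypothetical helper (not part of topik), the records are stand-ins shaped like dicts with a "text" field, and it returns an integer on the assumption that integer ids are what the downstream steps expect.

```python
import hashlib


def stable_id(text):
    """Return an integer id derived from the SHA-1 digest of the text.

    Unlike built-in hash(), the result is identical in every Python process.
    """
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return int(digest[:16], 16)  # keep the first 64 bits of the digest


# Stand-in records; in the workflow above these would come from read_input
# (assumed here to yield dicts keyed by the content field).
records = [{"text": "a gripping thriller"}, {"text": "a dull sequel"}]
content_field = "text"

# Same (id, text) pair shape as the generator expression in the workflow.
id_text_pairs = [(stable_id(r[content_field]), r[content_field]) for r in records]
```

Truncating to 64 bits keeps the ids compact while making accidental collisions very unlikely for a corpus of this size.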