Welcome to Topik’s documentation!

Topik is a Topic Modeling toolkit.

What’s a topic model?

The following three definitions are a good introduction to topic modeling:

A topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents [1].
Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. These algorithms help us develop new ways to search, browse and summarize large archives of texts [2].
Topic models provide a simple way to analyze large volumes of unlabeled text. A “topic” consists of a cluster of words that frequently occur together [3].
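The idea in the third definition, words that frequently occur together, can be illustrated with a small sketch. It is not a topic model and not part of topik: it only counts in how many documents each pair of words appears together. The corpus, the stop-word list and the function name are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy corpus, invented for illustration only.
documents = [
    "the cat chased the mouse",
    "the mouse feared the cat",
    "stocks fell as the market dropped",
    "the market rallied and stocks rose",
]

# Hypothetical stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"the", "as", "and"}


def cooccurrence_counts(docs):
    """Count the number of documents in which each unordered word pair co-occurs."""
    counts = Counter()
    for doc in docs:
        # Deduplicate and sort so each pair is counted once per document
        # and always appears in the same order.
        words = sorted(set(doc.split()) - STOP_WORDS)
        counts.update(combinations(words, 2))
    return counts


pair_counts = cooccurrence_counts(documents)

# Pairs that co-occur in at least two documents hint at shared themes.
frequent_pairs = sorted(pair for pair, n in pair_counts.items() if n >= 2)
print(frequent_pairs)
```

In this toy corpus the recurring pairs separate an "animals" theme from a "finance" theme. Statistical topic models such as LDA recover this kind of structure probabilistically rather than by raw counting.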

Yet Another Topic Modeling Library

Some of you may be wondering why the world needs yet another topic modeling library. There are already great topic modeling libraries out there (see Useful Topic Modeling Resources). In fact, topik is built on top of some of them.

The aim of topik is to provide a full-featured, high-level interface for anyone interested in applying topic modeling. For that purpose, topik includes many utilities beyond statistical modeling algorithms and wraps all of its features into a single, easily callable function and a command line interface.

Topik’s goals are the following:

Provide a simple and full-featured pipeline, from text extraction to final results analysis and interactive visualizations.
Integrate available topic modeling resources and features into one common interface, making them accessible to beginners and non-technical users.
Include data pre-processing wrappers in the pipeline.
Provide useful analysis and visualizations of topic modeling results.
Be an easy and beginner-friendly module to contribute to.

Getting Started

To demonstrate the ease of a typical topik workflow, we’ll provide two examples: one using the command line interface and one using the function topik.run.run_topic_model.

Using the command line interface

To get help you can always type topik --help.

$ topik --help

Usage: topik [OPTIONS]

  Run topic modeling

Options:
  -d, --data TEXT        Path to input data for topic modeling  [required]
  -f, --format TEXT      Data format provided: json_stream, folder_files,
                         large_json  [required]
  -m, --model TEXT       Statistical topic model: lda_batch, lda_online
  -o, --output TEXT      Topic modeling output path
  -t, --tokenizer TEXT   Tokenize method to use: simple, collocations,
                         entities, mix
  -n, --ntopics INTEGER  Number of topics to find
  --prefix_value TEXT    In 'large json' files, the prefix_value to extract
                         text from
  --event_value TEXT     In 'large json' files the event_value to extract text
                         from
  --field TEXT           In 'json stream' files, the field to extract text
                         from
  --help                 Show this message and exit.

The following example runs the default model, batch LDA (lda_batch), over a JSON stream, extracting the field text and tokenizing it with the entities tokenizer.

$ topik -d ./topik/tests/data/test-data-1.json -f json_stream -o ./test -n 3 --field text -t entities
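When the command line interface is driven from a script, it can help to assemble the argument list programmatically. The sketch below uses only the flags listed in the help output above. The helper name build_topik_command and its parameter names are hypothetical and not part of topik.

```python
import shlex


def build_topik_command(data, fmt, model=None, output=None,
                        tokenizer=None, ntopics=None, field=None):
    """Assemble a `topik` argument list from the options shown in `topik --help`.

    Hypothetical helper: it only builds the list and does not run anything.
    """
    # -d and -f are marked [required] in the help output.
    args = ["topik", "-d", data, "-f", fmt]
    optional = [
        ("-m", model),
        ("-o", output),
        ("-t", tokenizer),
        ("-n", ntopics),
        ("--field", field),
    ]
    for flag, value in optional:
        if value is not None:
            args += [flag, str(value)]
    return args


command = build_topik_command(
    data="./topik/tests/data/test-data-1.json",
    fmt="json_stream",
    output="./test",
    tokenizer="entities",
    ntopics=3,
    field="text",
)

# Print the command as it would be typed in a shell.
print(shlex.join(command))
```

With topik installed, the resulting list could be passed to subprocess.run(command, check=True). Passing a list rather than a single string avoids shell quoting problems with paths.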

Using topik.run.run_topic_model

The equivalent of the previous example using run_topic_model would be:

>>> from topik.run import run_topic_model
>>> run_topic_model(data='./topik/tests/data/test-data-1.json', format='json_stream',
...                 n_topics=3, field='text', dir_path='./topic_model')
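To try the call above on your own text, a small input file can be generated first. The sketch below assumes that the json_stream format means one JSON object per line, with the text stored under the field named by the field argument. That layout is an assumption and should be checked against the bundled test data. The helper write_json_stream and the sample texts are hypothetical.

```python
import json
import os
import tempfile


def write_json_stream(path, texts, field="text"):
    """Write one JSON object per line, storing each text under `field`.

    Assumption: this line-delimited layout is what `json_stream` expects.
    """
    with open(path, "w", encoding="utf-8") as handle:
        for text in texts:
            handle.write(json.dumps({field: text}) + "\n")
    return path


# Sample texts invented for illustration.
sample_texts = [
    "Topic models uncover thematic structure in document collections.",
    "A topic is a cluster of words that frequently occur together.",
    "Interactive visualizations help interpret the results.",
]

data_path = write_json_stream(
    os.path.join(tempfile.mkdtemp(), "sample.json"), sample_texts
)
print(data_path)

# With topik installed, the generated file could be passed to the call above:
# from topik.run import run_topic_model
# run_topic_model(data=data_path, format='json_stream', n_topics=3,
#                 field='text', dir_path='./topic_model')
```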

To understand topik’s output and how to interpret the results, see Topik Output.

Contents

User Guide

Developer Guide

Reference Guide

Useful Topic Modeling Resources

Topic modeling, David M. Blei

Python libraries

License Agreement

topik is distributed under the BSD 3-Clause license.

Footnotes

[1] http://en.wikipedia.org/wiki/Topic_model

[2] http://www.cs.princeton.edu/~blei/topicmodeling.html

[3] http://mallet.cs.umass.edu/topics.php