Tokenizing and Vectorizing

The next step in topic modeling is to break your documents up into individual terms. This is called tokenization. Tokenization is done using the tokenize() method on a Corpus object (returned from read_input()):

>>> tokenized_corpus = raw_data.tokenize()

Note on tokenize output

The tokenize() method returns a new object, presently of the DigestedDocumentCollection type. Behind the scenes, tokenize() stores the tokenized text alongside your corpus, using whatever storage backend you have configured; this is an in-place modification of the corpus object. The new object serves two purposes:

  • It iterates over the particular tokenized representation of your corpus. You may have multiple tokenizations associated with a single corpus. The object returned from the tokenize function tracks the correct one.
  • It performs vectorization on the fly, counting the words in each document and returning each document as a bag of words: a list of (word_id, word_count) tuples, as sketched below. This is generally the desired input to any topic model.

Make sure you assign this new object to a new variable. It is what you want to feed into the topic modeling step.
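
To make the bag-of-words representation concrete, here is a minimal sketch of peeking at the first vectorized document. The word IDs and counts are purely illustrative, and assume the returned collection can be iterated as described above:

>>> # Peek at the first vectorized document (IDs and counts are illustrative)
>>> next(iter(tokenized_corpus))
[(0, 2), (1, 1), (5, 3)]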

Available methods

The tokenize() method accepts a few arguments that select a tokenization method and control its behavior. The available methods are registered in the tokenizer_methods dictionary (see the snippet after the list below). The presently available methods are:

  • “simple”: (default) lowercases input text and extracts single words. Uses Gensim.
  • “collocation”: Collects bigrams and trigrams in addition to single words. Uses NLTK.
  • “entities”: Extracts noun phrases as entities. Uses TextBlob.
  • “mixed”: first extracts noun phrases as entities, then follows up with simple tokenization for single words. Uses TextBlob.
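
As a quick check, you can inspect this dictionary directly. The import path for tokenizer_methods below is an assumption, mirroring the other tokenizer imports shown later in this section:

>>> from topik.tokenizers import tokenizer_methods  # import path is an assumption
>>> sorted(tokenizer_methods.keys())
['collocation', 'entities', 'mixed', 'simple']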

All methods accept a stopwords keyword argument: words to ignore during tokenization because they add little content value, such as prepositions. The default, None, loads and uses gensim’s STOPWORDS collection.
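
To supply your own stopwords, a minimal sketch looks like the following. The exact container type accepted (set versus list) is an assumption, and the word list is only illustrative:

>>> # Custom stopwords (illustrative words; accepted container type is an assumption)
>>> my_stopwords = {"the", "and", "of", "for"}
>>> tokenized_corpus = raw_data.tokenize(stopwords=my_stopwords)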

Collocation tokenization

Collocation tokenization collects word pairs and triplets (bigrams and trigrams) that occur together frequently throughout your collection of documents. Tokenization with collocation is a two-step process: first establish the bigram and trigram patterns, then tokenize each document individually.

To obtain the bigram and trigram patterns, use the collect_bigrams_and_trigrams() function:

>>> from topik.tokenizers import collect_bigrams_and_trigrams
>>> patterns = collect_bigrams_and_trigrams(corpus)

Parameterization is done at this step, prior to tokenization of the corpus. Tweakable parameters are:

  • top_n: limit results to a maximum number
  • min_length: the minimum length that any single word can be
  • min_bigram_freq: the minimum number of times a pair of words must occur together to be included
  • min_trigram_freq: the minimum number of times a triplet of words must occur together to be included

For example:

>>> patterns = collect_bigrams_and_trigrams(corpus, min_length=3, min_bigram_freq=3, min_trigram_freq=3)

For small bodies of text, you’ll need small frequency values, but the results may be correspondingly noisy.
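
For instance, a sketch for a small corpus might lower the thresholds like this (the specific values are illustrative, not recommendations):

>>> # Lower frequency thresholds for a small corpus (illustrative values)
>>> patterns = collect_bigrams_and_trigrams(corpus, min_bigram_freq=2, min_trigram_freq=2)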

Next, feed the patterns into the tokenize() method of your corpus object:

>>> tokenized_corpus = raw_data.tokenize(method="collocation", patterns=patterns)

Entities tokenization

Here, “entities” means noun phrases, as extracted by the TextBlob library. Like collocation tokenization, entities tokenization is a two-step process. First, you establish the noun phrases using the collect_entities() function:

>>> from topik.tokenizers import collect_entities
>>> entities = collect_entities(corpus)

You can tweak noun phrase extraction with a minimum and maximum occurrence frequency. This is the frequency across your entire corpus of documents.

>>> entities = collect_entities(corpus, freq_min=4, freq_max=10000)

Next, tokenize the document collection:

>>> tokenized_corpus = raw_data.tokenize(method="entities", entities=entities)

Mixed tokenization

Mixed tokenization employs both the entities tokenizer and the simple tokenizer. It is useful when the entities tokenizer alone is overly restrictive, or when words are interesting both together and apart. Usage is similar to the entities tokenizer:

>>> from topik.tokenizers import collect_entities
>>> entities = collect_entities(corpus)
>>> tokenized_corpus = raw_data.tokenize(method="mixed", entities=entities)