The next step in topic modeling is to break your documents up into individual
terms. This is called tokenization. Tokenization is done using either the
tokenize() dispatcher function on a Corpus iterator (returned from
or using one of the tokenizer functions directly:
>>> tokenized_corpus = tokenize(raw_data)
The tokenize function accepts a few arguments to select a tokenization method and
control its behavior. The available tokenization methods are listed in the
topik.tokenizers.registered_tokenizers dictionary. The presently available
methods are:
- “simple”: (default) lowercases input text and extracts single words. Uses Gensim.
- “ngrams”: Collects bigrams and trigrams in addition to single words. Uses NLTK.
- “entities”: Extracts noun phrases as entities. Uses TextBlob.
- “mixed”: first extracts noun phrases as entities, then follows up with simple tokenization for single words. Uses TextBlob.
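The idea of a dispatcher that looks up a tokenizer by name in a dictionary can be sketched in plain Python. This is only an illustration of the pattern, not topik's implementation: the names `registered`, `register`, and `dispatch_tokenize`, and the regex-based `simple` tokenizer, are all hypothetical stand-ins.

```python
import re

# Hypothetical registry, standing in for a dictionary such as
# topik.tokenizers.registered_tokenizers; this is not topik's code.
registered = {}


def register(name):
    """Decorator that stores a tokenizer function under `name`."""
    def wrap(func):
        registered[name] = func
        return func
    return wrap


@register("simple")
def simple(text, min_length=1):
    # Lowercase the text and keep alphabetic runs of at least min_length.
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= min_length]


def dispatch_tokenize(corpus, method="simple", **kwargs):
    """Look up `method` in the registry and apply it to each (id, text) pair."""
    tokenizer = registered[method]
    for doc_id, text in corpus:
        yield doc_id, tokenizer(text, **kwargs)


docs = [("doc1", "Topic models find THEMES in text.")]
result = list(dispatch_tokenize(docs, method="simple", min_length=3))
```

Extra keyword arguments are passed straight through to the selected tokenizer, which is one way a single entry point can serve methods with different options.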
All methods accept a keyword argument, stopwords, which lists words that will be
ignored in tokenization. These are words that add little content value, such as
prepositions. The default, None, loads and uses gensim’s STOPWORDS.
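As a rough sketch of what stopword filtering does, the following uses only the standard library. The function name `tokenize_without_stopwords` and the tiny `DEMO_STOPWORDS` set are assumptions made for illustration; a real run would rely on a much larger list, such as the gensim one mentioned above.

```python
import re

# Tiny illustrative stopword set (hypothetical); real stopword lists
# contain several hundred entries.
DEMO_STOPWORDS = frozenset({"the", "of", "in", "a", "and", "to"})


def tokenize_without_stopwords(text, stopwords=None):
    """Lowercase, split into words, and drop any word found in `stopwords`."""
    if stopwords is None:
        # Mirror the "None means use a default list" convention.
        stopwords = DEMO_STOPWORDS
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in stopwords]


tokens = tokenize_without_stopwords("The structure of topics in a corpus")
custom = tokenize_without_stopwords("The structure of topics",
                                    stopwords={"structure"})
```

Passing a custom collection replaces the default list entirely in this sketch, so common words such as "the" come back unless you include them yourself.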
Collocation tokenization collects phrases of two or three words (bigrams and trigrams) that often occur together throughout your collection of documents.
To obtain the bigram and trigram patterns, use the ngrams function:
>>> from topik.tokenizers import ngrams
>>> tokens = ngrams(corpus, freq_bounds=[(5, 10000), (3, 10000)])
Tweakable parameters are:
- top_n: limit results to a maximum number
- min_length: the minimum length that any single word can be
- freq_bounds: list of (min_freq, max_freq) tuples. min_freq is the minimum number of times that a phrase must occur before being considered, and max_freq is the corresponding upper limit. The first entry in this list applies to bigrams and the second to trigrams. Presently, only bigrams and trigrams are supported.
For small bodies of text, you’ll need small freq values, but the resulting phrases may be correspondingly “noisy.”
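To make the effect of frequency bounds concrete, here is a minimal sketch that counts adjacent word pairs across a toy corpus and keeps those within a (min_freq, max_freq) range. It is not how topik or NLTK compute collocations; `frequent_bigrams` is a hypothetical helper, and treating both bounds as inclusive is an assumption.

```python
from collections import Counter


def frequent_bigrams(tokenized_docs, min_freq, max_freq):
    """Return adjacent word pairs whose corpus-wide count is in [min_freq, max_freq]."""
    counts = Counter()
    for tokens in tokenized_docs:
        # zip(tokens, tokens[1:]) yields each pair of neighbouring words.
        counts.update(zip(tokens, tokens[1:]))
    # Bounds are treated as inclusive here (an assumption of this sketch).
    return {pair for pair, n in counts.items() if min_freq <= n <= max_freq}


docs = [
    ["new", "york", "city", "council"],
    ["new", "york", "times"],
    ["york", "city", "new", "york"],
]
strict = frequent_bigrams(docs, min_freq=3, max_freq=10000)
loose = frequent_bigrams(docs, min_freq=1, max_freq=10000)
```

With min_freq=3 only ("new", "york") survives, while min_freq=1 admits every pair that appears at all, including incidental ones such as ("city", "new"). That is the kind of noise a low threshold lets through on a small corpus.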
You can tweak noun phrase extraction with a minimum and maximum occurrence frequency. This is the frequency across your entire corpus of documents.
>>> from topik.tokenizers import entities
>>> tokens = entities(corpus, min_length=1, freq_min=4, freq_max=10000)
mixed() tokenization employs both the entities tokenizer and the simple tokenizer,
for when the entities tokenizer is overly restrictive, or for when words are
interesting both together and apart. Usage is similar to the entities tokenizer:
>>> from topik.tokenizers import mixed
>>> tokens = mixed(corpus, min_length=1, freq_min=4, freq_max=10000)
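The two-pass idea behind mixed tokenization (phrases first, single words afterwards) can be sketched without any NLP library. Here a fixed set of known phrases stands in for the noun phrases that an extractor such as TextBlob would find. The function `mixed_sketch`, the underscore joining, and the choice to remove phrase words from the remaining text are all assumptions of this illustration and may differ from what topik does.

```python
import re


def mixed_sketch(text, phrases):
    """Pull out known multi-word phrases first, then tokenize what remains."""
    remaining = text.lower()
    tokens = []
    # Longest phrases first, so a longer phrase wins over any sub-phrase.
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        found = len(re.findall(pattern, remaining))
        if found:
            # Join with underscores so the phrase stays a single token.
            tokens.extend([phrase.replace(" ", "_")] * found)
            remaining = re.sub(pattern, " ", remaining)
    # Simple tokenization of whatever text is left over.
    tokens.extend(re.findall(r"[a-z]+", remaining))
    return tokens


tokens = mixed_sketch(
    "Topic modeling helps; topic modeling of machine learning papers",
    phrases={"topic modeling", "machine learning"},
)
```

Note that the result groups phrase tokens before single-word tokens rather than preserving document order, which is usually acceptable for bag-of-words topic models.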