topik.tokenizers package

Submodules

topik.tokenizers.entities module

topik.tokenizers.entities.entities(corpus, min_length=1, freq_min=2, freq_max=10000, stopwords=None)[source]

A tokenizer that extracts noun phrases from a corpus, then tokenizes all documents using those extracted phrases.

Parameters:

corpus : iterable of tuple of (doc_id(str/int), doc_text(str))

A collection of documents to be tokenized

min_length : int

Minimum length of any single word

freq_min : int

Minimum occurrence of phrase in order to be considered

freq_max : int

Maximum occurrence of phrase, beyond which it is ignored

stopwords : None or iterable of str

Collection of words to ignore as tokens

Examples

>>> tokenized_corpora = entities(sample_corpus)
>>> next(tokenized_corpora) == ('doc1',
...     [u'frank', u'swank_tank', u'prancercise', u'sassy_unicorns'])
True
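
The doctest above assumes a pre-built sample_corpus fixture. As a rough standalone sketch (the document texts, stopword list, and lowered freq_min below are illustrative assumptions, not part of that fixture), the tokenizer can be driven over any iterable of (doc_id, doc_text) pairs:

>>> corpus = [("doc1", "Frank the Swank-Tank walked his sassy unicorn to prancercise class."),
...           ("doc2", "Prancercise is a popular pastime of sassy unicorns and retirees.")]
>>> # freq_min is lowered so phrases from a tiny corpus still qualify as entities
>>> tokenized_corpora = entities(corpus, freq_min=1, stopwords=["the", "a", "of"])
>>> for doc_id, tokens in tokenized_corpora:  # doctest: +SKIP
...     print(doc_id, tokens)
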
topik.tokenizers.entities.mixed(corpus, min_length=1, freq_min=2, freq_max=10000, stopwords=None)[source]

A text tokenizer that first retrieves entities (‘noun phrases’), then tokenizes the remaining text as simple words.

Parameters:

corpus : iterable of tuple of (doc_id(str/int), doc_text(str))

A collection of documents to be tokenized

min_length : int

Minimum length of any single word

freq_min : int

Minimum occurrence of phrase in order to be considered

freq_max : int

Maximum occurrence of phrase, beyond which it is ignored

stopwords : None or iterable of str

Collection of words to ignore as tokens

Examples

>>> tokenized_corpora = mixed(sample_corpus)
>>> doc_id, tokens = next(tokenized_corpora)
>>> doc_id == 'doc1'
True
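
As a minimal sketch of how mixed differs from entities (the corpus, stopword list, and freq_min value below are illustrative assumptions, and the outputs are not verified here), the two tokenizers can be run side by side; mixed keeps extracted noun phrases as single tokens and breaks the remaining text into plain words:

>>> corpus = [("doc1", "Frank the Swank-Tank walked his sassy unicorn to prancercise class."),
...           ("doc2", "Prancercise is a popular pastime of sassy unicorns and retirees.")]
>>> mixed_tokens = dict(mixed(corpus, freq_min=1, stopwords=["the", "a", "of"]))
>>> entity_tokens = dict(entities(corpus, freq_min=1, stopwords=["the", "a", "of"]))
>>> # single words that mixed adds on top of the entity phrases
>>> sorted(set(mixed_tokens["doc1"]) - set(entity_tokens["doc1"]))  # doctest: +SKIP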

topik.tokenizers.ngrams module

topik.tokenizers.ngrams.ngrams(raw_corpus, min_length=1, freq_bounds=None, top_n=10000, stopwords=None)[source]

A tokenizer that extracts collocations (bigrams and trigrams) from a corpus according to the frequency bounds, then tokenizes all documents using those extracted phrases.

Parameters:

raw_corpus : iterable of tuple of (doc_id(str/int), doc_text(str))

Body of documents to examine

min_length : int

Minimum length of any single word

freq_bounds : list of tuples of ints

Currently ngrams supports bigrams and trigrams, so this list should contain two tuples (the first for bigrams, the second for trigrams), where each tuple consists of a (minimum, maximum) corpus-wide frequency.

top_n : int

Limit results to this many entries

stopwords : None or iterable of str

Collection of words to ignore as tokens

Examples

>>> tokenized_corpora = ngrams(sample_corpus, freq_bounds=[(2,100),(2,100)])
>>> next(tokenized_corpora) == ('doc1',
...     [u'frank_swank', u'tank', u'walked', u'sassy', u'unicorn', u'brony',
...     u'prancercise', u'class', u'daily', u'prancercise', u'tremendously',
...     u'popular', u'pastime', u'sassy_unicorns', u'retirees', u'alike'])
True
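
As a small sketch of tuning freq_bounds (the raw_corpus and bounds below are illustrative assumptions): the first (min, max) pair governs which bigrams are kept and the second governs trigrams, so a bigram such as "sassy unicorns" must appear at least twice and at most 100 times corpus-wide to become a token:

>>> raw_corpus = [("doc1", "sassy unicorns love prancercise class"),
...               ("doc2", "sassy unicorns attend prancercise class daily")]
>>> tokenized_corpora = ngrams(raw_corpus, min_length=2,
...                            freq_bounds=[(2, 100), (2, 100)], top_n=1000)
>>> for doc_id, tokens in tokenized_corpora:  # doctest: +SKIP
...     print(doc_id, tokens)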

topik.tokenizers.simple module

topik.tokenizers.simple.simple(raw_corpus, min_length=1, stopwords=None)[source]

A text tokenizer that simply lowercases text, keeps only alphabetic words, and removes stopwords.

Parameters:

raw_corpus : iterable of tuple of (doc_id(str/int), doc_text(str))

Body of documents to examine

min_length : int

Minimum length of any single word

stopwords : None or iterable of str

Collection of words to ignore as tokens

Examples

>>> sample_corpus = [("doc1", "frank FRANK the frank dog cat"),
...               ("doc2", "frank a dog of the llama")]
>>> tokenized_corpora = simple(sample_corpus)
>>> next(tokenized_corpora) == ("doc1",
... ["frank", "frank", "frank", "dog", "cat"])
True
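
As a follow-up sketch on the same sample_corpus (the parameter values below are illustrative assumptions, and the output is not asserted here), min_length and stopwords can be combined to prune the token stream further:

>>> # drop tokens shorter than min_length and any word in the supplied stopword list
>>> tokenized_corpora = simple(sample_corpus, min_length=3, stopwords=["frank", "the"])
>>> for doc_id, tokens in tokenized_corpora:  # doctest: +SKIP
...     print(doc_id, tokens)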

Module contents