topik.tokenizers package
Subpackages
Submodules
topik.tokenizers.entities module
topik.tokenizers.entities.entities(corpus, min_length=1, freq_min=2, freq_max=10000, stopwords=None)[source]
A tokenizer that extracts noun phrases from a corpus, then tokenizes all documents using those extracted phrases.
Parameters: corpus : iterable of str
A collection of text to be tokenized
min_length : int
Minimum length of any single word
freq_min : int
Minimum occurrence of phrase in order to be considered
freq_max : int
Maximum occurrence of phrase, beyond which it is ignored
stopwords : None or iterable of str
Collection of words to ignore as tokens
Examples
>>> tokenized_corpora = entities(sample_corpus)
>>> next(tokenized_corpora) == ('doc1',
...     [u'frank', u'swank_tank', u'prancercise', u'sassy_unicorns'])
True
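The sample_corpus used in this example is not defined on this page. The sketch below is a plausible stand-in, assuming the tokenizer accepts an iterable of (doc_id, doc_text) pairs as in the ngrams and simple examples further down; the document wording is approximated from the tokens shown in the outputs, not the exact corpus behind the doctests.

from topik.tokenizers.entities import entities

# Hypothetical corpus: (doc_id, doc_text) pairs whose wording roughly
# matches the tokens shown in the doctest outputs on this page.
sample_corpus = [
    ("doc1", u"Frank the Swank-Tank walked his sassy unicorn, Brony,"
             u" to prancercise class daily. Prancercise was a tremendously"
             u" popular pastime of sassy unicorns and retirees alike."),
    ("doc2", u"Frank the Swank-Tank was a prancercise enthusiast."),
]

# entities() returns a lazy generator of (doc_id, token_list) pairs, so
# it can be iterated once or advanced with next().
tokenized_corpora = entities(sample_corpus)
for doc_id, tokens in tokenized_corpora:
    print(doc_id, tokens)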
topik.tokenizers.entities.mixed(corpus, min_length=1, freq_min=2, freq_max=10000, stopwords=None)[source]
A text tokenizer that retrieves entities (‘noun phrases’) first and simple words for the rest of the text.
Parameters: corpus : iterable of str
A collection of text to be tokenized
min_length : int
Minimum length of any single word
freq_min : int
Minimum occurrence of phrase in order to be considered
freq_max : int
Maximum occurrence of phrase, beyond which it is ignored
stopwords : None or iterable of str
Collection of words to ignore as tokens
Examples
>>> tokenized_corpora = entities(sample_corpus)
>>> next(tokenized_corpora) == ('doc1',
...     [u'frank', u'swank_tank', u'prancercise', u'sassy_unicorns'])
True
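The example above exercises the shared sample_corpus with entities(); the sketch below calls mixed itself, reusing the hypothetical corpus from the earlier sketch. No output is asserted because the exact tokens depend on the real corpus.

from topik.tokenizers.entities import mixed

# Per the description above, mixed() keeps the extracted noun phrases and
# also emits plain single-word tokens for the rest of each document, so
# its token lists are typically longer than those from entities().
tokenized_corpora = mixed(sample_corpus)
doc_id, tokens = next(tokenized_corpora)
print(doc_id, tokens)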
topik.tokenizers.ngrams module
topik.tokenizers.ngrams.ngrams(raw_corpus, min_length=1, freq_bounds=None, top_n=10000, stopwords=None)[source]
A tokenizer that extracts collocations (bigrams and trigrams) from a corpus according to the frequency bounds, then tokenizes all documents using those extracted phrases.
Parameters: raw_corpus : iterable of tuple of (doc_id(str/int), doc_text(str))
Body of documents to examine
min_length : int
Minimum length of any single word
freq_bounds : list of tuples of ints
Currently ngrams supports bigrams and trigrams, so this list should contain two tuples (the first for bigrams, the second for trigrams), where each tuple consists of a (minimum, maximum) corpus-wide frequency.
top_n : int
limit results to this many entries
stopwords : None or iterable of str
Collection of words to ignore as tokens
Examples
>>> tokenized_corpora = ngrams(sample_corpus, freq_bounds=[(2,100),(2,100)])
>>> next(tokenized_corpora) == ('doc1',
...     [u'frank_swank', u'tank', u'walked', u'sassy', u'unicorn', u'brony',
...      u'prancercise', u'class', u'daily', u'prancercise', u'tremendously',
...      u'popular', u'pastime', u'sassy_unicorns', u'retirees', u'alike'])
True
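A short sketch unpacking the freq_bounds argument used in the doctest; it reuses the hypothetical sample_corpus from the earlier sketch, and the keyword values shown are simply the documented defaults made explicit.

from topik.tokenizers.ngrams import ngrams

# freq_bounds is a two-item list: a (minimum, maximum) corpus-wide
# frequency for bigrams, then the same for trigrams.  Collocations whose
# frequency falls outside these bounds are not merged into single tokens.
freq_bounds = [(2, 100),   # bigrams must occur between 2 and 100 times
               (2, 100)]   # trigrams must occur between 2 and 100 times

tokenized_corpora = ngrams(sample_corpus, min_length=1,
                           freq_bounds=freq_bounds, top_n=10000)
doc_id, tokens = next(tokenized_corpora)
print(doc_id, tokens)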
topik.tokenizers.simple module
topik.tokenizers.simple.simple(raw_corpus, min_length=1, stopwords=None)[source]
A text tokenizer that simply lowercases, matches alphabetic characters and removes stopwords.
Parameters: raw_corpus : iterable of tuple of (doc_id(str/int), doc_text(str))
Body of documents to examine
min_length : int
Minimum length of any single word
stopwords : None or iterable of str
Collection of words to ignore as tokens
Examples
>>> sample_corpus = [("doc1", "frank FRANK the frank dog cat"),
...                  ("doc2", "frank a dog of the llama")]
>>> tokenized_corpora = simple(sample_corpus)
>>> next(tokenized_corpora) == ("doc1",
...     ["frank", "frank", "frank", "dog", "cat"])
True
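A small follow-on sketch showing the min_length and stopwords parameters documented above; the stopword choice and length threshold here are illustrative only.

from topik.tokenizers.simple import simple

sample_corpus = [("doc1", "frank FRANK the frank dog cat"),
                 ("doc2", "frank a dog of the llama")]

# Ignore tokens shorter than four characters and treat "llama" as a
# stopword; stopwords may be any iterable of strings.
tokenized_corpora = simple(sample_corpus, min_length=4, stopwords={"llama"})
for doc_id, tokens in tokenized_corpora:
    print(doc_id, tokens)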