topik package¶
Subpackages¶
Submodules¶
topik.cli module¶
topik.readers module¶
topik.readers.read_input(source, content_field, source_type='auto', output_type='dictionary', output_args=None, synchronous_wait=0, **kwargs)[source]¶
Read data from given source into Topik’s internal data structures.
Parameters: source : str
Input data. Can be a file path, directory, or server address.
content_field : str
The field that contains the text to be analyzed. A hash of this field is used as the document id.
source_type : str
“auto” attempts to detect the data type of source; the type can also be specified manually. Options for manual specification are [‘solr’, ‘elastic’, ‘json_stream’, ‘large_json’, ‘folder’].
output_type : str
Internal format for handling user data. Current options are listed in the registered_outputs dictionary. Default is the DictionaryCorpus class; specify alternatives using their string key from that dictionary.
output_args : dict
Configuration to pass through to output
synchronous_wait : positive, real number
Time in seconds to wait for data to finish uploading to the output (this happens asynchronously). Only relevant for some output types (“elastic”, not “dictionary”).
kwargs : any other arguments to pass to input parsers
Returns: iterable output object
Examples
>>> raw_data = read_input(
...     '{}/test_data_json_stream.json'.format(test_data_path),
...     content_field="abstract")
>>> id, text = next(iter(raw_data))
>>> text == (
...     u'Transition metal oxides are being considered as the next generation '+
...     u'materials in field such as electronics and advanced catalysts; '+
...     u'between them is Tantalum (V) Oxide; however, there are few reports '+
...     u'for the synthesis of this material at the nanometer size which could '+
...     u'have unusual properties. Hence, in this work we present the '+
...     u'synthesis of Ta2O5 nanorods by sol gel method using DNA as structure '+
...     u'directing agent, the size of the nanorods was of the order of 40 to '+
...     u'100 nm in diameter and several microns in length; this easy method '+
...     u'can be useful in the preparation of nanomaterials for electronics, '+
...     u'biomedical applications as well as catalysts.')
True
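Manual source_type specification, as a hedged sketch (the directory path and content_field value here are hypothetical placeholders):
>>> # Hypothetical: force the 'folder' parser on a directory of text files.
>>> folder_data = read_input('./my_text_files', content_field="text",
...                          source_type='folder')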
topik.run module¶
topik.run.run_model(data_source, source_type='auto', year_field=None, start_year=None, stop_year=None, content_field=None, tokenizer='simple', n_topics=10, dir_path='./topic_model', model='LDA', termite_plot=True, output_file=False, ldavis=False, seed=42, **kwargs)[source]¶
Run your data through all topik functionality and save all results to a specified directory.
Parameters: data_source : str
Input data (e.g. file or folder or solr/elasticsearch instance).
source_type : {‘json_stream’, ‘folder_files’, ‘json_large’, ‘solr’, ‘elastic’}
The format of your data input: currently a JSON stream or a folder containing text files. Default is ‘auto’, which attempts to detect the format.
year_field : str
The field name (if any) that contains the year associated with each document (for filtering).
start_year : int
Beginning of the range filter on year_field values.
stop_year : int
End of the range filter on year_field values.
content_field : str
The primary text field to parse.
tokenizer : {‘simple’, ‘collocations’, ‘entities’, ‘mixed’}
The type of tokenizer to use. Default is ‘simple’.
n_topics : int
Number of topics to find in your data
dir_path : str
Directory path to store all topic modeling results files. Default is ./topic_model.
model : {‘LDA’, ‘PLSA’}.
Statistical modeling algorithm to use. Default ‘LDA’.
termite_plot : bool
Generate termite plot of your model if True. Default is True.
output_file : bool
Generate a final summary CSV file of your results, containing for each document: text, tokens, lda_probabilities, and topic. Default is False.
ldavis : bool
Generate an interactive data visualization of your topics. Default is False.
seed : int
Set random number generator to seed, to be able to reproduce results. Default 42.
**kwargs : additional keyword arguments, passed through to each individual step
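Examples
As a minimal sketch (the input path reuses the test_data_path fixture from the other examples; the output directory and all other argument values are placeholders echoing the documented defaults):
>>> from topik.run import run_model
>>> # Sketch only: input path and output directory are placeholders.
>>> run_model('{}/test_data_json_stream.json'.format(test_data_path),
...           content_field="abstract", n_topics=10,
...           dir_path='./topic_model', seed=42)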
topik.tokenizers module¶
topik.tokenizers.collect_bigrams_and_trigrams(collection, top_n=10000, min_length=1, min_bigram_freq=50, min_trigram_freq=20, stopwords=None)[source]¶
Collect bigrams and trigrams from a collection of documents, for input to the collocation tokenizer.
Bigrams are pairs of words that recur in the collection; trigrams are triplets.
Parameters: collection : iterable of str
body of documents to examine
top_n : int
limit results to this many entries
min_length : int
Minimum length of any single word
min_bigram_freq : int
Minimum frequency for a pair of words to be considered a recognized bigram
min_trigram_freq : int
Minimum frequency for a triplet of words to be considered a recognized trigram
stopwords : None or iterable of str
Collection of words to ignore as tokens
Examples
>>> from topik.readers import read_input
>>> raw_data = read_input(
...     '{}/test_data_json_stream.json'.format(test_data_path),
...     content_field="abstract")
>>> bigrams, trigrams = collect_bigrams_and_trigrams(raw_data, min_bigram_freq=5, min_trigram_freq=3)
>>> bigrams.pattern
u'(free standing|ac electrodeposition|centered cubic|spatial resolution|vapor deposition|wear resistance|plastic deformation|electrical conductivity|field magnets|v o|transmission electron|x ray|et al|ray diffraction|electron microscopy|room temperature|diffraction xrd|electron microscope|results indicate|scanning electron|m s|doped zno|microscopy tem|polymer matrix|size distribution|mechanical properties|grain size|diameters nm|high spatial|particle size|high resolution|ni al|diameter nm|range nm|high field|high strength|c c)'
>>> trigrams.pattern
u'(differential scanning calorimetry|face centered cubic|ray microanalysis analytical|physical vapor deposition|transmission electron microscopy|x ray diffraction|microanalysis analytical electron|chemical vapor deposition|high aspect ratio|analytical electron microscope|ray diffraction xrd|x ray microanalysis|high spatial resolution|high field magnets|atomic force microscopy|electron microscopy tem|narrow size distribution|scanning electron microscopy|building high field|silicon oxide nanowires|particle size nm)'
topik.tokenizers.collect_entities(collection, freq_min=2, freq_max=10000)[source]¶
Return noun phrases from a collection of documents.
Parameters: collection : Corpus-base derived object or iterable collection of raw text
freq_min : int
Minimum frequency with which a noun phrase must occur in order to be retrieved. Default is 2.
freq_max : int
Maximum frequency with which a noun phrase may occur in order to be retrieved. Default is 10000.
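Examples
A minimal sketch, reusing the test_data_json_stream.json fixture from the other examples (the entity count matches the tokenize_entities example below):
>>> from topik.readers import read_input
>>> raw_data = read_input('{}/test_data_json_stream.json'.format(test_data_path), content_field="abstract")
>>> entities = collect_entities(raw_data)
>>> len(entities)
220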
topik.tokenizers.tokenize_collocation(text, patterns, min_length=1, stopwords=None)[source]¶
A text tokenizer that includes collocations (bigrams and trigrams).
A collocation is a sequence of words or terms that co-occur more often than would be expected by chance. This function breaks a raw document up into tokens based on a pre-established collection of bigrams and trigrams. This collection is derived from a body of many documents, and must be obtained in a prior step using the collect_bigrams_and_trigrams function.
Uses nltk.collocations.TrigramCollocationFinder to find trigrams and bigrams.
Parameters: text : str
A single document’s text to be tokenized
patterns : tuple of compiled regex objects used to find n-grams
Obtained from the collect_bigrams_and_trigrams function
min_length : int
Minimum length of any single word
stopwords : None or iterable of str
Collection of words to ignore as tokens
Examples
>>> from topik.readers import read_input
>>> id_documents = read_input('{}/test_data_json_stream.json'.format(test_data_path), content_field="abstract")
>>> patterns = collect_bigrams_and_trigrams(id_documents, min_bigram_freq=2, min_trigram_freq=2)
>>> id, doc_text = next(iter(id_documents))
>>> tokenized_text = tokenize_collocation(doc_text, patterns)
>>> tokenized_text
[u'transition_metal', u'oxides', u'considered', u'generation', u'materials', u'field', u'electronics', u'advanced', u'catalysts', u'tantalum', u'v_oxide', u'reports', u'synthesis_material', u'nanometer_size', u'unusual', u'properties', u'work_present', u'synthesis', u'ta', u'o', u'nanorods', u'sol', u'gel', u'method', u'dna', u'structure', u'directing', u'agent', u'size', u'nanorods', u'order', u'nm_diameter', u'microns', u'length', u'easy', u'method', u'useful', u'preparation', u'nanomaterials', u'electronics', u'biomedical', u'applications', u'catalysts']
topik.tokenizers.tokenize_entities(text, entities, min_length=1, stopwords=None)[source]¶
A tokenizer that extracts noun phrases from text.
Requires that you first establish entities using the collect_entities function.
Parameters: text : str
A single document’s text to be tokenized
entities : iterable of str
Collection of noun phrases, obtained from collect_entities function
min_length : int
Minimum length of any single word
stopwords : None or iterable of str
Collection of words to ignore as tokens
Examples
>>> from topik.readers import read_input
>>> id_documents = read_input('{}/test_data_json_stream.json'.format(test_data_path), "abstract")
>>> entities = collect_entities(id_documents)
>>> len(entities)
220
>>> i = iter(id_documents)
>>> _, doc_text = next(i)
>>> doc_text
u'Transition metal oxides are being considered as the next generation materials in field such as electronics and advanced catalysts; between them is Tantalum (V) Oxide; however, there are few reports for the synthesis of this material at the nanometer size which could have unusual properties. Hence, in this work we present the synthesis of Ta2O5 nanorods by sol gel method using DNA as structure directing agent, the size of the nanorods was of the order of 40 to 100 nm in diameter and several microns in length; this easy method can be useful in the preparation of nanomaterials for electronics, biomedical applications as well as catalysts.'
>>> tokenized_text = tokenize_entities(doc_text, entities)
>>> tokenized_text
[u'transition']
topik.tokenizers.tokenize_mixed(text, entities, min_length=1, stopwords=None)[source]¶
A text tokenizer that first retrieves entities (‘noun phrases’) and then falls back to simple words for the rest of the text.
Parameters: text : str
A single document’s text to be tokenized
entities : iterable of str
Collection of noun phrases, obtained from collect_entities function
min_length : int
Minimum length of any single word
stopwords : None or iterable of str
Collection of words to ignore as tokens
Examples
>>> from topik.readers import read_input
>>> raw_data = read_input('{}/test_data_json_stream.json'.format(test_data_path), content_field="abstract")
>>> entities = collect_entities(raw_data)
>>> id, text = next(iter(raw_data))
>>> tokenized_text = tokenize_mixed(text, entities, min_length=3)
>>> tokenized_text
[u'transition', u'metal', u'oxides', u'generation', u'materials', u'tantalum', u'oxide', u'nanometer', u'size', u'unusual', u'properties', u'sol', u'gel', u'method', u'dna', u'easy', u'method', u'biomedical', u'applications']
topik.tokenizers.tokenize_simple(text, min_length=1, stopwords=None)[source]¶
A text tokenizer that simply lowercases, matches alphabetic characters, and removes stopwords.
Parameters: text : str
A single document’s text to be tokenized
min_length : int
Minimum length of any single word
stopwords : None or iterable of str
Collection of words to ignore as tokens
Examples
>>> from topik.readers import read_input
>>> id_documents = read_input(
...     '{}/test_data_json_stream.json'.format(test_data_path),
...     content_field="abstract")
>>> id, doc_text = next(iter(id_documents))
>>> doc_text
u'Transition metal oxides are being considered as the next generation materials in field such as electronics and advanced catalysts; between them is Tantalum (V) Oxide; however, there are few reports for the synthesis of this material at the nanometer size which could have unusual properties. Hence, in this work we present the synthesis of Ta2O5 nanorods by sol gel method using DNA as structure directing agent, the size of the nanorods was of the order of 40 to 100 nm in diameter and several microns in length; this easy method can be useful in the preparation of nanomaterials for electronics, biomedical applications as well as catalysts.'
>>> tokens = tokenize_simple(doc_text)
>>> tokens
[u'transition', u'metal', u'oxides', u'considered', u'generation', u'materials', u'field', u'electronics', u'advanced', u'catalysts', u'tantalum', u'v', u'oxide', u'reports', u'synthesis', u'material', u'nanometer', u'size', u'unusual', u'properties', u'work', u'present', u'synthesis', u'ta', u'o', u'nanorods', u'sol', u'gel', u'method', u'dna', u'structure', u'directing', u'agent', u'size', u'nanorods', u'order', u'nm', u'diameter', u'microns', u'length', u'easy', u'method', u'useful', u'preparation', u'nanomaterials', u'electronics', u'biomedical', u'applications', u'catalysts']
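The min_length and stopwords parameters can be combined to drop short or unwanted tokens; a hedged sketch with a hypothetical stopword set:
>>> # Hypothetical stopword set; min_length=3 also drops one- and two-letter tokens.
>>> filtered = tokenize_simple(doc_text, min_length=3,
...                            stopwords={'considered', 'useful'})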
topik.utils module¶
topik.viz module¶
class topik.viz.Termite(input_file, title)[source]¶
Bases: object
A Bokeh Termite Visualization for LDA results analysis.
Parameters: input_file : str or pandas DataFrame
A pandas DataFrame from a topik model’s get_termite_data(), containing columns “word”, “topic”, and “weight”. May also be a string, in which case it is treated as the filename of a CSV file with the above columns.
title : str
The title for your termite plot
Examples
>>> termite = Termite("{}/termite.csv".format(test_data_path),
...                   "My lda results")
>>> termite.plot('my_termite.html')