topik package

Submodules

topik.cli module

topik.readers module

topik.readers.read_input(source, content_field, source_type='auto', output_type='dictionary', output_args=None, synchronous_wait=0, **kwargs)[source]

Read data from given source into Topik’s internal data structures.

Parameters:

source : str

Input data. Can be a file path, directory, or server address.

content_field : str

The field that contains the data to be analyzed. A hash of this field is used as the document id.

source_type : str

“auto” attempts to detect the data type of source. The type can instead be specified manually; options are [‘solr’, ‘elastic’, ‘json_stream’, ‘large_json’, ‘folder’].

output_type : str

Internal format for handling user data. Current options are listed in the registered_outputs dictionary. Default is the DictionaryCorpus class. Specify an alternative using its string key from that dictionary.

output_args : dict

Configuration to pass through to the output.

synchronous_wait : positive real number

Time in seconds to wait for data to finish uploading to the output (this happens asynchronously). Only relevant for some output types (“elastic”, but not “dictionary”).

kwargs : any other arguments to pass to input parsers

Returns:

iterable output object

Examples

>>> raw_data = read_input(
...         '{}/test_data_json_stream.json'.format(test_data_path),
...          content_field="abstract")
>>> id, text = next(iter(raw_data))
>>> text == (
... u'Transition metal oxides are being considered as the next generation '+
... u'materials in field such as electronics and advanced catalysts; '+
... u'between them is Tantalum (V) Oxide; however, there are few reports '+
... u'for the synthesis of this material at the nanometer size which could '+
... u'have unusual properties. Hence, in this work we present the '+
... u'synthesis of Ta2O5 nanorods by sol gel method using DNA as structure '+
... u'directing agent, the size of the nanorods was of the order of 40 to '+
... u'100 nm in diameter and several microns in length; this easy method '+
... u'can be useful in the preparation of nanomaterials for electronics, '+
... u'biomedical applications as well as catalysts.')
True
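
For asynchronous outputs such as Elasticsearch, synchronous_wait blocks until the upload completes. A hypothetical sketch: the “elastic” output key comes from the registered_outputs dictionary, but the output_args keys shown here are assumptions and depend on the output backend:

>>> raw_data = read_input(
...         '{}/test_data_json_stream.json'.format(test_data_path),
...         content_field="abstract",
...         output_type="elastic",
...         # hypothetical output_args; actual keys depend on the output backend
...         output_args={"host": "localhost", "index": "topik_data"},
...         synchronous_wait=30)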

topik.run module

topik.run.run_model(data_source, source_type='auto', year_field=None, start_year=None, stop_year=None, content_field=None, tokenizer='simple', n_topics=10, dir_path='./topic_model', model='LDA', termite_plot=True, output_file=False, ldavis=False, seed=42, **kwargs)[source]

Run your data through all topik functionality and save all results to a specified directory.

Parameters:

data_source : str

Input data (e.g. file or folder or solr/elasticsearch instance).

source_type : {‘json_stream’, ‘folder_files’, ‘json_large’, ‘solr’, ‘elastic’}

The format of your data input. Currently available: a JSON stream or a folder containing text files. Default is ‘auto’.

year_field : str

The field name (if any) that contains the year associated with each document (for filtering).

start_year : int

Beginning of the range filter on year_field values.

stop_year : int

End of the range filter on year_field values.

content_field : str

The primary text field to parse.

tokenizer : {‘simple’, ‘collocations’, ‘entities’, ‘mixed’}

The type of tokenizer to use. Default is ‘simple’.

n_topics : int

Number of topics to find in your data

dir_path : str

Directory path to store all topic modeling results files. Default is ./topic_model.

model : {‘LDA’, ‘PLSA’}

Statistical modeling algorithm to use. Default ‘LDA’.

termite_plot : bool

Generate a termite plot of your model if True. Default is True.

output_file : bool

Generate a final summary CSV file of your results, listing text, tokens, lda_probabilities, and topic for each document. Default is False.

ldavis : bool

Generate an interactive data visualization of your topics. Default is False.

seed : int

Seed for the random number generator, so that results can be reproduced. Default is 42.

**kwargs : additional keyword arguments, passed through to each individual step
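
Examples

A minimal sketch of an end-to-end run, using the same test data file as the examples above; all parameters shown are documented above, and results are written under dir_path:

>>> from topik.run import run_model
>>> run_model('{}/test_data_json_stream.json'.format(test_data_path),
...           content_field="abstract",
...           n_topics=10,
...           dir_path='./topic_model')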

topik.tokenizers module

topik.tokenizers.collect_bigrams_and_trigrams(collection, top_n=10000, min_length=1, min_bigram_freq=50, min_trigram_freq=20, stopwords=None)[source]

Collects bigrams and trigrams from a collection of documents, for input to the collocation tokenizer.

Bigrams are pairs of words that recur in the collection; trigrams are triplets.

Parameters:

collection : iterable of str

Body of documents to examine.

top_n : int

Limit results to this many entries.

min_length : int

Minimum length of any single word

min_bigram_freq : int

Minimum number of occurrences for a pair of words to be considered a recognized bigram.

min_trigram_freq : int

Minimum number of occurrences for a triplet of words to be considered a recognized trigram.

stopwords : None or iterable of str

Collection of words to ignore as tokens

Examples

>>> from topik.readers import read_input
>>> raw_data = read_input(
...                 '{}/test_data_json_stream.json'.format(test_data_path),
...                 content_field="abstract")
>>> bigrams, trigrams = collect_bigrams_and_trigrams(raw_data, min_bigram_freq=5, min_trigram_freq=3)
>>> bigrams.pattern
u'(free standing|ac electrodeposition|centered cubic|spatial resolution|vapor deposition|wear resistance|plastic deformation|electrical conductivity|field magnets|v o|transmission electron|x ray|et al|ray diffraction|electron microscopy|room temperature|diffraction xrd|electron microscope|results indicate|scanning electron|m s|doped zno|microscopy tem|polymer matrix|size distribution|mechanical properties|grain size|diameters nm|high spatial|particle size|high resolution|ni al|diameter nm|range nm|high field|high strength|c c)'
>>> trigrams.pattern
u'(differential scanning calorimetry|face centered cubic|ray microanalysis analytical|physical vapor deposition|transmission electron microscopy|x ray diffraction|microanalysis analytical electron|chemical vapor deposition|high aspect ratio|analytical electron microscope|ray diffraction xrd|x ray microanalysis|high spatial resolution|high field magnets|atomic force microscopy|electron microscopy tem|narrow size distribution|scanning electron microscopy|building high field|silicon oxide nanowires|particle size nm)'
topik.tokenizers.collect_entities(collection, freq_min=2, freq_max=10000)[source]

Return noun phrases from a collection of documents.

Parameters:

collection : object derived from the Corpus base class, or iterable collection of raw text

freq_min : int

Minimum frequency of occurrence for a noun phrase to be retrieved. Default is 2.

freq_max : int

Maximum frequency of occurrence for a noun phrase to be retrieved. Default is 10000.
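
Examples

The call below mirrors the usage shown under tokenize_entities, with the default frequency thresholds:

>>> from topik.readers import read_input
>>> id_documents = read_input('{}/test_data_json_stream.json'.format(test_data_path), content_field="abstract")
>>> entities = collect_entities(id_documents)
>>> len(entities)
220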

topik.tokenizers.tokenize_collocation(text, patterns, min_length=1, stopwords=None)[source]

A text tokenizer that includes collocations (bigrams and trigrams).

A collocation is a sequence of words or terms that co-occur more often than would be expected by chance. This function breaks a raw document up into tokens based on a pre-established collection of bigrams and trigrams. This collection is derived from a body of many documents, and must be obtained in a prior step using the collect_bigrams_and_trigrams function.

Uses nltk.collocations.TrigramCollocationFinder to find trigrams and bigrams.

Parameters:

text : str

A single document’s text to be tokenized

patterns : tuple of compiled regex objects to find n-grams

Obtained from collect_bigrams_and_trigrams function

min_length : int

Minimum length of any single word

stopwords : None or iterable of str

Collection of words to ignore as tokens

Examples

>>> from topik.readers import read_input
>>> id_documents = read_input('{}/test_data_json_stream.json'.format(test_data_path), content_field="abstract")
>>> patterns = collect_bigrams_and_trigrams(id_documents, min_bigram_freq=2, min_trigram_freq=2)
>>> id, doc_text = next(iter(id_documents))
>>> tokenized_text = tokenize_collocation(doc_text, patterns)
>>> tokenized_text
[u'transition_metal', u'oxides', u'considered', u'generation', u'materials', u'field', u'electronics', u'advanced', u'catalysts', u'tantalum', u'v_oxide', u'reports', u'synthesis_material', u'nanometer_size', u'unusual', u'properties', u'work_present', u'synthesis', u'ta', u'o', u'nanorods', u'sol', u'gel', u'method', u'dna', u'structure', u'directing', u'agent', u'size', u'nanorods', u'order', u'nm_diameter', u'microns', u'length', u'easy', u'method', u'useful', u'preparation', u'nanomaterials', u'electronics', u'biomedical', u'applications', u'catalysts']
topik.tokenizers.tokenize_entities(text, entities, min_length=1, stopwords=None)[source]

A tokenizer that extracts noun phrases from text.

Requires that you first establish entities using the collect_entities function.

Parameters:

text : str

A single document’s text to be tokenized

entities : iterable of str

Collection of noun phrases, obtained from collect_entities function

min_length : int

Minimum length of any single word

stopwords : None or iterable of str

Collection of words to ignore as tokens

Examples

>>> from topik.readers import read_input
>>> id_documents = read_input('{}/test_data_json_stream.json'.format(test_data_path), "abstract")
>>> entities = collect_entities(id_documents)
>>> len(entities)
220
>>> i = iter(id_documents)
>>> _, doc_text = next(i)
>>> doc_text
u'Transition metal oxides are being considered as the next generation materials in field such as electronics and advanced catalysts; between them is Tantalum (V) Oxide; however, there are few reports for the synthesis of this material at the nanometer size which could have unusual properties. Hence, in this work we present the synthesis of Ta2O5 nanorods by sol gel method using DNA as structure directing agent, the size of the nanorods was of the order of 40 to 100 nm in diameter and several microns in length; this easy method can be useful in the preparation of nanomaterials for electronics, biomedical applications as well as catalysts.'
>>> tokenized_text = tokenize_entities(doc_text, entities)
>>> tokenized_text
[u'transition']
topik.tokenizers.tokenize_mixed(text, entities, min_length=1, stopwords=None)[source]

A text tokenizer that first retrieves entities (‘noun phrases’), then uses simple words for the rest of the text.

Parameters:

text : str

A single document’s text to be tokenized

entities : iterable of str

Collection of noun phrases, obtained from collect_entities function

min_length : int

Minimum length of any single word

stopwords : None or iterable of str

Collection of words to ignore as tokens

Examples

>>> from topik.readers import read_input
>>> raw_data = read_input('{}/test_data_json_stream.json'.format(test_data_path), content_field="abstract")
>>> entities = collect_entities(raw_data)
>>> id, text = next(iter(raw_data))
>>> tokenized_text = tokenize_mixed(text, entities, min_length=3)
>>> tokenized_text
[u'transition', u'metal', u'oxides', u'generation', u'materials', u'tantalum', u'oxide', u'nanometer', u'size', u'unusual', u'properties', u'sol', u'gel', u'method', u'dna', u'easy', u'method', u'biomedical', u'applications']
topik.tokenizers.tokenize_simple(text, min_length=1, stopwords=None)[source]

A text tokenizer that simply lowercases, matches alphabetic characters, and removes stopwords.

Parameters:

text : str

A single document’s text to be tokenized

min_length : int

Minimum length of any single word

stopwords : None or iterable of str

Collection of words to ignore as tokens

Examples

>>> from topik.readers import read_input
>>> id_documents = read_input(
...                 '{}/test_data_json_stream.json'.format(test_data_path),
...                 content_field="abstract")
>>> id, doc_text = next(iter(id_documents))
>>> doc_text
u'Transition metal oxides are being considered as the next generation materials in field such as electronics and advanced catalysts; between them is Tantalum (V) Oxide; however, there are few reports for the synthesis of this material at the nanometer size which could have unusual properties. Hence, in this work we present the synthesis of Ta2O5 nanorods by sol gel method using DNA as structure directing agent, the size of the nanorods was of the order of 40 to 100 nm in diameter and several microns in length; this easy method can be useful in the preparation of nanomaterials for electronics, biomedical applications as well as catalysts.'
>>> tokens = tokenize_simple(doc_text)
>>> tokens
[u'transition', u'metal', u'oxides', u'considered', u'generation', u'materials', u'field', u'electronics', u'advanced', u'catalysts', u'tantalum', u'v', u'oxide', u'reports', u'synthesis', u'material', u'nanometer', u'size', u'unusual', u'properties', u'work', u'present', u'synthesis', u'ta', u'o', u'nanorods', u'sol', u'gel', u'method', u'dna', u'structure', u'directing', u'agent', u'size', u'nanorods', u'order', u'nm', u'diameter', u'microns', u'length', u'easy', u'method', u'useful', u'preparation', u'nanomaterials', u'electronics', u'biomedical', u'applications', u'catalysts']

topik.utils module

topik.viz module

class topik.viz.Termite(input_file, title)[source]

Bases: object

A Bokeh Termite Visualization for LDA results analysis.

Parameters:

input_file : str or pandas DataFrame

A pandas DataFrame from a topik model’s get_termite_data(), containing the columns “word”, “topic”, and “weight”. May also be a string, in which case it is the filename of a CSV file with the above columns.

title : str

The title for your termite plot

Examples

>>> termite = Termite("{}/termite.csv".format(test_data_path),
...                   "My lda results")
>>> termite.plot('my_termite.html')

plot(output_file='termite.html')[source]
topik.viz.plot_lda_vis(model_data)[source]

Designed to work with to_py_lda_vis() in the model classes.
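
Examples

A hypothetical sketch, assuming model is an already-trained topik model exposing the to_py_lda_vis() method mentioned above:

>>> vis_data = model.to_py_lda_vis()  # `model` is a hypothetical trained topik model
>>> plot_lda_vis(vis_data)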

Module contents