topik.fileio package

Submodules

topik.fileio.base_output module

class topik.fileio.base_output.OutputInterface(*args, **kwargs)[source]

Bases: object

close()[source]
get_filtered_data(field_to_get, filter='')[source]
save(filename, saved_data=None)[source]

Persist this object to disk somehow.

You can save your data in any number of files in any format, but at a minimum, you need one json file that describes enough to bootstrap the loading process. Namely, you must have a key called ‘class’ so that upon loading the output, the correct class can be instantiated and used to load any other data. You don’t have to implement anything for saved_data, but it is stored as a key next to ‘class’.
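
As an illustration only, a minimal sketch of the bootstrap file this contract implies (the filename and the ‘saved_data’ value are arbitrary assumptions; only the ‘class’ key is required):

>>> import json
>>> with open('my_output.topik', 'w') as f:
...     json.dump({'class': 'InMemoryOutput',             # required: class to re-instantiate on load
...                'saved_data': 'my_output_data.json'},  # optional, stored next to 'class'
...               f)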

synchronize(max_wait, field)[source]

By default, operations are synchronous and no additional wait is necessary. Data sources that are asynchronous (e.g. Elasticsearch) may use this function to wait for “eventual consistency”.
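
As a sketch only, an override for an eventually-consistent store might poll until the visible document count stops changing; self._count(field) below is a hypothetical helper, not part of this interface:

>>> import time
>>> def synchronize(self, max_wait, field):
...     start, last = time.time(), -1
...     while time.time() - start < max_wait:
...         current = self._count(field)  # hypothetical: documents currently visible for this field
...         if current == last:
...             break                     # count is stable; assume consistency reached
...         last = current
...         time.sleep(1)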

topik.fileio.base_output.load_output(filename)[source]

topik.fileio.in_document_folder module

topik.fileio.in_document_folder.read_document_folder(folder, content_field='text')[source]

Iterate over the files in a folder to retrieve the content to process and tokenize.

Parameters:

folder : str

The folder containing the files you want to analyze.

content_field : str

The usage of ‘content_field’ in this source is different from most other sources. The assumption in this source is that each file contains raw text, NOT dictionaries of categorized data. The content_field argument here specifies what key to store the raw text under in the returned dictionary for each document.

Examples

>>> documents = read_document_folder(
...     '{}/test_data_folder_files'.format(test_data_path))
>>> next(documents)['text'] == (
...     u"'Interstellar' was incredible. The visuals, the score, " +
...     u"the acting, were all amazing. The plot is definitely one " +
...     u"of the most original I've seen in a while.")
True

topik.fileio.in_elastic module

topik.fileio.in_elastic.read_elastic(hosts, **kwargs)[source]

Iterate over all documents in the specified elasticsearch instance and index that match the specified query.

kwargs are passed to Elasticsearch class instantiation, and can be used to pass any additional options described at https://elasticsearch-py.readthedocs.org/en/master/

Parameters:

hosts : str or list

Address of the elasticsearch instance and index. May include port, username and password. See https://elasticsearch-py.readthedocs.org/en/master/api.html#elasticsearch for all options.

content_field : str

The name of the field that contains the main text body of the document.

**kwargs : additional keyword arguments to be passed to the Elasticsearch client instance and to the scan query.
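
A hedged usage sketch (the host address, index name, and the ‘index’ keyword argument shown here are placeholders/assumptions based on the parameter descriptions above):

>>> documents = read_elastic('localhost:9200', index='my_index',
...                          content_field='abstract')
>>> doc = next(documents)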

topik.fileio.in_json module

topik.fileio.in_json.read_json_stream(filename, json_prefix='item', **kwargs)[source]

Iterate over a json stream of items and get the field that contains the text to process and tokenize.

Parameters:

filename : str

The filename of the json stream.

Examples

>>> documents = read_json_stream(
... '{}/test_data_json_stream.json'.format(test_data_path))
>>> next(documents) == {
... u'doi': u'http://dx.doi.org/10.1557/PROC-879-Z3.3',
... u'title': u'Sol Gel Preparation of Ta2O5 Nanorods Using DNA as Structure Directing Agent',
... u'url': u'http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=8081671&fulltextType=RA&fileId=S1946427400119281.html',
... u'abstract': u'Transition metal oxides are being considered as the next generation materials in field such as electronics and advanced catalysts; between them is Tantalum (V) Oxide; however, there are few reports for the synthesis of this material at the nanometer size which could have unusual properties. Hence, in this work we present the synthesis of Ta2O5 nanorods by sol gel method using DNA as structure directing agent, the size of the nanorods was of the order of 40 to 100 nm in diameter and several microns in length; this easy method can be useful in the preparation of nanomaterials for electronics, biomedical applications as well as catalysts.',
... u'filepath': u'abstracts/879/http%3A%2F%2Fjournals.cambridge.org%2Faction%2FdisplayAbstract%3FfromPage%3Donline%26aid%3D8081671%26fulltextType%3DRA%26fileId%3DS1946427400119281.html',
... u'filename': '{}/test_data_json_stream.json'.format(test_data_path),
... u'vol': u'879',
... u'authors': [u'Humberto A. Monreala', u' Alberto M. Villafañe',
...              u' José G. Chacón', u' Perla E. García',
...              u'Carlos A. Martínez'],
... u'year': u'1917'}
True

topik.fileio.in_json.read_large_json(filename, json_prefix='item', **kwargs)[source]

Iterate over all items and sub-items in a json object that match the specified prefix.

Parameters:

filename : str

The filename of the large json file

json_prefix : str

The string representation of the hierarchical prefix where the items of interest may be located within the larger json object.

Try the following script if you need help determining the desired prefix:

>>> import ijson
>>> with open('test_data_large_json_2.json', 'r') as f:
...     parser = ijson.parse(f)
...     for prefix, event, value in parser:
...         print("prefix = '%r' || event = '%r' || value = '%r'" %
...               (prefix, event, value))

Examples

>>> documents = read_large_json(
...             '{}/test_data_large_json.json'.format(test_data_path),
...             json_prefix='item._source.isAuthorOf')
>>> next(documents) == {
... u'a': u'ScholarlyArticle',
... u'name': u'Path planning and formation control via potential function for UAV Quadrotor',
... u'author': [
...     u'http://dig.isi.edu/autonomy/data/author/a.a.a.rizqi',
...     u'http://dig.isi.edu/autonomy/data/author/t.b.adji',
...     u'http://dig.isi.edu/autonomy/data/author/a.i.cahyadi'],
... u'text': u"Potential-function-based control strategy for path planning and formation " +
...     u"control of Quadrotors is proposed in this work. The potential function is " +
...     u"used to attract the Quadrotor to the goal location as well as avoiding the " +
...     u"obstacle. The algorithm to solve the so called local minima problem by utilizing " +
...     u"the wall-following behavior is also explained. The resulted path planning via " +
...     u"potential function strategy is then used to design formation control algorithm. " +
...     u"Using the hybrid virtual leader and behavioral approach schema, the formation " +
...     u"control strategy by means of potential function is proposed. The overall strategy " +
...     u"has been successfully applied to the Quadrotor's model of Parrot AR Drone 2.0 in " +
...     u"Gazebo simulator programmed using Robot Operating System.\nAuthor(s) Rizqi, A.A.A. " +
...     u"Dept. of Electr. Eng. & Inf. Technol., Univ. Gadjah Mada, Yogyakarta, Indonesia " +
...     u"Cahyadi, A.I. ; Adji, T.B.\nReferenced Items are not available for this document.\n" +
...     u"No versions found for this document.\nStandards Dictionary Terms are available to " +
...     u"subscribers only.",
... u'uri': u'http://dig.isi.edu/autonomy/data/article/6871517',
... u'datePublished': u'2014',
... 'filename': '{}/test_data_large_json.json'.format(test_data_path)}
True

topik.fileio.out_elastic module

class topik.fileio.out_elastic.BaseElasticCorpora(instance, index, corpus_type, query=None, batch_size=1000)[source]

Bases: UserDict.UserDict

class topik.fileio.out_elastic.ElasticSearchOutput(source, index, hash_field=None, doc_type='continuum', query=None, iterable=None, filter_expression='', vectorized_corpora=None, tokenized_corpora=None, modeled_corpora=None, **kwargs)[source]

Bases: topik.fileio.base_output.OutputInterface

convert_date_field_and_reindex(field)[source]
filter_string
get_date_filtered_data(field_to_get, start, end, filter_field='date')[source]
get_filtered_data(field_to_get, filter='')[source]
import_from_iterable(iterable, field_to_hash='text', batch_size=500)[source]

Load data into Elasticsearch from iterable.

Parameters:

iterable : iterable of dicts (or of strings)

This is your data. Your dictionary structure defines the schema of the Elasticsearch index.

field_to_hash : str

Identifier of the field to hash for the content ID. For a list of dicts, a valid key in each dictionary is required. For a list of strings, a dictionary with the single key “text” is created and used.
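
A usage sketch (the host address, index name, and documents shown are placeholders):

>>> output = ElasticSearchOutput('localhost:9200', 'my_index')
>>> output.import_from_iterable([{'text': 'first document'},
...                              {'text': 'second document'}],
...                             field_to_hash='text')
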
save(filename, saved_data=None)[source]
synchronize(max_wait, field)[source]
class topik.fileio.out_elastic.ModeledElasticCorpora(instance, index, corpus_type, query=None, batch_size=1000)[source]

Bases: topik.fileio.out_elastic.BaseElasticCorpora

class topik.fileio.out_elastic.VectorizedElasticCorpora(instance, index, corpus_type, query=None, batch_size=1000)[source]

Bases: topik.fileio.out_elastic.BaseElasticCorpora

topik.fileio.out_elastic.es_getitem(key, doc_type, instance, index, query=None)[source]
topik.fileio.out_elastic.es_setitem(key, value, doc_type, instance, index, batch_size=1000)[source]

Load an iterable of (id, value) pairs into the specified new or existing field within existing documents.
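
An illustrative call, assuming ‘es’ is an elasticsearch.Elasticsearch client instance and ‘tokens’ is the field being written (the document ids and values are made up):

>>> es_setitem('tokens',                              # field to create or update
...            [('doc1', ['sol', 'gel', 'method']),   # (document id, value) pairs
...             ('doc2', ['nanorods', 'synthesis'])],
...            doc_type='continuum', instance=es, index='my_index')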

topik.fileio.out_memory module

class topik.fileio.out_memory.GreedyDict(dict=None, **kwargs)[source]

Bases: UserDict.UserDict, object

class topik.fileio.out_memory.InMemoryOutput(iterable=None, hash_field=None, tokenized_corpora=None, vectorized_corpora=None, modeled_corpora=None)[source]

Bases: topik.fileio.base_output.OutputInterface

get_date_filtered_data(field_to_get, start, end, filter_field='year')[source]
get_filtered_data(field_to_get, filter='')[source]
import_from_iterable(iterable, field_to_hash)[source]

Parameters:

iterable : iterable of dicts (or of strings)

This is your data. Your dictionary structure defines the schema of the stored documents.

field_to_hash : str

Identifier of the field to hash for the content ID.
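
A usage sketch (the documents shown are made up):

>>> output = InMemoryOutput()
>>> output.import_from_iterable([{'text': 'first document'},
...                              {'text': 'second document'}],
...                             field_to_hash='text')
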
save(filename)[source]

topik.fileio.project module

class topik.fileio.project.TopikProject(project_name, output_type=None, output_args=None, **kwargs)[source]

Bases: object

close()[source]
get_date_filtered_corpus_iterator(start, end, filter_field, field_to_get=None)[source]
get_filtered_corpus_iterator(field=None, filter_expression=None)[source]
read_input(source, content_field, source_type='auto', **kwargs)[source]

Import data from external source into Topik’s internal format

run_model(model_name='lda', ntopics=3, **kwargs)[source]

Analyze vectorized text; determine topics and assign document probabilities

save()[source]

Save project as .topikproject metafile and some number of sidecar data files.

select_modeled_corpus(_id)[source]

When more than one model output is available (modeling was run more than once with different methods), this allows you to switch to a different data set.

select_tokenized_corpus(_id)[source]

Assign active tokenized corpus.

When more than one tokenized corpus is available (tokenization was run more than once with different methods), this allows you to switch to a different data set.

select_vectorized_corpus(_id)[source]

Assign active vectorized corpus.

When more than one vectorized corpus is available (vectorization was run more than once with different methods), this allows you to switch to a different data set.

selected_filtered_corpus

Corpus documents, potentially a subset.

Output from read_input step. Input to tokenization step.

selected_modeled_corpus

Matrices representing the derived model.

Output from modeling step. Input to visualization step.

selected_tokenized_corpus

Documents broken into component words. May also be transformed.

Output from tokenization and/or transformation steps. Input to vectorization step.

selected_vectorized_corpus

Data that has been vectorized into term frequencies, TF/IDF, or other vector representation.

Output from vectorization step. Input to modeling step.

tokenize(method='simple', **kwargs)[source]

Break raw text into constituent terms (or collections of terms)

transform(method, **kwargs)[source]

Stem or lemmatize input text that has already been tokenized

vectorize(method='bag_of_words', **kwargs)[source]

Convert tokenized text to vector form, the mathematical representation used for modeling.

visualize(vis_name='lda_vis', model_id=None, **kwargs)[source]

Plot model output
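
Taken together, the methods above compose into a pipeline. A hedged end-to-end sketch (the project name, data path, and content field are placeholders):

>>> project = TopikProject('my_project')
>>> project.read_input('{}/test_data_json_stream.json'.format(test_data_path),
...                    content_field='abstract')
>>> project.tokenize(method='simple')
>>> project.vectorize(method='bag_of_words')
>>> project.run_model(model_name='lda', ntopics=3)
>>> project.visualize(vis_name='lda_vis')
>>> project.save()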

topik.fileio.reader module

topik.fileio.reader.read_input(source, source_type='auto', folder_content_field='text', **kwargs)[source]

Read data from given source into Topik’s internal data structures.

Parameters:

source : str

Input data. Can be a file path, directory, or server address.

source_type : str

“auto” tries to figure out the data type of the source. It can be specified manually instead; options for manual specification are [‘solr’, ‘elastic’, ‘json_stream’, ‘large_json’, ‘folder’].

folder_content_field : str

Only used for the document_folder source. This argument is the key (field name) under which each file’s raw text is stored in the returned document dictionary.

kwargs : any other arguments to pass to input parsers

Returns:

iterable output object, which can be consumed, for example, as:

>>> ids, texts = zip(*list(iter(raw_data)))

Examples

>>> loaded_corpus = read_input(
...     '{}/test_data_json_stream.json'.format(test_data_path))
>>> solution_text = (
...     u'Transition metal oxides are being considered as the next generation '+
...     u'materials in field such as electronics and advanced catalysts; '+
...     u'between them is Tantalum (V) Oxide; however, there are few reports '+
...     u'for the synthesis of this material at the nanometer size which could '+
...     u'have unusual properties. Hence, in this work we present the '+
...     u'synthesis of Ta2O5 nanorods by sol gel method using DNA as structure '+
...     u'directing agent, the size of the nanorods was of the order of 40 to '+
...     u'100 nm in diameter and several microns in length; this easy method '+
...     u'can be useful in the preparation of nanomaterials for electronics, '+
...     u'biomedical applications as well as catalysts.')
>>> solution_text == next(loaded_corpus)['abstract']
True

Module contents