topik.intermediaries package

Submodules

topik.intermediaries.digested_document_collection module

topik.intermediaries.persistence module

This module handles the storage of data from the loading and analysis steps.

More precisely, the files written and read by this module describe how to read and write the actual data, so that the format of the data itself need not be tightly defined.

class topik.intermediaries.persistence.Persistor(filename=None)[source]

Bases: object

get_corpus_dict()[source]
get_model_details(model_id)[source]
list_available_models()[source]
load_data(filename)[source]
persist_data(filename)[source]
store_corpus(data_dict)[source]
store_model(model_id, model_dict)[source]
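
For orientation, a brief usage sketch follows. Only the method signatures above come from this listing; the dictionary contents and filenames are illustrative assumptions, not a documented schema.

    from topik.intermediaries.persistence import Persistor

    p = Persistor()
    # Store pointers that describe how to reload the real data (hypothetical keys).
    p.store_corpus({"class": "DictionaryCorpus", "saved_data": "my_corpus_data.json"})
    p.store_model("lda_3_topics", {"class": "LDA", "saved_data": "lda_3_topics.model"})
    p.persist_data("project_metadata.json")

    # Later, reload the metadata and inspect what was stored.
    p2 = Persistor()
    p2.load_data("project_metadata.json")
    print(p2.list_available_models())
    print(p2.get_model_details("lda_3_topics"))
    print(p2.get_corpus_dict())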

topik.intermediaries.raw_data module

This module provides a simple interface to data stored in Elasticsearch. The classes defined here are fed into the preprocessing step.

class topik.intermediaries.raw_data.CorpusInterface[source]

Bases: object

append_to_record(record_id, field_name, field_value)[source]

Used to store preprocessed output alongside the input data.

field_name is the destination field; field_value is the processed value.

classmethod class_key()[source]

Implement this method to return the string ID with which to store your class.

filter_string
get_date_filtered_data(start, end, field)[source]
get_generator_without_id(field=None)[source]

Returns a generator that yields field content without the associated doc_id.

save(filename, saved_data=None)[source]

Persist this object to disk somehow.

You can save your data in any number of files and in any format, but at a minimum you need one JSON file that describes enough to bootstrap the loading process. Namely, it must contain a key called ‘class’ so that, upon loading the output, the correct class can be instantiated and used to load any other data. You don’t have to implement anything for saved_data, but it is stored as a key next to ‘class’.
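
A minimal sketch of what an implementation might write to satisfy this contract; the exact on-disk layout used by topik’s built-in classes is not shown in this listing, so treat the details as an assumption.

    import json

    def save(self, filename, saved_data=None):
        # 'class' is the required bootstrap key; saved_data, whatever the
        # subclass chooses to put there, is stored alongside it.
        with open(filename, "w") as f:
            json.dump({"class": self.class_key(), "saved_data": saved_data}, f)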

synchronize(max_wait, field)[source]

By default, operations are synchronous and no additional wait is necessary. Data sources that are asynchronous (such as Elasticsearch) may use this function to wait for “eventual consistency”.

tokenize(method='simple', synchronous_wait=30, **kwargs)[source]

Convert data to lowercase; tokenize; create bag of words collection.

Output from this function is used as input to modeling steps.

raw_data: iterable corpus object containing the text to be processed. Each iteration call should return a new document’s content.
tokenizer_method: string id of the tokenizer to use. For valid keys, see topik.tokenizers.tokenizer_methods (a dictionary of classes).
kwargs: arbitrary dictionary of extra parameters. These are passed both to the tokenizer and to the vectorizer steps.
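
A minimal, self-contained usage sketch; DictionaryCorpus (documented below) is used only to have something concrete to call, and the field name "text" is an illustrative choice:

    from topik.intermediaries.raw_data import DictionaryCorpus

    corpus = DictionaryCorpus(content_field="text",
                              iterable=[{"text": "A first document."},
                                        {"text": "A second document."}])
    tokenized_corpus = corpus.tokenize(method="simple")
    # The tokenized result is what gets handed to the modeling step.
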
class topik.intermediaries.raw_data.DictionaryCorpus(content_field, iterable=None, generate_id=True, reference_field=None, content_filter=None)[source]

Bases: topik.intermediaries.raw_data.CorpusInterface

append_to_record(record_id, field_name, field_value)[source]
classmethod class_key()[source]
filter_string
get_date_filtered_data(start, end, field='year')[source]
get_field(field=None)[source]

Get a different field to iterate over, keeping all other details.

get_generator_without_id(field=None)[source]
import_from_iterable(iterable, content_field, generate_id=True)[source]
iterable: generally a list of dicts, but possibly a list of strings. This is your data; your dictionary structure defines the schema of the Elasticsearch index.
save(filename, saved_data=None)[source]
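
A sketch of the in-memory corpus in use; the field names, document contents, and year range are illustrative assumptions, not part of the API:

    from topik.intermediaries.raw_data import DictionaryCorpus

    documents = [
        {"abstract": "Topic modeling of materials science abstracts.", "year": 2015},
        {"abstract": "Graphene oxide synthesis at room temperature.", "year": 2014},
    ]
    corpus = DictionaryCorpus(content_field="abstract", iterable=documents)

    # Iterate over a different field while keeping all other details.
    years = corpus.get_field("year")

    # Restrict to a date range on a chosen field (the value formats here are assumptions).
    recent = corpus.get_date_filtered_data(start=2015, end=2016, field="year")

    # Persist; this writes the JSON metadata described under CorpusInterface.save().
    corpus.save("my_saved_corpus.json")
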
class topik.intermediaries.raw_data.ElasticSearchCorpus(source, index, content_field, doc_type=None, query=None, iterable=None, filter_expression='', **kwargs)[source]

Bases: topik.intermediaries.raw_data.CorpusInterface

append_to_record(record_id, field_name, field_value)[source]
classmethod class_key()[source]
convert_date_field_and_reindex(field)[source]
filter_string
get_date_filtered_data(start, end, field='date')[source]
get_field(field=None)[source]

Get a different field to iterate over, keeping all other connection details.

get_generator_without_id(field=None)[source]
import_from_iterable(iterable, id_field='text', batch_size=500)[source]

Load data into Elasticsearch from iterable.

iterable: generally a list of dicts, but possibly a list of strings. This is your data; your dictionary structure defines the schema of the Elasticsearch index.
id_field: string identifier of the field to hash for the content ID. For a list of dicts, a valid key in the dictionary is required. For a list of strings, a dictionary with a single key, “text”, is created and used.
save(filename, saved_data=None)[source]
synchronize(max_wait, field)[source]
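
The Elasticsearch-backed equivalent, sketched below; the host, index name, and field names are placeholders, and a running Elasticsearch instance is assumed:

    from topik.intermediaries.raw_data import ElasticSearchCorpus

    documents = [
        {"abstract": "Topic modeling of materials science abstracts.", "date": "2015-06-01"},
        {"abstract": "Graphene oxide synthesis at room temperature.", "date": "2014-03-15"},
    ]

    corpus = ElasticSearchCorpus(source="localhost",       # Elasticsearch host (placeholder)
                                 index="topik_abstracts",  # index name (placeholder)
                                 content_field="abstract")
    corpus.import_from_iterable(documents, id_field="abstract", batch_size=500)

    # Elasticsearch is eventually consistent: wait until the newly indexed
    # documents are visible before reading them back or tokenizing.
    corpus.synchronize(max_wait=30, field="abstract")
    tokenized_corpus = corpus.tokenize(method="simple")
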
topik.intermediaries.raw_data.load_persisted_corpus(filename)[source]
topik.intermediaries.raw_data.register_output(cls)[source]
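
The module-level helpers tie loading and registration together. In the sketch below, the idea that register_output makes a class discoverable by load_persisted_corpus via its class_key() is inferred from the save() contract above rather than stated explicitly in this listing, and the subclass shown is hypothetical and incomplete:

    from topik.intermediaries.raw_data import (CorpusInterface,
                                               load_persisted_corpus,
                                               register_output)

    # Reload a corpus saved earlier with its save() method; the 'class' key in
    # the metadata file selects which registered class gets instantiated.
    corpus = load_persisted_corpus("my_saved_corpus.json")

    # Registering a custom corpus class (hypothetical subclass for illustration).
    class MyCustomCorpus(CorpusInterface):
        @classmethod
        def class_key(cls):
            return "my_custom_corpus"

    register_output(MyCustomCorpus)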

Module contents