topik.models package
Submodules
topik.models.lda module
class topik.models.lda.LDA(corpus_input=None, ntopics=10, load_filename=None, binary_filename=None, **kwargs)
Bases: topik.models.model_base.TopicModelBase
A high-level interface for an LDA (Latent Dirichlet Allocation) model.
Parameters:
corpus_input : CorpusBase-derived object
    An object fulfilling the basic Corpus interface (preprocessed, tokenized text). See topik.intermediaries.tokenized_corpus for more info.
ntopics : int
    Number of topics to model.
load_filename : None or str
    If not None, this (JSON) file is read to determine the parameters of the model persisted to disk.
binary_filename : None or str
    If not None, this file is loaded by Gensim to bring a disk-persisted model back into memory.
Examples
>>> raw_data = read_input('{}/test_data_json_stream.json'.format(test_data_path), "abstract")
>>> processed_data = raw_data.tokenize()  # tokenize returns a DigestedDocumentCollection
>>> model = LDA(processed_data, ntopics=3)
Attributes
corpus : CorpusBase-derived object, tokenized
model : Gensim LdaModel instance
topik.models.model_base module
class topik.models.model_base.TopicModelBase
Bases: object
Abstract base class for topic models.
Ensures consistent interface across models, for base result display capabilities.
Attributes
_corpus : topik.intermediaries.digested_document_collection.DigestedDocumentCollection-derived object
    The input data for this model.
_persistor : topik.intermediaries.persistor.Persistor object
    The object responsible for persisting the state of this model to disk. The Persistor saves metadata that instructs load_model how to load the actual data.
get_model_name_with_parameters()
Abstract method. Primarily an internal function, used to name configurations in persisted metadata for later retrieval.
get_top_words(topn)
Abstract method. Implementations should collect the top n words per topic, translating indices/ids to words.
Returns: list of lists of tuples
    - outer list: topics
    - inner lists: length-topn collections of (weight, word) tuples
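The return shape described above can be illustrated with a small stand-alone sketch; the weights and words here are made up for illustration, not output from a real model:

```python
# Hypothetical return value of get_top_words(topn=2) for a 3-topic model.
# Outer list: one entry per topic; inner lists: (weight, word) tuples.
top_words = [
    [(0.0213, "nm"), (0.0187, "high")],       # topic 0
    [(0.0251, "size"), (0.0199, "phase")],    # topic 1
    [(0.0242, "films"), (0.0210, "matrix")],  # topic 2
]

# Each inner list holds one topic's topn most heavily weighted words.
for topic_id, words in enumerate(top_words):
    for weight, word in words:
        print(topic_id, weight, word)
```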
save(filename, saved_data)
Abstract method. Persist the model metadata and data to disk.
Implementations should both save their important data to disk with some known keyword (perhaps a filename or server address details), and pass a dictionary as saved_data. The contents of this dictionary will be passed to the class’ constructor as **kwargs.
Be sure to either call super(YourClass, self).save(filename, saved_data) or otherwise duplicate the base level of functionality here.
Parameters:
filename : str
    The filename of the JSON file to be saved, containing model and corpus metadata that allow for reconstruction.
saved_data : dict
    Dictionary of metadata that will be fed to the class __init__ method at load time. This should include such things as the number of topics modeled, binary filenames, and any other model parameters needed to recreate your current model.
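The save/load contract can be sketched with a minimal stand-in; the base class below and MyLDAModel are illustrative assumptions for this sketch, not topik's actual implementation or on-disk format:

```python
import json
import os
import tempfile

class TopicModelBase(object):
    # Sketch of the base-level functionality: persist metadata that tells
    # load-time code which kwargs to feed back into the subclass __init__.
    def save(self, filename, saved_data):
        with open(filename, "w") as f:
            json.dump({"class": type(self).__name__,
                       "saved_data": saved_data}, f)

class MyLDAModel(TopicModelBase):
    # Hypothetical subclass, included only to show the override pattern.
    def __init__(self, ntopics=10, binary_filename=None, **kwargs):
        self.ntopics = ntopics
        self.binary_filename = binary_filename

    def save(self, filename, saved_data=None):
        # A real implementation would first write its heavy data (e.g. a
        # Gensim binary) to self.binary_filename, then record everything
        # __init__ needs to rebuild the model.
        saved_data = {"ntopics": self.ntopics,
                      "binary_filename": self.binary_filename}
        # Call super(...) so the base-level persistence still happens.
        super(MyLDAModel, self).save(filename, saved_data)

meta_path = os.path.join(tempfile.mkdtemp(), "model_meta.json")
model = MyLDAModel(ntopics=3, binary_filename="lda.bin")
model.save(meta_path)
with open(meta_path) as f:
    meta = json.load(f)
restored = MyLDAModel(**meta["saved_data"])  # the **kwargs round-trip
```

The key point is the round-trip: whatever dictionary the subclass passes to the base save is exactly what gets splatted into __init__ at load time.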
termite_data(topn_words=15)
Generate the pandas DataFrame input for the termite plot.
Parameters:
topn_words : int
    Number of words to include from each topic.
Examples
>>> raw_data = read_input('{}/test_data_json_stream.json'.format(test_data_path), "abstract")
>>> processed_data = raw_data.tokenize()  # tokenize returns a DigestedDocumentCollection
>>> # must set seed so that we get same topics each run
>>> import random
>>> import numpy
>>> random.seed(42)
>>> numpy.random.seed(42)
>>> model = registered_models["LDA"](processed_data, ntopics=3)
>>> model.termite_data(5)
    topic    weight         word
0       0  0.005337           nm
1       0  0.005193         high
2       0  0.004622        films
3       0  0.004457       matrix
4       0  0.004194     electron
5       1  0.005109   properties
6       1  0.004654         size
7       1  0.004539  temperature
8       1  0.004499           nm
9       1  0.004248   mechanical
10      2  0.007994         high
11      2  0.006458           nm
12      2  0.005717         size
13      2  0.005399    materials
14      2  0.004734        phase