topik.models package
Submodules
topik.models.lda module
class topik.models.lda.LDA(corpus_input=None, ntopics=10, load_filename=None, binary_filename=None, **kwargs)
Bases: topik.models.model_base.TopicModelBase
A high-level interface for an LDA (Latent Dirichlet Allocation) model.
Parameters:
corpus_input : CorpusBase-derived object
    An object fulfilling the basic Corpus interface (preprocessed, tokenized text). See topik.intermediaries.tokenized_corpus for more info.
ntopics : int
    Number of topics to model.
load_filename : None or str
    If not None, this (JSON) file is read to determine the parameters of the model persisted to disk.
binary_filename : None or str
    If not None, this file is loaded by Gensim to bring a disk-persisted model back into memory.
Examples
>>> raw_data = read_input('{}/test_data_json_stream.json'.format(test_data_path), "abstract")
>>> processed_data = raw_data.tokenize()  # tokenize returns a DigestedDocumentCollection
>>> model = LDA(processed_data, ntopics=3)
Attributes
corpus : CorpusBase-derived object, tokenized
model : Gensim LdaModel instance
topik.models.model_base module
class topik.models.model_base.TopicModelBase
Bases: object
Abstract base class for topic models.
Ensures consistent interface across models, for base result display capabilities.
Attributes
_corpus : topik.intermediaries.digested_document_collection.DigestedDocumentCollection-derived object
    The input data for this model.
_persistor : topik.intermediaries.persistor.Persistor object
    The object responsible for persisting the state of this model to disk. The Persistor saves metadata that instructs load_model how to load the actual data.
get_model_name_with_parameters()
Abstract method. Primarily an internal function, used to name configurations in persisted metadata for later retrieval.
get_top_words(topn)
Abstract method. Implementations should collect the top n words per topic, translating indices/ids to words.
Returns: list of lists of tuples
    - outer list: topics
    - inner lists: length-topn collections of (weight, word) tuples
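The return shape described above can be illustrated with a small stand-alone sketch; the weights and words here are made up for illustration, not output from a real model:

```python
# Hypothetical return value of get_top_words(topn=2) for a 3-topic model.
# Outer list: one entry per topic; inner lists: (weight, word) tuples.
top_words = [
    [(0.0213, "nm"), (0.0187, "high")],       # topic 0
    [(0.0251, "size"), (0.0199, "phase")],    # topic 1
    [(0.0242, "films"), (0.0210, "matrix")],  # topic 2
]

# Each inner list holds one topic's topn most heavily weighted words.
for topic_id, words in enumerate(top_words):
    for weight, word in words:
        print(topic_id, weight, word)
```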
save(filename, saved_data)
Abstract method. Persist the model metadata and data to disk.
Implementations should both save their important data to disk with some known keyword (perhaps a filename or server address details), and pass a dictionary as saved_data. The contents of this dictionary will be passed to the class’ constructor as **kwargs.
Be sure to either call super(YourClass, self).save(filename, saved_data) or otherwise duplicate the base level of functionality here.
Parameters:
filename : str
    The filename of the JSON file to be saved, containing model and corpus metadata that allow for reconstruction.
saved_data : dict
    Dictionary of metadata that will be fed to the class __init__ method at load time. This should include such things as the number of topics modeled, binary filenames, and any other model parameters needed to recreate your current model.
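The save/load contract can be sketched with a minimal stand-in; the base class below and MyLDAModel are illustrative assumptions for this sketch, not topik's actual implementation or on-disk format:

```python
import json
import os
import tempfile

class TopicModelBase(object):
    # Sketch of the base-level functionality: persist metadata that tells
    # load-time code which kwargs to feed back into the subclass __init__.
    def save(self, filename, saved_data):
        with open(filename, "w") as f:
            json.dump({"class": type(self).__name__,
                       "saved_data": saved_data}, f)

class MyLDAModel(TopicModelBase):
    # Hypothetical subclass, included only to show the override pattern.
    def __init__(self, ntopics=10, binary_filename=None, **kwargs):
        self.ntopics = ntopics
        self.binary_filename = binary_filename

    def save(self, filename, saved_data=None):
        # A real implementation would first write its heavy data (e.g. a
        # Gensim binary) to self.binary_filename, then record everything
        # __init__ needs to rebuild the model.
        saved_data = {"ntopics": self.ntopics,
                      "binary_filename": self.binary_filename}
        # Call super(...) so the base-level persistence still happens.
        super(MyLDAModel, self).save(filename, saved_data)

meta_path = os.path.join(tempfile.mkdtemp(), "model_meta.json")
model = MyLDAModel(ntopics=3, binary_filename="lda.bin")
model.save(meta_path)
with open(meta_path) as f:
    meta = json.load(f)
restored = MyLDAModel(**meta["saved_data"])  # the **kwargs round-trip
```

The key point is the round-trip: whatever dictionary the subclass passes to the base save is exactly what gets splatted into __init__ at load time.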
termite_data(topn_words=15)
Generate the pandas DataFrame input for the termite plot.
Parameters:
topn_words : int
    Number of words to include from each topic.
Examples
>>> raw_data = read_input('{}/test_data_json_stream.json'.format(test_data_path), "abstract")
>>> processed_data = raw_data.tokenize()  # tokenize returns a DigestedDocumentCollection
>>> # must set seed so that we get same topics each run
>>> import random
>>> import numpy
>>> random.seed(42)
>>> numpy.random.seed(42)
>>> model = registered_models["LDA"](processed_data, ntopics=3)
>>> model.termite_data(5)
    topic    weight         word
0       0  0.005337           nm
1       0  0.005193         high
2       0  0.004622        films
3       0  0.004457       matrix
4       0  0.004194     electron
5       1  0.005109   properties
6       1  0.004654         size
7       1  0.004539  temperature
8       1  0.004499           nm
9       1  0.004248   mechanical
10      2  0.007994         high
11      2  0.006458           nm
12      2  0.005717         size
13      2  0.005399    materials
14      2  0.004734        phase