Data Import

Data import loads your data from some external representation into an iterable, internal representation for Topik. The main front end for importing data is the read_input() function:

>>> from topik import read_input
>>> corpus = read_input(source="./reviews/")

read_input() is a front end to several reader backends. Presently, read_input() attempts to recognize which backend to use based on characteristics of the source string you pass in. These criteria are:

  • ends with .js or .json: treat as a JSON stream filename first, falling back to “large JSON” (such as a file generated by esdump); see the example after this list.
  • contains 9200: treat as Elasticsearch connection address (9200 is the default Elasticsearch port).
  • result of os.path.splitext(source)[1] is “” (no file extension): treat as a folder of files. Each file is considered raw text, and its contents are stored under the key given by content_field. Files may be gzipped.
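
For example, a source ending in .json is picked up by the JSON stream reader (falling back to the large-JSON reader) on the basis of its extension alone. A minimal sketch, assuming a hypothetical data_file.json in the working directory:

>>> corpus = read_input(source="data_file.json")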

Any of the backends can also be forced by passing the source_type keyword argument with one of the following strings (see the topik.fileio.registered_inputs dictionary); an example follows the list:

  • elastic
  • json_stream
  • large_json
  • folder
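
For example, to bypass extension-based detection and force the large-JSON reader for the same hypothetical data_file.json, name the backend explicitly:

>>> corpus = read_input(source="data_file.json", source_type="large_json")

The other strings select their respective backends in the same way.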

JSON additional options

For JSON stream and “large JSON” inputs, an additional keyword argument, json_prefix, may be passed: the period-separated path of keys leading to the content field. This is for content fields that are not at the root level of the JSON document. For example, given the JSON content:

[ {"nested": {"dictionary": {"text": "I am the text you're looking for."} } } ]

You would read it using the following json_prefix argument:

>>> corpus = read_input(source="data_file.json", json_prefix="nested.dictionary")

Elasticsearch additional options and notes

The Elasticsearch importer expects a full string specifying the Elasticsearch server, at a minimum the server address and port; the index to read from is passed with the index keyword argument. All results returned from the Elasticsearch query contain only the contents of the ‘_source’ field returned from the query.

>>> corpus = read_input(source="https://localhost:9200", index="test_index")

Extra keyword arguments are passed on to the Elasticsearch instance creation. This can be used to pass additional login parameters, for example, to use SSL:

>>> corpus = read_input(source="https://user:secret@localhost:9200",
                        index="test_index", use_ssl=True)

The source argument for Elasticsearch also supports multiple servers, though this requires that you manually specify the ‘elastic’ source_type:

>>> corpus = read_input(source=["https://server1", "https://server2"],
                        index="test_index", source_type="elastic")

For more information on server options, please refer to Elasticsearch’s documentation.

Extra keyword arguments are also passed to the scroll helper that returns results. Of special note here, an additional query keyword argument can be passed to limit the records imported from the server. This query must follow the Elasticsearch query DSL. For more information on Elasticsearch query DSL, please refer to Elasticsearch’s DSL docs.

>>> query = '{"filtered": {"query": {"match": {"tweet": "full text search"}}}}'
>>> corpus = read_input(source="https://localhost:9200", index="test_index",
                        query=query)

Tracking documents

One important aspect that hasn’t come up here is that documents are tracked by hashing their contents. Projects do this for you automatically:

>>> from topik import TopikProject
>>> project = TopikProject("my_project")
>>> project.read_input("./reviews/", content_field="text")

If you do not use Topik’s project feature, then you need to create these ids yourself. Tokenization and all subsequent steps expect data that carries these ids, the idea being that any future parallelism will use them to keep track of data during and after processing. One way to generate ids is shown below:

>>> content_field = "text"
>>> raw_data = ((hash(item[content_field]), item[content_field]) for item in corpus)
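
Note that in Python 3 the built-in hash() of strings is randomized per interpreter process (unless PYTHONHASHSEED is fixed), so ids produced this way are not stable across runs. If you want ids that are reproducible between runs (an assumption of this sketch, not something Topik requires), a deterministic digest works as well:

>>> import hashlib
>>> content_field = "text"
>>> # use a stable digest of the content as the id instead of the built-in hash
>>> raw_data = ((hashlib.md5(item[content_field].encode("utf-8")).hexdigest(),
                 item[content_field]) for item in corpus)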