Database Integrated Reading Tools

This section defines the procedures for reading content stored in the database, saving the reading outputs, producing statements from those readings, and inserting the raw statements into the database.

The Database Readers (`indra_db.reading.read_db`)

A reader is defined as a python class which implements the machinery needed to process the text content we store, read it, and extract Statements from the reading results, storing the readings along the way. The reader must conform to a standard interface, which then allows readers to be run in a plug-and-play manner.

This module provides the essential tools for running reading against INDRA's own database. It may also be run as a script; for details, run: python read_pmids_db --help

exception indra_db.reading.read_db.ReadDBError[source]

indra_db.reading.read_db.generate_reading_id(tcid, reader_name, reader_version)[source]

Generate the unique reading ID hash from content ID, reader, and version.

The format of the hash is AABBCCCCCCCCCC, where A is the placeholder for the reader ID, B is the placeholder for the reader version integer, and C is reserved for the text content ID (it is loosely assumed we will not exceed 10^11 pieces of text content).

Parameters:

tcid (str) – The string-ified text content ID.
reader_name (str) – The name of the reader. It must be one of the readers in readers.
reader_version (str) – The version of the reader, which must be in the list of versions for the given reader_name in reader_versions.
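The AABBCCCCCCCCCC layout described above can be illustrated with a minimal sketch. The lookup tables `READER_IDS` and `READER_VERSIONS`, their contents, and the function name `sketch_reading_id` are hypothetical stand-ins for illustration only; the real module keeps its own reader and version registries and may build the ID differently.

```python
# Hypothetical registries (assumptions, not the module's actual data).
READER_IDS = {"REACH": 1, "SPARSER": 2}
READER_VERSIONS = {"REACH": ["v1", "v2"], "SPARSER": ["v1"]}


def sketch_reading_id(tcid, reader_name, reader_version):
    """Build an ID laid out as AABBCCCCCCCCCC.

    AA is the reader ID, BB is the index of the version in the reader's
    version list, and the ten C digits hold the text content ID.
    """
    name = reader_name.upper()
    reader_id = READER_IDS[name]
    version_idx = READER_VERSIONS[name].index(reader_version)
    # Zero-pad each field to its fixed width, then join the fields.
    padded = f"{reader_id:02d}{version_idx:02d}{int(tcid):010d}"
    # Converting to int drops any leading zero of the reader field.
    return int(padded)


example_id = sketch_reading_id("12345", "REACH", "v2")
```

Because each field has a fixed width, two readings can only collide if they share the same reader, version, and text content ID.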

class indra_db.reading.read_db.DatabaseResultData(result, reading_id=None, db_info_id=None, indra_version=None)[source]

Contains metadata for statements, as well as the statement itself.

This, like ReadingData, is primarily designed for use with the database, carrying valuable information and methods for such.

Parameters:

result (an indra Result instance) – The result whose extra metadata this object encapsulates.
reading_id (int or None) – The id number of the entry in the readings table of the database. None if no such id is available.
indra_version (str or None) – Override the default indra version, which is the version of indra currently installed.

class indra_db.reading.read_db.DatabaseStatementData(*args, **kwargs)[source]

static get_cols()[source]: Get the columns for the tuple returned by make_tuple.

make_tuple(batch_id)[source]: Make a tuple for copying into the database.

class indra_db.reading.read_db.DatabaseMeshRefData(result, reading_id=None, db_info_id=None, indra_version=None)[source]

static get_cols()[source]: Get the columns for the tuple returned by make_tuple.

make_tuple(batch_id)[source]: Make a tuple for copying into the database.

class indra_db.reading.read_db.DatabaseReader(tcids, reader, verbose=True, reading_mode='unread', rslt_mode='all', batch_size=1000, db=None, n_proc=1)[source]

A class to run readings using the database.

Parameters:

tcids (iterable of ints) – An iterable (set, list, tuple, generator, etc) of integers referring to the primary keys of text content in the database.
reader (Reader) – An INDRA Reader object.
verbose (bool) – Optional, default True - If True, log and print the output of the command-line reader utilities; if False, don't.
reading_mode (str : 'all', 'unread', or 'none') – Optional, default 'unread' - If 'all', read everything (generally slow); if 'unread', only read content that has not yet been read (the cache of old readings may still be used if rslt_mode='all' to get everything); if 'none', don't read, and only retrieve existing readings.
rslt_mode (str : 'all', 'unread', or 'none') – Optional, default 'all' - If 'all', produce results for all content for all readers. If the readings were already produced, they will be retrieved from the database if reading_mode is 'none' or 'unread'. If this option is 'unread', only the newly produced readings will be processed. If 'none', no results will be produced.
batch_size (int) – Optional, default 1000 - The number of text content entries to be yielded by the database at a given time.
db (indra_db.DatabaseManager instance) – Optional, default is None, in which case the primary database provided by the get_db('primary') function is used. Used to interface with a different database.
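The interaction between reading_mode and rslt_mode can be easier to follow as set logic. The sketch below is one reading of the parameter descriptions above, not the class's actual implementation; the helper name `plan_reading` and its `already_read` argument are hypothetical.

```python
def plan_reading(tcids, already_read, reading_mode="unread", rslt_mode="all"):
    """Return (to_read, to_process) sets of text content IDs.

    to_read:    content that would be sent to the reader.
    to_process: content whose readings would be turned into results.
    """
    tcids = set(tcids)
    already_read = set(already_read) & tcids

    if reading_mode == "all":
        to_read = set(tcids)            # re-read everything
    elif reading_mode == "unread":
        to_read = tcids - already_read  # only content with no reading yet
    elif reading_mode == "none":
        to_read = set()                 # rely on existing readings only
    else:
        raise ValueError(f"Unknown reading_mode: {reading_mode!r}")

    # Existing readings that are not regenerated can be pulled from the cache.
    cached = already_read - to_read

    if rslt_mode == "all":
        to_process = to_read | cached   # new readings plus cached ones
    elif rslt_mode == "unread":
        to_process = set(to_read)       # newly produced readings only
    elif rslt_mode == "none":
        to_process = set()
    else:
        raise ValueError(f"Unknown rslt_mode: {rslt_mode!r}")

    return to_read, to_process
```

With the defaults ('unread' and 'all'), only missing readings are produced, but results are generated for every requested piece of content.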

dump_readings_to_db()[source]: Put the reading output on the database.

dump_readings_to_pickle(pickle_file)[source]: Dump the reading results into a pickle file.

get_readings()[source]: Get the reading output for the given ids.

dump_results_to_db()[source]: Upload the results to the database.

dump_results_to_pickle(pickle_file)[source]: Dump the results into a pickle file.

get_results()[source]: Convert the reader output into a list of ResultData instances.

make_results(reading_data_list, num_proc=1)[source]: Convert a list of ReadingData instances into ResultData instances.

indra_db.reading.read_db.process_content(text_content)[source]: Get the appropriate content object from the text content.

indra_db.reading.read_db.construct_readers(reader_names, **kwargs)[source]: Construct the Reader objects from the names of the readers.

indra_db.reading.read_db.read(db_reader, rslt_mode, reading_pickle, rslts_pickle, upload_readings, upload_rslts)[source]: Read for a single reader

indra_db.reading.read_db.run_reading(readers, tcids, verbose=True, reading_mode='unread', rslt_mode='all', batch_size=1000, reading_pickle=None, stmts_pickle=None, upload_readings=True, upload_stmts=True, db=None)[source]: Run the reading with the given readers on the given text content ids.
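As a rough usage sketch, the two functions above might be combined as follows. The wrapper name `read_some_content` is hypothetical, the reader names are taken from the submitter example in this document, and the snippet assumes a configured database; the return value of run_reading is not documented here, so it is passed through without interpretation.

```python
def read_some_content(tcids, reader_names=("reach", "sparser")):
    """Run the named readers on the given text content IDs (sketch only)."""
    # Imported lazily so this sketch can be loaded without indra_db installed.
    from indra_db.reading.read_db import construct_readers, run_reading

    # Build Reader objects from their names, per the signature above.
    readers = construct_readers(list(reader_names))

    # Keyword arguments mirror the documented signature of run_reading.
    return run_reading(
        readers,
        tcids,
        verbose=False,
        reading_mode="unread",   # only read content with no existing reading
        rslt_mode="all",         # produce results for all requested content
        batch_size=1000,
        upload_readings=True,
        upload_stmts=True,
    )
```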

The Database Script for Running on AWS (`indra_db.reading.read_db_aws`)

This is the script used to run reading on AWS Batch, generally run from an AWS Lambda function.

This script is intended to be run on an Amazon ECS container, so information for the job either needs to be provided in environment variables (e.g., the REACH version and path) or loaded from S3 (e.g., the list of PMIDs).

indra_db.reading.read_db_aws.is_trips_datestring(s)[source]: Indicate whether a string has the form of a TRIPS log dir.

A Class to Manage and Monitor AWS Batch Jobs (`indra_db.reading.submitter`)

Allow a manager to monitor the Batch jobs to prevent runaway jobs, and smooth out job runs and submissions.

This file acts as a script to run large batch jobs on AWS.

The key components are the DbReadingSubmitter class and the submit_db_reading function. The function is provided as a shallow wrapper for backwards compatibility, and may eventually be removed. The preferred method for running large batches from IPython, or from any Python environment, is the following:

>>> sub = DbReadingSubmitter('name_for_run', ['reach', 'sparser'])
>>> sub.set_options(prioritize=True)
>>> sub.submit_reading('file/location/of/ids_to_read.txt', 0, None, ids_per_job=1000)
>>> sub.watch_and_wait(idle_log_timeout=100, kill_on_timeout=True)

Additionally, this file may be run as a script. For details, run the following in your favorite command line:

bash$ python submit_reading_pipeline.py --help
