Database Integrated Reading Tools
This section defines the procedures for reading content in the database, storing the reading outputs, producing statements from the readings, and inserting those raw statements into the database.
The Database Readers (indra_db.reading.read_db)
A reader is defined as a Python class that implements the machinery needed to process the text content we store, read it, and extract Statements from the reading results, storing the readings along the way. The reader must conform to a standard interface, which allows readers to be run in a plug-and-play manner.
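The plug-and-play idea can be illustrated with a minimal sketch. Everything here is hypothetical: the class names, the method names read and extract_statements, and the toy "reader" are assumptions made for illustration, not the interface INDRA actually defines.

```python
from abc import ABC, abstractmethod


class SketchReader(ABC):
    """Hypothetical stand-in for a standard reader interface."""

    name = None  # short identifier used to select the reader

    @abstractmethod
    def read(self, contents):
        """Return a list of (content_id, raw_output) pairs."""

    @abstractmethod
    def extract_statements(self, raw_output):
        """Return the statements found in one raw reading output."""


class UppercaseReader(SketchReader):
    """Toy reader: the 'reading' is the upper-cased text, and each
    sentence of that output counts as one 'statement'."""

    name = 'uppercase'

    def read(self, contents):
        return [(cid, text.upper()) for cid, text in contents]

    def extract_statements(self, raw_output):
        return [s.strip() for s in raw_output.split('.') if s.strip()]


def run_reader(reader, contents):
    """Drive any conforming reader, keeping the readings along the way."""
    readings = dict(reader.read(contents))
    statements = {cid: reader.extract_statements(out)
                  for cid, out in readings.items()}
    return readings, statements
```

Because run_reader only relies on the two abstract methods, any class that implements them could be swapped in without changing the driver.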
This module provides essential tools to run reading using indra’s own
database. This may also be run as a script; for details run:
python read_pmids_db --help
- indra_db.reading.read_db.generate_reading_id(tcid, reader_name, reader_version)
Generate the unique reading ID hash from content ID, reader, and version.
The format of the hash is AABBCCCCCCCCCC, where A is the placeholder for the reader ID, B is the placeholder for the reader version integer, and C is reserved for the text content ID (it is loosely assumed we will not exceed 10^11 pieces of text content).
- Parameters:
tcid (str) – The string-ified text content ID.
reader_name (str) – The name of the reader. It must be one of the readers in readers.
reader_version (str) – The version of the reader, which must be in the list of versions for the given reader_name in reader_versions.
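The digit layout described above can be sketched as follows. This is not the library's implementation: the function name, the integer return type, and the two-digit reader_index and version_index arguments (standing in for the lookups by reader name and version) are assumptions. The sketch follows the ten C positions shown in the format string.

```python
def sketch_reading_id(tcid, reader_index, version_index):
    """Compose an AABBCCCCCCCCCC-style ID as an integer.

    reader_index and version_index are hypothetical two-digit integers
    standing in for the reader and reader-version lookups.
    """
    if not 0 <= reader_index < 100 or not 0 <= version_index < 100:
        raise ValueError("reader and version indices must fit in two digits")
    tcid = int(tcid)
    if not 0 <= tcid < 10**10:
        raise ValueError("text content ID must fit in ten digits")
    # AA and BB occupy the leading digits; the content ID fills the last ten.
    return (reader_index * 100 + version_index) * 10**10 + tcid


def format_reading_id(reading_id):
    """Zero-pad the integer so all fourteen positions are visible."""
    return f"{reading_id:014d}"
```

For example, format_reading_id(sketch_reading_id('12345', 1, 3)) gives '01030000012345', which reads as reader 01, version 03, content ID 0000012345.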
- class indra_db.reading.read_db.DatabaseResultData(result, reading_id=None, db_info_id=None, indra_version=None)
Contains metadata for statements, as well as the statement itself.
This, like ReadingData, is primarily designed for use with the database, carrying valuable information and methods for such.
- Parameters:
result (an indra Result instance) – The result whose extra metadata this object encapsulates.
reading_id (int or None) – The id number of the entry in the readings table of the database. None if no such id is available.
indra_version (str or None) – Override the default indra version, which is the version of indra currently installed.
- class indra_db.reading.read_db.DatabaseMeshRefData(result, reading_id=None, db_info_id=None, indra_version=None)
- class indra_db.reading.read_db.DatabaseReader(tcids, reader, verbose=True, reading_mode='unread', rslt_mode='all', batch_size=1000, db=None, n_proc=1)
A class to run readings utilizing the database.
- Parameters:
tcids (iterable of ints) – An iterable (set, list, tuple, generator, etc) of integers referring to the primary keys of text content in the database.
reader (Reader) – An INDRA Reader object.
verbose (bool) – Optional, default True - If True, log and print the output of the command-line reader utilities; if False, don’t.
reading_mode (str : 'all', 'unread', or 'none') – Optional, default ‘unread’ - If ‘all’, read everything (generally slow); if ‘unread’, only read content that has not yet been read (the cache of old readings may still be used if rslt_mode=’all’ to get everything); if ‘none’, don’t read, and only retrieve existing readings.
rslt_mode (str : 'all', 'unread', or 'none') – Optional, default ‘all’ - If ‘all’, produce results for all content for all readers. If the readings were already produced, they will be retrieved from the database if reading_mode is ‘none’ or ‘unread’. If this option is ‘unread’, only the newly produced readings will be processed. If ‘none’, no results will be produced.
batch_size (int) – Optional, default 1000 - The number of text content entries to be yielded by the database at a given time.
db (indra_db.DatabaseManager instance) – Optional, default None, in which case the primary database provided by the get_db(‘primary’) function is used. Pass a value to interface with a different database.
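How reading_mode, rslt_mode, and batch_size interact can be sketched in plain Python. This is one reading of the parameter descriptions above, not the class's actual logic; plan_reading, batched, and the already_read set are hypothetical names introduced for illustration.

```python
from itertools import islice

MODES = ('all', 'unread', 'none')


def plan_reading(tcids, already_read, reading_mode='unread', rslt_mode='all'):
    """Return (to_read, to_process) sets of text content IDs.

    already_read is a hypothetical set of IDs that already have a
    stored reading for the reader in question.
    """
    if reading_mode not in MODES or rslt_mode not in MODES:
        raise ValueError("modes must be 'all', 'unread', or 'none'")
    tcids = set(tcids)
    cached = tcids & set(already_read)

    if reading_mode == 'all':
        to_read = tcids            # re-read everything
    elif reading_mode == 'unread':
        to_read = tcids - cached   # only content without a reading
    else:
        to_read = set()            # 'none': read nothing

    if rslt_mode == 'all':
        # New readings plus any cached readings retrieved from storage.
        to_process = to_read | cached
    elif rslt_mode == 'unread':
        to_process = set(to_read)  # only the newly produced readings
    else:
        to_process = set()
    return to_read, to_process


def batched(ids, batch_size=1000):
    """Yield lists of at most batch_size IDs, in order."""
    it = iter(ids)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

With the defaults, content that already has a reading is not re-read, but its cached reading still contributes results, which matches the note that the cache of old readings may still be used when rslt_mode is ‘all’.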
- indra_db.reading.read_db.process_content(text_content)
Get the appropriate content object from the text content.
- indra_db.reading.read_db.construct_readers(reader_names, **kwargs)
Construct the Reader objects from the names of the readers.
The Database Script for Running on AWS (indra_db.reading.read_db_aws)
This is the script used to run reading on AWS Batch, generally run from an AWS Lambda function.
This script is intended to be run on an Amazon ECS container, so information for the job either needs to be provided in environment variables (e.g., the REACH version and path) or loaded from S3 (e.g., the list of PMIDs).
A Class to Manage and Monitor AWS Batch Jobs (indra_db.reading.submitter)
Allow a manager to monitor the Batch jobs to prevent runaway jobs, and smooth out job runs and submissions.
This file acts as a script to run large batch jobs on AWS.
The key components are the DbReadingSubmitter class and the submit_db_reading function. The function is provided as a shallow wrapper for backwards compatibility, and may eventually be removed. The preferred method for running large batches from IPython, or from any Python environment, is the following:
>>> sub = DbReadingSubmitter('name_for_run', ['reach', 'sparser'])
>>> sub.set_options(prioritize=True)
>>> sub.submit_reading('file/location/of/ids_to_read.txt', 0, None, ids_per_job=1000)
>>> sub.watch_and_wait(idle_log_timeout=100, kill_on_timeout=True)
Additionally, this file may be run as a script. For details, run
bash$ python submit_reading_pipeline.py --help
in your favorite command line.