Pipeline Management CLI
This module creates a CLI for managing the pipelines used to update content and knowledge in the database, and move or transform that knowledge on a regular basis.
indra-db
INDRA Database Infrastructure CLI
The INDRA Database is both a physical database and an infrastructure for managing and updating the content of that physical database. This CLI is used for executing these management commands.
indra-db [OPTIONS] COMMAND [ARGS]...
content
Manage the text refs and content on the database.
indra-db content [OPTIONS] COMMAND [ARGS]...
list
List the current knowledge sources and their status.
indra-db content list [OPTIONS]
Options
- -l, --long
Include a list of the most recently added content for all source types.
run
Upload/update text refs and content on the database.
The currently available sources are “pubmed”, “pmc_oa”, and “manuscripts”.
indra-db content run [OPTIONS] {upload|update}
[[pubmed|pmc_oa|manuscripts]]...
Options
- -c, --continuing
Continue uploading or updating, picking up where you left off.
- -d, --debug
Run with debugging level output.
Arguments
- TASK
Required argument
- SOURCES
Optional argument(s)
dump
Manage the data dumps from Principal to files and Readonly.
indra-db dump [OPTIONS] COMMAND [ARGS]...
hierarchy
Dump hierarchy of Dumper classes to S3.
indra-db dump hierarchy [OPTIONS]
list
List existing dumps and their s3 paths.
If no option is given, all dumps will be listed.
indra-db dump list [OPTIONS] [[started|done|unfinished]]
Arguments
- STATE
Optional argument
load-readonly
Load the readonly database with a readonly schema dump.
indra-db dump load-readonly [OPTIONS]
Options
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
- --no-redirect-to-principal
If given, the lambda function serving the REST API will not be modified to redirect from the readonly database to the principal database while readonly is being loaded.
print-database-stats
Print the summary counts for the content on the database.
indra-db dump print-database-stats [OPTIONS]
run
Run dumps.
indra-db dump run [OPTIONS] COMMAND [ARGS]...
all
Generate new dumps and list existing dumps.
indra-db dump run all [OPTIONS]
Options
- -c, --continuing
Indicate whether you want the job to continue building an existing dump corpus, or if you want to start a new one.
- -d, --dump-only
Only generate the dumps on s3.
- -l, --load-only
Only load a readonly dump from s3 into the given readonly database.
- --delete-existing
Delete and restart an existing readonly schema in principal.
- --no-redirect-to-principal
If given, the lambda function serving the REST API will not be modified to redirect from the readonly database to the principal database while readonly is being loaded.
belief
Dump a dict of belief scores keyed by hash
indra-db dump run belief [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
end
Mark the dump as complete.
indra-db dump run end [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
full-pa-json
Dumps all statements found in FastRawPaLink as jsonl
indra-db dump run full-pa-json [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
full-pa-stmts
Dumps all statements found in FastRawPaLink as a pickle
indra-db dump run full-pa-stmts [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
mti-mesh-ids
Dump a mapping from Statement hashes to MeSH terms.
indra-db dump run mti-mesh-ids [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
principal-statistics
Dump a CSV of extensive counts of content in the principal database.
indra-db dump run principal-statistics [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
readonly
Generate the readonly schema, and dump it using pg_dump.
indra-db dump run readonly [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
res-pos
Dumps a dict of dicts with residue/position data from Modifications
indra-db dump run res-pos [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
sif
Dumps a pandas dataframe of preassembled statements
indra-db dump run sif [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
source-count
Dumps a dict of dicts with source counts per source api per statement
indra-db dump run source-count [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
start
Initialize the dump on s3, marking the start datetime of the dump.
indra-db dump run start [OPTIONS]
Options
- -c, --continuing
Add this flag to only create a new start if an unfinished start does not already exist.
kb
Manage the Knowledge Bases used by the database.
indra-db kb [OPTIONS] COMMAND [ARGS]...
list
List the knowledge sources and their status.
indra-db kb list [OPTIONS]
run
Upload/update the knowledge bases used by the database.
Specify which knowledge base sources to update by their name, e.g. “Pathway Commons” or “pc”. If not specified, all sources will be updated.
indra-db kb run [OPTIONS] {upload|update} [SOURCES]...
Arguments
- TASK
Required argument
- SOURCES
Optional argument(s)
pa
Manage the preassembly pipeline.
indra-db pa [OPTIONS] COMMAND [ARGS]...
list
List the latest updates for each type of Statement.
indra-db pa list [OPTIONS]
Options
- -r, --with-raw
Include the latest datetimes for raw statements of each type. This will take much longer.
run
Manage the indra_db preassembly.
A project name is required to tag the AWS instances with a “project” tag.
indra-db pa run [OPTIONS] {create|update} [PROJECT_NAME]
Arguments
- TASK
Required argument
- PROJECT_NAME
Optional argument
pipeline-stats
Manage the pipeline stats gathered on s3.
All major upload and update pipelines have basic timing and success-failure stats gathered on them using the DataGatherer class wrapper. These stats are displayed on the /monitor endpoint of the database service.
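As an illustration of this pattern (not the actual DataGatherer API, whose details are not shown here), a stats-gathering wrapper can be sketched as a decorator that records wall-clock time and success or failure for each pipeline step:

```python
import time
from functools import wraps


def gather_stats(stats):
    """Record duration and success/failure of a pipeline step in `stats`.

    Illustrative sketch only: the real wrapper also uploads its stats to
    s3 so they can be served from the /monitor endpoint.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                result = func(*args, **kwargs)
                stats[func.__name__] = {"status": "success"}
                return result
            except Exception:
                stats[func.__name__] = {"status": "failure"}
                raise
            finally:
                # Runs on both paths; setdefault guards against odd exits.
                stats.setdefault(func.__name__, {})["duration"] = \
                    time.time() - start
        return wrapper
    return decorator


stats = {}

@gather_stats(stats)
def upload_batch():
    # Hypothetical pipeline step standing in for a real upload.
    return 42

upload_batch()
```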
indra-db pipeline-stats [OPTIONS] {gather}
Arguments
- TASK
Required argument
reading
Manage the reading jobs.
indra-db reading [OPTIONS] COMMAND [ARGS]...
list
List the readers and their most recent runs.
indra-db reading list [OPTIONS]
run
Manage the reading of text content on AWS.
indra-db reading run [OPTIONS] {all|new}
Options
- -b, --buffer <buffer>
Set the number of buffer days to read prior to the most recent update. The default is 1 day.
- --project-name <project_name>
Set the project name to be different from the config default.
Arguments
- TASK
Required argument
run-local
Run reading locally, save the results on the database.
indra-db reading run-local [OPTIONS] {all|new}
Options
- -b, --buffer <buffer>
Set the number of buffer days to read prior to the most recent update. The default is 1 day.
- -n, --num-procs <num_procs>
Select the number of processors to use.
Arguments
- TASK
Required argument
xdd
Manage xDD runs.
indra-db xdd [OPTIONS] COMMAND [ARGS]...
run
Process the latest outputs from xDD.
indra-db xdd run [OPTIONS]
Pipeline CLI Implementations
Content (indra_db.cli.content)
The Content CLI manages the text content that is stored in the database. A parent class is defined, and managers for different sources (e.g. PubMed) can be defined by inheriting from this parent. This file is also used as the shell command to run updates of the content.
- class indra_db.cli.content.ContentManager[source]
Abstract class for all upload/update managers.
This abstract class provides the API required for any object that is used to manage content between its source and the database.
- filter_text_refs(db, tr_data_set, primary_id_types=None)[source]
Try to reconcile the data we have with what’s already on the db.
Note that this method is VERY slow in general, and therefore should be avoided whenever possible.
The process can be sped up considerably by multiple orders of magnitude if you specify a limited set of id types to query to get text refs. This does leave some possibility of missing relevant refs.
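The id-type restriction can be sketched as follows. This is a simplified illustration with in-memory dicts standing in for database rows, not the actual implementation; the function name and record shapes are assumptions for the example:

```python
def filter_text_refs(existing_refs, incoming, primary_id_types=("pmid", "pmcid")):
    """Split incoming text ref records into new vs. already-present.

    Restricting the comparison to a few id types (as `primary_id_types`
    does) keeps the lookup sets small, at the cost of possibly missing
    refs that would only match on another id type.
    """
    # Build one lookup set per requested id type.
    known = {
        id_type: {ref[id_type] for ref in existing_refs if ref.get(id_type)}
        for id_type in primary_id_types
    }
    new_refs = []
    for ref in incoming:
        if any(ref.get(t) in known[t] for t in primary_id_types if ref.get(t)):
            continue  # already on the db, by one of the primary ids
        new_refs.append(ref)
    return new_refs


existing = [{"pmid": "111", "pmcid": "PMC1"}]
incoming = [{"pmid": "111"}, {"pmid": "222"}]
new = filter_text_refs(existing, incoming)
```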
- class indra_db.cli.content.Pubmed(*args, categories=None, tables=None, max_annotations=500000, **kwargs)[source]
Manager for the pubmed/medline content.
For relevant updates from NCBI on the management and upkeep of the PubMed Abstract FTP server, see here:
- load_text_refs(db, tr_data, update_existing=False)[source]
Sanitize, update old, and upload new text refs.
- iter_contents(archives=None)[source]
Iterate over the files in the archive, yielding ref and content data.
- Parameters:
archives (Optional[Iterable[str]]) – The names of the archive files from the FTP server to process. If None, all available archives will be iterated over.
- Yields:
label (tuple) – A key representing the particular XML: (XML File Name, Entry Number, Total Entries)
text_ref_dict (dict) – A dictionary containing the text ref information.
text_content_dict (dict) – A dictionary containing the text content information.
- load_files(db, files, continuing=False, carefully=False, log_update=True)[source]
Load the files in the subdirectory indicated by dirname.
- populate(db, continuing=False)[source]
Perform the initial input of the pubmed content into the database.
- Parameters:
db (indra.db.DatabaseManager instance) – The database to which the data will be uploaded.
continuing (bool) – If true, assume that we are picking up after an error, or otherwise continuing from an earlier process. This means we will skip over source files contained in the database. If false, all files will be read and parsed.
- class indra_db.cli.content.PmcManager(*args, **kwargs)[source]
Abstract class for uploaders of PMC content: PmcOA and Manuscripts.
- upload_batch(db, tr_data, tc_data)[source]
Add a batch of text refs and text content to the database.
- iter_xmls(archives=None, continuing=False, pmcid_set=None)[source]
Iterate over the xmls in the given archives.
- Parameters:
archives (Optional[Iterable[str]]) – The names of the archive files from the FTP server to process. If None, all available archives will be iterated over.
continuing (Optional[bool]) – If True, look for locally saved archives to parse, saving the time of downloading.
pmcid_set (Optional[set[str]]) – A set of PMCIDs whose content you want returned from each archive. Many archives are massive repositories with 10s of thousands of papers in each, and only a fraction may need to be returned. Extracting and processing XMLs can be time consuming, so skipping those you don’t need can really pay off!
- Yields:
label (Tuple) – A key representing the particular XML: (Archive Name, Entry Number, Total Entries)
xml_name (str) – The name of the XML file.
xml_str (str) – The extracted XML string.
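A minimal sketch of the iteration and PMCID filtering described above, with archives modeled as in-memory dicts rather than tar files streamed from the FTP server (the archive and file names here are hypothetical):

```python
def iter_xmls(archives, pmcid_set=None):
    """Yield (label, xml_name, xml_str) for each entry in each archive.

    Sketch under the assumption that each archive is a dict mapping XML
    file names (e.g. "PMC12345.xml") to XML strings; the real manager
    extracts entries from downloaded tar archives.
    """
    for archive_name, entries in archives.items():
        total = len(entries)
        for i, (xml_name, xml_str) in enumerate(entries.items(), start=1):
            pmcid = xml_name.rsplit(".", 1)[0]
            # Skipping unwanted PMCIDs avoids extracting XMLs we don't need.
            if pmcid_set is not None and pmcid not in pmcid_set:
                continue
            yield (archive_name, i, total), xml_name, xml_str


archives = {"oa_comm.tar.gz": {"PMC1.xml": "<a/>", "PMC2.xml": "<b/>"}}
filtered = list(iter_xmls(archives, pmcid_set={"PMC2"}))
```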
- iter_contents(archives=None, continuing=False, pmcid_set=None)[source]
Iterate over the files in the archive, yielding ref and content data.
- Parameters:
archives (Optional[Iterable[str]]) – The names of the archive files from the FTP server to process. If None, all available archives will be iterated over.
continuing (Optional[bool]) – If True, look for locally saved archives to parse, saving the time of downloading.
pmcid_set (Optional[set[str]]) – A set of PMCIDs whose content you want returned from each archive. Many archives are massive repositories with 10s of thousands of papers in each, and only a fraction may need to be returned. Extracting and processing XMLs can be time consuming, so skipping those you don’t need can really pay off!
- Yields:
label (tuple) – A key representing the particular XML: (Archive Name, Entry Number, Total Entries)
text_ref_dict (dict) – A dictionary containing the text ref information.
text_content_dict (dict) – A dictionary containing the text content information.
- upload_archives(db, archives=None, continuing=False, pmcid_set=None, batch_size=10000)[source]
Do the grunt work of downloading and processing a list of archives.
- Parameters:
db (PrincipalDatabaseManager) – A handle to the principal database.
archives (Optional[Iterable[str]]) – An iterable of archive names from the FTP server.
continuing (bool) – If True, best effort will be made to avoid repeating work already done using some cached files and downloaded archives. If False, it is assumed the caches are empty.
pmcid_set (set[str]) – A set of PMC IDs to include from this list of archives.
batch_size (Optional[int]) – Default is 10,000. The number of pieces of content to submit to the database at a time.
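The batching implied by batch_size can be sketched generically; committing in bounded chunks keeps memory use and transaction size under control (a generic illustration, not the manager's internal code):

```python
def iter_batches(items, batch_size=10000):
    """Yield successive lists of at most `batch_size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly short, batch.
        yield batch
```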
- populate(db, continuing=False)[source]
Perform the initial population of the pmc content into the database.
- Parameters:
db (indra.db.DatabaseManager instance) – The database to which the data will be uploaded.
continuing (bool) – If true, assume that we are picking up after an error, or otherwise continuing from an earlier process. This means we will skip over source files contained in the database. If false, all files will be read and parsed.
- Returns:
completed – If True, an update was completed. Otherwise, the upload was aborted for some reason, often because the upload was already completed at some earlier time.
- Return type:
bool
- class indra_db.cli.content.PmcOA(*args, **kwargs)[source]
ContentManager for the pmc open access content.
For further details on the API, see the parent class: PmcManager.
- get_archives_after_date(min_date)[source]
Get the names of all single-article archives after the given date.
- find_all_missing_pmcids(db)[source]
Find PMCIDs available from the FTP server that are not in the DB.
- upload_all_missing_pmcids(db, archives_to_skip=None)[source]
This is a special case of update where we upload all missing PMCIDs instead of a regular incremental update.
- Parameters:
db (indra.db.DatabaseManager instance) – The database to which the data will be uploaded.
archives_to_skip (list[str] or None) – A list of archives to skip. Processing each archive is time-consuming, so we can skip some archives if we have already processed them. Note that if 100% of the articles from a given archive are already in the database, it will be skipped automatically; this parameter is only used to skip archives that have some articles that could not be uploaded (e.g. because of text ref conflicts, etc.).
- class indra_db.cli.content.Manuscripts(*args, **kwargs)[source]
ContentManager for the pmc manuscripts.
For further details on the API, see the parent class: PmcManager.
- get_tarname_from_filename(fname)[source]
Get the name of the tar file based on the file name (or a pmcid).
- update(db)[source]
Add any new content found in the archives.
Note that this is very much the same as populating for manuscripts, as there are no finer grained means of getting manuscripts than just looking through the massive archive files. We do check to see if there are any new listings in each file, minimizing the amount of time spent downloading and searching; however, this will in general be the slowest of the update methods.
The continuing feature isn’t implemented yet.
Reading (indra_db.cli.reading)
The Reading CLI handles the reading of the text content and the processing of those readings into statements. As with the Content CLI, different reading pipelines can be handled by defining children of a parent class.
- class indra_db.cli.reading.ReadingManager(reader_names, buffer_days=1, only_unread=False)[source]
Abstract class for managing the readings of the database.
- Parameters:
reader_names (list[str]) – A list of the names of the readers to be used in a given run of reading.
buffer_days (int) – The number of days before the previous update/initial upload to look for “new” content to be read. This prevents any issues with overlaps between the content upload pipeline and the reading pipeline.
only_unread (bool) – Only read papers that have not been read (making the determination can be expensive).
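The buffer_days logic amounts to shifting the "new content" cutoff back from the previous update; a minimal sketch (the function name is an assumption for illustration):

```python
from datetime import datetime, timedelta


def reading_cutoff(last_update, buffer_days=1):
    """Compute the datetime after which content counts as "new" to read.

    Going back `buffer_days` before the previous update guards against
    content that landed while the last pipeline run was in flight.
    """
    return last_update - timedelta(days=buffer_days)


cutoff = reading_cutoff(datetime(2021, 5, 10), buffer_days=1)
```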
- class indra_db.cli.reading.BulkReadingManager(reader_names, buffer_days=1, only_unread=False)[source]
An abstract class which defines methods required for reading in bulk.
This takes exactly the parameters used by ReadingManager.
- class indra_db.cli.reading.BulkAwsReadingManager(*args, **kwargs)[source]
This is the reading manager when updating using AWS Batch.
This takes all the parameters used by BulkReadingManager, and in addition:
- Parameters:
project_name (str) – You can select a name for the project for which this reading is being run. This name has a default value set in your config file. The batch jobs used in reading will be tagged with this project name, for accounting purposes.
- class indra_db.cli.reading.BulkLocalReadingManager(*args, **kwargs)[source]
This is the reading manager to be used when running reading locally.
This takes all the parameters used by BulkReadingManager, and in addition:
- Parameters:
n_proc (int) – The number of processes to dedicate to reading. Note that some of the readers (e.g. REACH) do not always obey these restrictions.
verbose (bool) – If True, more detailed logs will be printed. Default is False.
PreAssembly (indra_db.cli.preassembly)
The Preassembly CLI manages the preassembly pipeline, deploying preassembly jobs to AWS Batch.
- indra_db.cli.preassembly.list_last_updates(db)[source]
Return a dict of the most recent updates for each statement type.
- indra_db.cli.preassembly.list_latest_raw_stmts(db)[source]
Return a dict of the most recent new raw statement for each type.
- indra_db.cli.preassembly.run_preassembly(mode, project_name)[source]
Construct a submitter and begin submitting jobs to Batch for preassembly.
This function will determine which statement types need to be updated and how far back they go, will create the appropriate PreassemblySubmitter instance, and will run the jobs with pre-set parameters on statement types that need updating.
- Parameters:
project_name (str) – This name is used to tag the various AWS resources for accounting purposes.
Knowledge Bases (indra_db.cli.knowledgebase)
The INDRA Database also derives much of its knowledge from external databases and other resources not extracted from plain text, referred to in this repo as "knowledge bases", so as to avoid the ambiguity of "database". This CLI handles the updates of those knowledge bases, each of which requires different handling.
- class indra_db.cli.knowledgebase.TasManager[source]
This manager handles retrieval and processing of the TAS dataset.
Static Dumps (indra_db.cli.dump)
This handles the generation of static dumps, including the readonly database from the principal database.
- indra_db.cli.dump.list_dumps(started=None, ended=None)[source]
List all dumps, optionally filtered by their status.
- Parameters:
started (Optional[bool]) – If True, find dumps that have started. If False, find dumps that have NOT been started. If None, do not filter by start status.
ended (Optional[bool]) – The same as started, but checking whether the dump is ended or not.
- Returns:
Each S3Path object contains the bucket and key prefix information for a set of dump files, e.g.
- [S3Path(bigmech, indra-db/dumps/2020-07-16/), S3Path(bigmech, indra-db/dumps/2020-08-28/), S3Path(bigmech, indra-db/dumps/2020-09-18/), S3Path(bigmech, indra-db/dumps/2020-11-12/), S3Path(bigmech, indra-db/dumps/2020-11-13/)]
- Return type:
list of S3Path objects
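The status filtering can be sketched as follows, with each dump prefix modeled as a set of file names, and start.json / end.json standing in for the status markers (the marker names and data shapes are assumptions for this illustration, not the module's actual S3 layout):

```python
def list_dumps(prefixes, started=None, ended=None):
    """Filter dump prefixes by start/end status.

    `prefixes` maps a dump prefix to the set of files under it; None for
    `started` or `ended` means "do not filter on that status".
    """
    selected = []
    for prefix, files in prefixes.items():
        has_start = "start.json" in files
        has_end = "end.json" in files
        if started is not None and has_start != started:
            continue
        if ended is not None and has_end != ended:
            continue
        selected.append(prefix)
    return selected


dumps = {
    "dumps/2020-07-16/": {"start.json", "end.json"},
    "dumps/2020-11-13/": {"start.json"},
}
unfinished = list_dumps(dumps, started=True, ended=False)
```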
- indra_db.cli.dump.get_latest_dump_s3_path(dumper_name)[source]
Get the latest version of a dump file by the given name.
Searches dumps that have already been started and gets the full S3 file path for the latest version of the dump of that type (e.g. “sif”, “belief”, “source_count”, etc.)
- Parameters:
dumper_name (str) – The standardized name for the dumper classes defined in this module, defined in the name class attribute of the dumper object. E.g., the standard dumper name "sif" can be obtained from Sif.name.
- Return type:
Union[S3Path, None]
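A sketch of the latest-version selection, assuming date-stamped prefixes and a hypothetical ".pkl" naming scheme (the real dump files vary by type): since ISO dates sort lexicographically, the maximum string is the most recent path.

```python
def get_latest_dump_path(dump_paths, dumper_name):
    """Return the most recent dump path for `dumper_name`, or None.

    Assumes paths of the form "dumps/<YYYY-MM-DD>/<name>.pkl", so plain
    string ordering matches chronological ordering.
    """
    matches = [p for p in dump_paths if p.endswith(f"/{dumper_name}.pkl")]
    return max(matches, default=None)


paths = [
    "dumps/2020-07-16/sif.pkl",
    "dumps/2020-11-13/sif.pkl",
    "dumps/2020-11-13/belief.pkl",
]
latest_sif = get_latest_dump_path(paths, "sif")
```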
- class indra_db.cli.dump.Start(*args, **kwargs)[source]
Initialize the dump on s3, marking the start datetime of the dump.
- class indra_db.cli.dump.PrincipalStats(start=None, date_stamp=None, **kwargs)[source]
Dump a CSV of extensive counts of content in the principal database.
- class indra_db.cli.dump.Belief(start=None, date_stamp=None, **kwargs)[source]
Dump a dict of belief scores keyed by hash
- class indra_db.cli.dump.Readonly(start=None, date_stamp=None, **kwargs)[source]
Generate the readonly schema, and dump it using pg_dump.
- class indra_db.cli.dump.SourceCount(start, use_principal=True, **kwargs)[source]
Dumps a dict of dicts with source counts per source api per statement
- class indra_db.cli.dump.ResiduePosition(start, use_principal=True, **kwargs)[source]
Dumps a dict of dicts with residue/position data from Modifications
- class indra_db.cli.dump.FullPaStmts(start, use_principal=False, **kwargs)[source]
Dumps all statements found in FastRawPaLink as a pickle
- class indra_db.cli.dump.FullPaJson(start, use_principal=False, **kwargs)[source]
Dumps all statements found in FastRawPaLink as jsonl
- class indra_db.cli.dump.Sif(start, use_principal=False, **kwargs)[source]
Dumps a pandas dataframe of preassembled statements
- class indra_db.cli.dump.StatementHashMeshId(start, use_principal=False, **kwargs)[source]
Dump a mapping from Statement hashes to MeSH terms.
- class indra_db.cli.dump.End(start=None, date_stamp=None, **kwargs)[source]
Mark the dump as complete.
- indra_db.cli.dump.dump(principal_db, readonly_db=None, delete_existing=False, allow_continue=True, load_only=False, dump_only=False, no_redirect_to_principal=True)[source]
Run the suite of dumps in the specified order.
- Parameters:
principal_db (indra_db.databases.PrincipalDatabaseManager) – A handle to the principal database.
readonly_db (indra_db.databases.ReadonlyDatabaseManager) – A handle to the readonly database. Optional when running dump only.
delete_existing (bool) – If True, clear out the existing readonly build from the principal database. Otherwise it will be continued. (Default is False)
allow_continue (bool) – If True, each step will assume that it may already have been done, and where possible the work will be picked up where it was left off. (Default is True)
load_only (bool) – No new dumps will be created, but an existing dump will be used to populate the given readonly database. (Default is False)
dump_only (bool) – Do not load a new readonly database, only produce the dump files on s3. (Default is False)
no_redirect_to_principal (bool) – If False (default), and if we are running without dump_only (i.e., we are also loading a dump into a readonly DB), then we redirect the lambda function driving the REST API to the readonly schema in the principal DB while the readonly DB is being restored. If True, this redirect is not attempted and we assume it is okay if the readonly DB being restored is not accessible for the duration of the load.
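The interaction of load_only and dump_only can be sketched as simple phase selection. This illustrates only the flag logic; the real function also handles continuation, deletion of existing schemas, and the REST API redirect:

```python
def plan_dump(load_only=False, dump_only=False):
    """Decide which phases of the dump suite to run.

    load_only suppresses creating new dump files; dump_only suppresses
    restoring a dump into the readonly database.
    """
    do_dump = not load_only   # create new dump files on s3
    do_load = not dump_only   # load a readonly dump into the readonly db
    return {"dump": do_dump, "load": do_load}
```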