Pipeline Management CLI
This module creates a CLI for managing the pipelines used to update content and knowledge in the database, and move or transform that knowledge on a regular basis.
indra-db
INDRA Database Infrastructure CLI
The INDRA Database is both a physical database and an infrastructure for managing and updating the content of that physical database. This CLI is used for executing these management commands.
indra-db [OPTIONS] COMMAND [ARGS]...
content
Manage the text refs and content on the database.
indra-db content [OPTIONS] COMMAND [ARGS]...
list
List the current knowledge sources and their status.
indra-db content list [OPTIONS]
Options
- -l, --long
Include a list of the most recently added content for all source types.
run
Upload/update text refs and content on the database.
The currently available sources are “pubmed”, “pmc_oa”, and “manuscripts”.
indra-db content run [OPTIONS] {upload|update}
[[pubmed|pmc_oa|manuscripts]]...
Options
- -c, --continuing
Continue uploading or updating, picking up where you left off.
- -d, --debug
Run with debugging level output.
Arguments
- TASK
Required argument
- SOURCES
Optional argument(s)
dump
Manage the data dumps from Principal to files and Readonly.
indra-db dump [OPTIONS] COMMAND [ARGS]...
hierarchy
Dump hierarchy of Dumper classes to S3.
indra-db dump hierarchy [OPTIONS]
list
List existing dumps and their s3 paths.
If no option is given, all dumps will be listed.
indra-db dump list [OPTIONS] [[started|done|unfinished]]
Arguments
- STATE
Optional argument
load-readonly
Load the readonly database with a readonly schema dump.
indra-db dump load-readonly [OPTIONS]
Options
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
- --no-redirect-to-principal
If given, the lambda function serving the REST API will not be modified to redirect from the readonly database to the principal database while readonly is being loaded.
print-database-stats
Print the summary counts for the content on the database.
indra-db dump print-database-stats [OPTIONS]
run
Run dumps.
indra-db dump run [OPTIONS] COMMAND [ARGS]...
all
Generate new dumps and list existing dumps.
indra-db dump run all [OPTIONS]
Options
- -c, --continuing
Indicate whether you want the job to continue building an existing dump corpus, or if you want to start a new one.
- -d, --dump-only
Only generate the dumps on s3.
- -l, --load-only
Only load a readonly dump from s3 into the given readonly database.
- --delete-existing
Delete and restart an existing readonly schema in principal.
- --no-redirect-to-principal
If given, the lambda function serving the REST API will not be modified to redirect from the readonly database to the principal database while readonly is being loaded.
belief
Dump a dict of belief scores keyed by hash
indra-db dump run belief [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
end
Mark the dump as complete.
indra-db dump run end [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
full-pa-json
Dumps all statements found in FastRawPaLink as jsonl
indra-db dump run full-pa-json [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
full-pa-stmts
Dumps all statements found in FastRawPaLink as a pickle
indra-db dump run full-pa-stmts [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
mti-mesh-ids
Dump a mapping from Statement hashes to MeSH terms.
indra-db dump run mti-mesh-ids [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
principal-statistics
Dump a CSV of extensive counts of content in the principal database.
indra-db dump run principal-statistics [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
readonly
Generate the readonly schema, and dump it using pg_dump.
indra-db dump run readonly [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
res-pos
Dumps a dict of dicts with residue/position data from Modifications
indra-db dump run res-pos [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
sif
Dumps a pandas dataframe of preassembled statements
indra-db dump run sif [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
source-count
Dumps a dict of dicts with source counts per source api per statement
indra-db dump run source-count [OPTIONS]
Options
- -c, --continuing
Continue a partial dump, if applicable.
- -d, --date-stamp <date_stamp>
Provide a datestamp with which to mark this dump. The default is the same as the start dump from which this is built.
- -f, --force
Run the build even if the dump file has already been produced.
- --from-dump <from_dump>
Indicate a specific start dump from which to build. The default is the most recent.
start
Initialize the dump on s3, marking the start datetime of the dump.
indra-db dump run start [OPTIONS]
Options
- -c, --continuing
Add this flag to only create a new start if an unfinished start does not already exist.
kb
Manage the Knowledge Bases used by the database.
indra-db kb [OPTIONS] COMMAND [ARGS]...
list
List the knowledge sources and their status.
indra-db kb list [OPTIONS]
run
Upload/update the knowledge bases used by the database.
Specify which knowledge base sources to update by their name, e.g. “Pathway Commons” or “pc”. If not specified, all sources will be updated.
indra-db kb run [OPTIONS] {upload|update} [SOURCES]...
Arguments
- TASK
Required argument
- SOURCES
Optional argument(s)
pa
Manage the preassembly pipeline.
indra-db pa [OPTIONS] COMMAND [ARGS]...
list
List the latest updates for each type of Statement.
indra-db pa list [OPTIONS]
Options
- -r, --with-raw
Include the latest datetimes for raw statements of each type. This will take much longer.
run
Manage the indra_db preassembly.
A project name is required to tag the AWS instances with a “project” tag.
indra-db pa run [OPTIONS] {create|update} [PROJECT_NAME]
Arguments
- TASK
Required argument
- PROJECT_NAME
Optional argument
pipeline-stats
Manage the pipeline stats gathered on s3.
All major upload and update pipelines have basic timing and success-failure stats gathered on them using the DataGatherer class wrapper. These stats are displayed on the /monitor endpoint of the database service.
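As an illustration of this pattern (not the actual DataGatherer API, whose details are not shown here), a stats-gathering wrapper can be sketched as a decorator that records wall-clock time and success or failure for each pipeline step:

```python
import time
from functools import wraps


def gather_stats(stats):
    """Record duration and success/failure of a pipeline step in `stats`.

    Illustrative sketch only: the real wrapper also uploads its stats to
    s3 so they can be served from the /monitor endpoint.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                result = func(*args, **kwargs)
                stats[func.__name__] = {"status": "success"}
                return result
            except Exception:
                stats[func.__name__] = {"status": "failure"}
                raise
            finally:
                # Runs on both paths; setdefault guards against odd exits.
                stats.setdefault(func.__name__, {})["duration"] = \
                    time.time() - start
        return wrapper
    return decorator


stats = {}

@gather_stats(stats)
def upload_batch():
    # Hypothetical pipeline step standing in for a real upload.
    return 42

upload_batch()
```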
indra-db pipeline-stats [OPTIONS] {gather}
Arguments
- TASK
Required argument
reading
Manage the reading jobs.
indra-db reading [OPTIONS] COMMAND [ARGS]...
list
List the readers and their most recent runs.
indra-db reading list [OPTIONS]
run
Manage the reading of text content on AWS.
indra-db reading run [OPTIONS] {all|new}
Options
- -b, --buffer <buffer>
Set the number of buffer days to read prior to the most recent update. The default is 1 day.
- --project-name <project_name>
Set the project name to be different from the config default.
Arguments
- TASK
Required argument
run-local
Run reading locally, save the results on the database.
indra-db reading run-local [OPTIONS] {all|new}
Options
- -b, --buffer <buffer>
Set the number of buffer days to read prior to the most recent update. The default is 1 day.
- -n, --num-procs <num_procs>
Select the number of processors to use.
Arguments
- TASK
Required argument
xdd
Manage xDD runs.
indra-db xdd [OPTIONS] COMMAND [ARGS]...
run
Process the latest outputs from xDD.
indra-db xdd run [OPTIONS]
Pipeline CLI Implementations
Content (indra_db.cli.content)
The Content CLI manages the text content that is stored in the database. A parent class is defined, and managers for different sources (e.g. PubMed) can be defined by inheriting from this parent. This file is also used as the shell command to run updates of the content.
- class indra_db.cli.content.ContentManager[source]
Abstract class for all upload/update managers.
This abstract class provides the API required for any object that is used to manage content between its source and the database.
- filter_text_refs(db, tr_data_set, primary_id_types=None)[source]
Try to reconcile the data we have with what’s already on the db.
Note that this method is VERY slow in general, and therefore should be avoided whenever possible.
The process can be sped up considerably by multiple orders of magnitude if you specify a limited set of id types to query to get text refs. This does leave some possibility of missing relevant refs.
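The id-type restriction can be sketched as follows. This is a simplified illustration with in-memory dicts standing in for database rows, not the actual implementation; the function name and record shapes are assumptions for the example:

```python
def filter_text_refs(existing_refs, incoming, primary_id_types=("pmid", "pmcid")):
    """Split incoming text ref records into new vs. already-present.

    Restricting the comparison to a few id types (as `primary_id_types`
    does) keeps the lookup sets small, at the cost of possibly missing
    refs that would only match on another id type.
    """
    # Build one lookup set per requested id type.
    known = {
        id_type: {ref[id_type] for ref in existing_refs if ref.get(id_type)}
        for id_type in primary_id_types
    }
    new_refs = []
    for ref in incoming:
        if any(ref.get(t) in known[t] for t in primary_id_types if ref.get(t)):
            continue  # already on the db, by one of the primary ids
        new_refs.append(ref)
    return new_refs


existing = [{"pmid": "111", "pmcid": "PMC1"}]
incoming = [{"pmid": "111"}, {"pmid": "222"}]
new = filter_text_refs(existing, incoming)
```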
- class indra_db.cli.content.Pubmed(*args, categories=None, tables=None, max_annotations=500000, **kwargs)[source]
Manager for the pubmed/medline content.
For relevant updates from NCBI on the management and upkeep of the PubMed Abstract FTP server, see here:
- load_text_refs(db, tr_data, update_existing=False)[source]
Sanitize, update old, and upload new text refs.
- iter_contents(archives=None)[source]
Iterate over the files in the archive, yielding ref and content data.
- Parameters:
archives (Optional[Iterable[str]]) – The names of the archive files from the FTP server to process. If None, all available archives will be iterated over.
- Yields:
label (tuple) – A key representing the particular XML: (XML File Name, Entry Number, Total Entries)
text_ref_dict (dict) – A dictionary containing the text ref information.
text_content_dict (dict) – A dictionary containing the text content information.
- load_files(db, files, continuing=False, carefully=False, log_update=True)[source]
Load the files in the subdirectory indicated by dirname.
- populate(db, continuing=False)[source]
Perform the initial input of the pubmed content into the database.
- Parameters:
db (indra.db.DatabaseManager instance) – The database to which the data will be uploaded.
continuing (bool) – If true, assume that we are picking up after an error, or otherwise continuing from an earlier process. This means we will skip over source files contained in the database. If false, all files will be read and parsed.
- class indra_db.cli.content.PmcManager(*args, **kwargs)[source]
Abstract class for uploaders of PMC content: PmcOA and Manuscripts.
- upload_batch(db, tr_data, tc_data)[source]
Add a batch of text refs and text content to the database.
- iter_xmls(archives=None, continuing=False, pmcid_set=None)[source]
Iterate over the xmls in the given archives.
- Parameters:
archives (Optional[Iterable[str]]) – The names of the archive files from the FTP server to process. If None, all available archives will be iterated over.
continuing (Optional[bool]) – If True, look for locally saved archives to parse, saving the time of downloading.
pmcid_set (Optional[set[str]]) – A set of PMCIDs whose content you want returned from each archive. Many archives are massive repositories with 10s of thousands of papers in each, and only a fraction may need to be returned. Extracting and processing XMLs can be time consuming, so skipping those you don’t need can really pay off!
- Yields:
label (Tuple) – A key representing the particular XML: (Archive Name, Entry Number, Total Entries)
xml_name (str) – The name of the XML file.
xml_str (str) – The extracted XML string.
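A minimal sketch of the iteration and PMCID filtering described above, with archives modeled as in-memory dicts rather than tar files streamed from the FTP server (the archive and file names here are hypothetical):

```python
def iter_xmls(archives, pmcid_set=None):
    """Yield (label, xml_name, xml_str) for each entry in each archive.

    Sketch under the assumption that each archive is a dict mapping XML
    file names (e.g. "PMC12345.xml") to XML strings; the real manager
    extracts entries from downloaded tar archives.
    """
    for archive_name, entries in archives.items():
        total = len(entries)
        for i, (xml_name, xml_str) in enumerate(entries.items(), start=1):
            pmcid = xml_name.rsplit(".", 1)[0]
            # Skipping unwanted PMCIDs avoids extracting XMLs we don't need.
            if pmcid_set is not None and pmcid not in pmcid_set:
                continue
            yield (archive_name, i, total), xml_name, xml_str


archives = {"oa_comm.tar.gz": {"PMC1.xml": "<a/>", "PMC2.xml": "<b/>"}}
filtered = list(iter_xmls(archives, pmcid_set={"PMC2"}))
```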
- iter_contents(archives=None, continuing=False, pmcid_set=None)[source]
Iterate over the files in the archive, yielding ref and content data.
- Parameters:
archives (Optional[Iterable[str]]) – The names of the archive files from the FTP server to process. If None, all available archives will be iterated over.
continuing (Optional[bool]) – If True, look for locally saved archives to parse, saving the time of downloading.
pmcid_set (Optional[set[str]]) – A set of PMCIDs whose content you want returned from each archive. Many archives are massive repositories with 10s of thousands of papers in each, and only a fraction may need to be returned. Extracting and processing XMLs can be time consuming, so skipping those you don’t need can really pay off!
- Yields:
label (tuple) – A key representing the particular XML: (Archive Name, Entry Number, Total Entries)
text_ref_dict (dict) – A dictionary containing the text ref information.
text_content_dict (dict) – A dictionary containing the text content information.
- upload_archives(db, archives=None, continuing=False, pmcid_set=None, batch_size=10000)[source]
Do the grunt work of downloading and processing a list of archives.
- Parameters:
db (PrincipalDatabaseManager) – A handle to the principal database.
archives (Optional[Iterable[str]]) – An iterable of archive names from the FTP server.
continuing (bool) – If True, best effort will be made to avoid repeating work already done using some cached files and downloaded archives. If False, it is assumed the caches are empty.
pmcid_set (set[str]) – A set of PMC IDs to include from this list of archives.
batch_size (Optional[int]) – Default is 10,000. The number of pieces of content to submit to the database at a time.
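The batching implied by batch_size can be sketched generically; committing in bounded chunks keeps memory use and transaction size under control (a generic illustration, not the manager's internal code):

```python
def iter_batches(items, batch_size=10000):
    """Yield successive lists of at most `batch_size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly short, batch.
        yield batch
```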
- populate(db, continuing=False)[source]
Perform the initial population of the pmc content into the database.
- Parameters:
db (indra.db.DatabaseManager instance) – The database to which the data will be uploaded.
continuing (bool) – If true, assume that we are picking up after an error, or otherwise continuing from an earlier process. This means we will skip over source files contained in the database. If false, all files will be read and parsed.
- Returns:
completed – If True, an update was completed. Otherwise, the upload was aborted for some reason, often because the upload was already completed at some earlier time.
- Return type:
bool
- class indra_db.cli.content.PmcOA(*args, **kwargs)[source]
ContentManager for the pmc open access content.
For further details on the API, see the parent class: PmcManager.
- get_archives_after_date(min_date)[source]
Get the names of all single-article archives after the given date.
- find_all_missing_pmcids(db)[source]
Find PMCIDs available from the FTP server that are not in the DB.
- upload_all_missing_pmcids(db, archives_to_skip=None)[source]
This is a special case of update where we upload all missing PMCIDs instead of a regular incremental update.
- Parameters:
db (indra.db.DatabaseManager instance) – The database to which the data will be uploaded.
archives_to_skip (list[str] or None) – A list of archives to skip. Processing each archive is time-consuming, so we can skip some archives if we have already processed them. Note that if 100% of the articles from a given archive are already in the database, it will be skipped automatically; this parameter is only used to skip archives that have some articles that could not be uploaded (e.g. because of text ref conflicts, etc.).
- class indra_db.cli.content.Manuscripts(*args, **kwargs)[source]
ContentManager for the pmc manuscripts.
For further details on the API, see the parent class: PmcManager.
- get_tarname_from_filename(fname)[source]
Get the name of the tar file based on the file name (or a pmcid).
- update(db)[source]
Add any new content found in the archives.
Note that this is very much the same as populating for manuscripts, as there are no finer grained means of getting manuscripts than just looking through the massive archive files. We do check to see if there are any new listings in each file, minimizing the amount of time spent downloading and searching; however, this will in general be the slowest of the update methods.
The continuing feature isn’t implemented yet.
Reading (indra_db.cli.reading)
The Reading CLI handles the reading of the text content and the processing of those readings into statements. As with the Content CLI, different reading pipelines can be handled by defining children of a parent class.
- class indra_db.cli.reading.ReadingManager(reader_names, buffer_days=1, only_unread=False)[source]
Abstract class for managing the readings of the database.
- Parameters:
reader_names (list[str]) – A list of the names of the readers to be used in a given run of reading.
buffer_days (int) – The number of days before the previous update/initial upload to look for “new” content to be read. This prevents any issues with overlaps between the content upload pipeline and the reading pipeline.
only_unread (bool) – Only read papers that have not been read (making the determination can be expensive).
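The buffer_days logic amounts to shifting the "new content" cutoff back from the previous update; a minimal sketch (the function name is an assumption for illustration):

```python
from datetime import datetime, timedelta


def reading_cutoff(last_update, buffer_days=1):
    """Compute the datetime after which content counts as "new" to read.

    Going back `buffer_days` before the previous update guards against
    content that landed while the last pipeline run was in flight.
    """
    return last_update - timedelta(days=buffer_days)


cutoff = reading_cutoff(datetime(2021, 5, 10), buffer_days=1)
```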
- class indra_db.cli.reading.BulkReadingManager(reader_names, buffer_days=1, only_unread=False)[source]
An abstract class which defines methods required for reading in bulk.
This takes exactly the parameters used by ReadingManager.
- class indra_db.cli.reading.BulkAwsReadingManager(*args, **kwargs)[source]
This is the reading manager when updating using AWS Batch.
This takes all the parameters used by BulkReadingManager, and in addition:
- Parameters:
project_name (str) – You can select a name for the project for which this reading is being run. This name has a default value set in your config file. The batch jobs used in reading will be tagged with this project name, for accounting purposes.
- class indra_db.cli.reading.BulkLocalReadingManager(*args, **kwargs)[source]
This is the reading manager to be used when running reading locally.
This takes all the parameters used by BulkReadingManager, and in addition:
- Parameters:
n_proc (int) – The number of processes to dedicate to reading. Note that some of the readers (e.g. REACH) do not always obey these restrictions.
verbose (bool) – If True, more detailed logs will be printed. Default is False.
PreAssembly (indra_db.cli.preassembly)
The Preassembly CLI manages the preassembly pipeline, deploying preassembly jobs to AWS Batch.
- indra_db.cli.preassembly.list_last_updates(db)[source]
Return a dict of the most recent updates for each statement type.
- indra_db.cli.preassembly.list_latest_raw_stmts(db)[source]
Return a dict of the most recent new raw statement for each type.
- indra_db.cli.preassembly.run_preassembly(mode, project_name)[source]
Construct a submitter and begin submitting jobs to Batch for preassembly.
This function will determine which statement types need to be updated and how far back they go, will create the appropriate PreassemblySubmitter instance, and will run the jobs with pre-set parameters on statement types that need updating.
- Parameters:
project_name (str) – This name is used to tag the various AWS resources for accounting purposes.
Knowledge Bases (indra_db.cli.knowledgebase)
The INDRA Database also derives much of its knowledge from external databases and other resources not extracted from plain text, referred to in this repo as "knowledge bases", so as to avoid the ambiguity of "database". This CLI handles the updates of those knowledge bases, each of which requires different handling.
- class indra_db.cli.knowledgebase.TasManager[source]
This manager handles retrieval and processing of the TAS dataset.
Static Dumps (indra_db.cli.dump)
This handles the generation of static dumps, including the readonly database from the principal database.
- indra_db.cli.dump.list_dumps(started=None, ended=None)[source]
List all dumps, optionally filtered by their status.
- Parameters:
started (Optional[bool]) – If True, find dumps that have started. If False, find dumps that have NOT been started. If None, do not filter by start status.
ended (Optional[bool]) – The same as started, but checking whether the dump is ended or not.
- Returns:
Each S3Path object contains the bucket and key prefix information for a set of dump files, e.g.
- [S3Path(bigmech, indra-db/dumps/2020-07-16/), S3Path(bigmech, indra-db/dumps/2020-08-28/), S3Path(bigmech, indra-db/dumps/2020-09-18/), S3Path(bigmech, indra-db/dumps/2020-11-12/), S3Path(bigmech, indra-db/dumps/2020-11-13/)]
- Return type:
list of S3Path objects
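The status filtering can be sketched as follows, with each dump prefix modeled as a set of file names, and start.json / end.json standing in for the status markers (the marker names and data shapes are assumptions for this illustration, not the module's actual S3 layout):

```python
def list_dumps(prefixes, started=None, ended=None):
    """Filter dump prefixes by start/end status.

    `prefixes` maps a dump prefix to the set of files under it; None for
    `started` or `ended` means "do not filter on that status".
    """
    selected = []
    for prefix, files in prefixes.items():
        has_start = "start.json" in files
        has_end = "end.json" in files
        if started is not None and has_start != started:
            continue
        if ended is not None and has_end != ended:
            continue
        selected.append(prefix)
    return selected


dumps = {
    "dumps/2020-07-16/": {"start.json", "end.json"},
    "dumps/2020-11-13/": {"start.json"},
}
unfinished = list_dumps(dumps, started=True, ended=False)
```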
- indra_db.cli.dump.get_latest_dump_s3_path(dumper_name)[source]
Get the latest version of a dump file by the given name.
Searches dumps that have already been started and gets the full S3 file path for the latest version of the dump of that type (e.g. “sif”, “belief”, “source_count”, etc.)
- Parameters:
dumper_name (str) – The standardized name for the dumper classes defined in this module, defined in the name class attribute of the dumper object. E.g., the standard dumper name "sif" can be obtained from Sif.name.
- Return type:
Union[S3Path, None]
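A sketch of the latest-version selection, assuming date-stamped prefixes and a hypothetical ".pkl" naming scheme (the real dump files vary by type): since ISO dates sort lexicographically, the maximum string is the most recent path.

```python
def get_latest_dump_path(dump_paths, dumper_name):
    """Return the most recent dump path for `dumper_name`, or None.

    Assumes paths of the form "dumps/<YYYY-MM-DD>/<name>.pkl", so plain
    string ordering matches chronological ordering.
    """
    matches = [p for p in dump_paths if p.endswith(f"/{dumper_name}.pkl")]
    return max(matches, default=None)


paths = [
    "dumps/2020-07-16/sif.pkl",
    "dumps/2020-11-13/sif.pkl",
    "dumps/2020-11-13/belief.pkl",
]
latest_sif = get_latest_dump_path(paths, "sif")
```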
- class indra_db.cli.dump.Start(*args, **kwargs)[source]
Initialize the dump on s3, marking the start datetime of the dump.
- class indra_db.cli.dump.PrincipalStats(start=None, date_stamp=None, **kwargs)[source]
Dump a CSV of extensive counts of content in the principal database.
- class indra_db.cli.dump.Belief(start=None, date_stamp=None, **kwargs)[source]
Dump a dict of belief scores keyed by hash
- class indra_db.cli.dump.Readonly(start=None, date_stamp=None, **kwargs)[source]
Generate the readonly schema, and dump it using pg_dump.
- class indra_db.cli.dump.SourceCount(start, use_principal=True, **kwargs)[source]
Dumps a dict of dicts with source counts per source api per statement
- class indra_db.cli.dump.ResiduePosition(start, use_principal=True, **kwargs)[source]
Dumps a dict of dicts with residue/position data from Modifications
- class indra_db.cli.dump.FullPaStmts(start, use_principal=False, **kwargs)[source]
Dumps all statements found in FastRawPaLink as a pickle
- class indra_db.cli.dump.FullPaJson(start, use_principal=False, **kwargs)[source]
Dumps all statements found in FastRawPaLink as jsonl
- class indra_db.cli.dump.Sif(start, use_principal=False, **kwargs)[source]
Dumps a pandas dataframe of preassembled statements
- class indra_db.cli.dump.StatementHashMeshId(start, use_principal=False, **kwargs)[source]
Dump a mapping from Statement hashes to MeSH terms.
- class indra_db.cli.dump.End(start=None, date_stamp=None, **kwargs)[source]
Mark the dump as complete.
- indra_db.cli.dump.dump(principal_db, readonly_db=None, delete_existing=False, allow_continue=True, load_only=False, dump_only=False, no_redirect_to_principal=True)[source]
Run the suite of dumps in the specified order.
- Parameters:
principal_db (indra_db.databases.PrincipalDatabaseManager) – A handle to the principal database.
readonly_db (indra_db.databases.ReadonlyDatabaseManager) – A handle to the readonly database. Optional when running dump only.
delete_existing (bool) – If True, clear out the existing readonly build from the principal database. Otherwise it will be continued. (Default is False)
allow_continue (bool) – If True, each step will assume that it may already have been done, and where possible the work will be picked up where it was left off. (Default is True)
load_only (bool) – No new dumps will be created, but an existing dump will be used to populate the given readonly database. (Default is False)
dump_only (bool) – Do not load a new readonly database, only produce the dump files on s3. (Default is False)
no_redirect_to_principal (bool) – If False (default), and if we are running without dump_only (i.e., we are also loading a dump into a readonly DB), then we redirect the lambda function driving the REST API to the readonly schema in the principal DB while the readonly DB is being restored. If True, this redirect is not attempted and we assume it is okay if the readonly DB being restored is not accessible for the duration of the load.
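The interaction of load_only and dump_only can be sketched as simple phase selection. This illustrates only the flag logic; the real function also handles continuation, deletion of existing schemas, and the REST API redirect:

```python
def plan_dump(load_only=False, dump_only=False):
    """Decide which phases of the dump suite to run.

    load_only suppresses creating new dump files; dump_only suppresses
    restoring a dump into the readonly database.
    """
    do_dump = not load_only   # create new dump files on s3
    do_load = not dump_only   # load a readonly dump into the readonly db
    return {"dump": do_dump, "load": do_load}
```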