Utilities

Here live the more mundane, backend utilities used throughout the other modules of the codebase, and potentially elsewhere, although they are not in general intended for external use. Several more-or-less bespoke scripts are also stored here.

Database Session Constructors (indra_db.util.constructors)

Constructors to get interfaces to the different databases, selecting among the various physical instances defined in the config file.

indra_db.util.constructors.get_db(db_label, protected=False)[source]

Get a database instance based on its name in the config or environment.

If the label does not exist or the labeled database cannot be reached, None is returned.
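
For example, a minimal usage sketch (the 'primary' label is an assumption; use whatever label is defined in your config):

from indra_db.util.constructors import get_db

db = get_db('primary')
if db is None:
    raise RuntimeError("Could not reach the database labeled 'primary'")
# db is a DatabaseManager; e.g. db.select_one(db.TextRef) runs a simple query.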

indra_db.util.constructors.get_primary_db(force_new=False)[source]

Get a DatabaseManager instance for the primary database host.

The primary database host is defined in the defaults.txt file, or in a file given by the environment variable DEFAULTS_FILE. Alternatively, it may be defined by the INDRADBPRIMARY environment variable. If none of the above are specified, this function will raise an exception.

Note: by default, calling this function twice will return the same DatabaseManager instance. In other words:

db1 = get_primary_db()
db2 = get_primary_db()
db1 is db2

This also means that, in the above context, for example, db1.select_one(db2.TextRef) will work.

It is still recommended that scripts, functions, and other general applications not rely on this feature for database access, as doing so can make substituting a different database host complicated and messy. Rather, a database instance should be passed explicitly, as is done in the get_statements_by_gene_role_type function's call to get_statements in indra.db.query_db_stmts.

Parameters

force_new (bool) – If True, a new instance will be created and returned, regardless of whether an existing instance exists. Default is False, so that if this function has been called before within the global scope, the instance that was first created will be returned.

Returns

primary_db – An instance of the database manager that is attached to the primary database.

Return type

DatabaseManager

indra_db.util.constructors.get_ro(ro_label, protected=True)[source]

Get a readonly database instance, based on its name.

If the label does not exist or the labeled database cannot be reached, None is returned.

indra_db.util.constructors.get_ro_host(ro_label)[source]

Get the host of the current readonly database.
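
A brief sketch combining the two constructors above (the 'primary' label is an assumption; substitute a readonly label from your config):

from indra_db.util.constructors import get_ro, get_ro_host

ro = get_ro('primary')             # None if the label is missing or unreachable
if ro is not None:
    print(get_ro_host('primary'))  # host of the readonly database behind that label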

Scripts to Get Content (indra_db.util.content_scripts)

General scripts for getting content by various IDs.

indra_db.util.content_scripts.get_stmts_with_agent_text_like(pattern, filter_genes=False, db=None)[source]

Get statement ids for statements with an agent whose raw text matches the pattern.

Parameters
  • pattern (str) – A pattern understood by sqlalchemy's like operator, e.g. '__' for two-letter agent texts.

  • filter_genes (Optional[bool]) – If True, only return entries for agent texts that have at least one HGNC grounding in the database. Default: False

  • db (Optional[DatabaseManager]) – Optionally pass in a database manager. If None, the primary database is used. Default: None

Returns

dict mapping agent texts to statement ids. Agent texts are those matching the input pattern. Each agent text maps to the list of statement ids for statements containing an agent with that TEXT in its db_refs.

Return type

dict
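
For example, a minimal sketch (assuming the primary database is reachable) that collects statements whose agents have two-letter raw texts with at least one HGNC grounding:

from indra_db.util.content_scripts import get_stmts_with_agent_text_like

text_to_stmts = get_stmts_with_agent_text_like('__', filter_genes=True)
for agent_text, stmt_ids in text_to_stmts.items():
    print(agent_text, len(stmt_ids))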

indra_db.util.content_scripts.get_text_content_from_stmt_ids(stmt_ids, db=None)[source]

Get text content for statements from a list of ids

Gets the fulltext if it is available, even if the statement came from an abstract.

Parameters
  • stmt_ids (list of str) – Statement ids for which to retrieve text content.

  • db (Optional[DatabaseManager]) – Optionally pass in a database manager. If None, the primary database is used. Default: None

Returns
  • ref_dict (dict) – dict mapping statement ids to identifiers for pieces of content. These identifiers take the form '<text_ref_id>/<source>/<text_type>'. No entries exist for statements with no associated text content (these typically come from databases).

  • text_dict (dict) – dict mapping the content identifiers used as values in ref_dict to the best available text content. The order of preference is fulltext xml > plaintext abstract > title.
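
A usage sketch based on the two return values documented above (the statement ids shown are hypothetical placeholders):

from indra_db.util.content_scripts import get_text_content_from_stmt_ids

stmt_ids = ['12345', '67890']  # hypothetical statement ids
ref_dict, text_dict = get_text_content_from_stmt_ids(stmt_ids)
for stmt_id, ref in ref_dict.items():
    text = text_dict[ref]  # ref looks like '<text_ref_id>/<source>/<text_type>'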

Distilling Raw Statements (indra_db.util.distill_statements)

Do some pre-pre-assembly cleansing of the raw Statements to account for various kinds of duplication that are artifacts of our content collection and reading pipelines rather than genuinely duplicated knowledge in the literature.

indra_db.util.distill_statements.delete_raw_statements_by_id(db, raw_sids, sync_session=False, remove='all')[source]

Delete raw statements, their agents, and their raw-unique links.

It is best to batch over this function with sets of 1000 or so ids. Setting sync_session to False will result in a much faster resolution, but you may find some ORM objects have not been updated.
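
Following the batching advice above, a minimal sketch (the ids shown are hypothetical raw statement ids you have already decided to remove):

from indra_db.util.constructors import get_db
from indra_db.util.distill_statements import delete_raw_statements_by_id

db = get_db('primary')
raw_ids = [101, 102, 103]  # hypothetical raw statement ids marked for deletion

batch_size = 1000
for i in range(0, len(raw_ids), batch_size):
    delete_raw_statements_by_id(db, raw_ids[i:i + batch_size])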

indra_db.util.distill_statements.distill_stmts(db, get_full_stmts=False, clauses=None, handle_duplicates='error')[source]

Get a corpus of statements from the given clauses, filtering out duplicate evidence.

Parameters
  • db (DatabaseManager) – A database manager instance to access the database.

  • get_full_stmts (bool) – By default (False), only Statement ids (the primary index of Statements on the database) are returned. However, if set to True, serialized INDRA Statements will be returned. Note that this will in general be VERY large in memory, and therefore should be used with caution.

  • clauses (None or list of sqlalchemy clauses) – By default None. Specify sqlalchemy clauses to reduce the scope of statements, e.g. clauses=[db.Statements.type == ‘Phosphorylation’] or clauses=[db.Statements.uuid.in_([<uuids>])].

  • handle_duplicates ('error', 'delete', or a string file path) – Choose whether you want to delete the statements that are found to be duplicates (‘delete’), or write a pickle file with their ids (at the string file path) for later handling, or raise an exception (‘error’). The default behavior is ‘error’.

Returns

stmt_ret – A set of either statement ids or serialized statements, depending on get_full_stmts.

Return type

set
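
A minimal sketch (assuming the primary database is reachable), using the clause style shown above and writing duplicate ids to a pickle instead of raising an error:

from indra_db.util.constructors import get_db
from indra_db.util.distill_statements import distill_stmts

db = get_db('primary')
stmt_ids = distill_stmts(db,
                         clauses=[db.Statements.type == 'Phosphorylation'],
                         handle_duplicates='duplicates.pkl')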

indra_db.util.distill_statements.get_filtered_db_stmts(db, get_full_stmts=False, clauses=None)[source]

Get the set of statements/ids from databases minus exact duplicates.

indra_db.util.distill_statements.get_filtered_rdg_stmts(stmt_nd, get_full_stmts, linked_sids=None)[source]

Get the set of statements/ids from readings minus exact duplicates.

indra_db.util.distill_statements.get_reading_stmt_dict(db, clauses=None, get_full_stmts=True)[source]

Get a nested dict of statements, keyed by ref, content, and reading.

Script to Create a SIF Dump (indra_db.util.dump_sif)

Create an interactome from metadata in the database and dump the results as a sif file.

indra_db.util.dump_sif.dump_sif(src_count_file, res_pos_file, belief_file, df_file=None, db_res_file=None, csv_file=None, reload=True, reconvert=True, ro=None, normalize_names: bool = True)[source]

Build and dump a sif dataframe of PA statements with grounded agents

Parameters
  • src_count_file (Union[str, S3Path]) – A location to load the source count dict from. Can be local file path, an s3 url string or an S3Path instance.

  • res_pos_file (Union[str, S3Path]) – A location to load the residue-position dict from. Can be local file path, an s3 url string or an S3Path instance.

  • belief_file (Union[str, S3Path]) – A location to load the belief dict from. Can be local file path, an s3 url string or an S3Path instance.

  • df_file (Optional[Union[str, S3Path]]) – If provided, dump the sif to this location. Can be local file path, an s3 url string or an S3Path instance.

  • db_res_file (Optional[Union[str, S3Path]]) – If provided, save the db content to this location. Can be local file path, an s3 url string or an S3Path instance.

  • csv_file (Optional[Union[str, S3Path]]) – If provided, calculate dataframe statistics and save to local file or s3. Can be local file path, an s3 url string or an S3Path instance.

  • reconvert (bool) – Whether to generate a new DataFrame from the database content or to load and return a DataFrame from df_file. If False, df_file must be given. Default: True.

  • reload (bool) – If True, load new content from the database and make a new dataframe. If False, content can be loaded from provided files. Default: True.

  • ro (Optional[PrincipalDatabaseManager]) – Provide a DatabaseManager to load database content from. If not provided, get_ro(‘primary’) will be used.

  • normalize_names – If True, detect and try to merge name duplicates (same entity with different names, e.g. Loratadin vs loratadin). Default: True
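
A sketch of a local run (all file names are hypothetical; the source count, residue-position, and belief pickles are assumed to have been produced elsewhere):

from indra_db.util.dump_sif import dump_sif

dump_sif(src_count_file='source_counts.pkl',
         res_pos_file='res_pos.pkl',
         belief_file='belief_scores.pkl',
         df_file='sif.pkl',
         csv_file='sif_stats.csv')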

indra_db.util.dump_sif.get_source_counts(pkl_filename=None, ro=None)[source]

Returns a dict of dicts with evidence count per source, per statement

The dictionary is keyed at the top level by statement hash, and each entry contains a dictionary keyed by the sources that support the statement, where each entry is the evidence count for that source.

indra_db.util.dump_sif.load_db_content(ns_list, pkl_filename=None, ro=None, reload=False)[source]

Get preassembled stmt metadata from the DB for export.

Queries the NameMeta, TextMeta, and OtherMeta tables as needed to get agent/stmt metadata for agents from the given namespaces.

Parameters
  • ns_list (list of str) – List of agent namespaces to include in the metadata query.

  • pkl_filename (str) – Name of pickle file to save to (if reloading) or load from (if not reloading). If an S3 path is given (i.e., pkl_filename starts with s3:), the file is saved to/loaded from S3. If not given, the content is automatically reloaded (overriding reload).

  • ro (ReadonlyDatabaseManager) – Readonly database to load the content from. If not given, calls get_ro(‘primary’) to get the primary readonly DB.

  • reload (bool) – Whether to re-query the database for content or to load the content from pkl_filename. Note that even if reload is False, if no pkl_filename is given, data will be reloaded anyway.

Returns

Set of tuples containing statement information organized by agent. Tuples contain (stmt_hash, agent_ns, agent_id, agent_num, evidence_count, stmt_type).

Return type

set of tuples
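
For example (the namespace list is illustrative; any agent namespaces may be given):

from indra_db.util.constructors import get_ro
from indra_db.util.dump_sif import load_db_content

ro = get_ro('primary')
content = load_db_content(['HGNC', 'FPLX', 'CHEBI'], ro=ro, reload=True)
# Each element: (stmt_hash, agent_ns, agent_id, agent_num, evidence_count, stmt_type)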

indra_db.util.dump_sif.load_res_pos(ro=None)[source]

Return residue/position data keyed by hash

indra_db.util.dump_sif.make_dataframe(reconvert, db_content, res_pos_dict, src_count_dict, belief_dict, pkl_filename=None, normalize_names: bool = False)[source]

Make a pickled DataFrame of the db content, one row per stmt.

Parameters
  • reconvert (bool) – Whether to generate a new DataFrame from the database content or to load and return a DataFrame from the given pickle file. If False, pkl_filename must be given.

  • db_content (set of tuples) – Set of tuples of agent/stmt data as returned by load_db_content.

  • res_pos_dict (Dict[str, Dict[str, str]]) – Dict containing residue and position keyed by hash.

  • src_count_dict (Dict[str, Dict[str, int]]) – Dict of dicts containing source counts per source api keyed by hash.

  • belief_dict (Dict[str, float]) – Dict of belief scores keyed by hash.

  • pkl_filename (str) – Name of pickle file to save to (if reconverting) or load from (if not reconverting). If an S3 path is given (i.e., pkl_filename starts with s3:), the file is saved to/loaded from S3. If not given, reloads the content (overriding reload).

  • normalize_names – If True, detect and try to merge name duplicates (same entity with different names, e.g. Loratadin vs loratadin). Default: False

Returns

DataFrame containing the content, with columns: ‘agA_ns’, ‘agA_id’, ‘agA_name’, ‘agB_ns’, ‘agB_id’, ‘agB_name’, ‘stmt_type’, ‘evidence_count’, ‘stmt_hash’.

Return type

pandas.DataFrame
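
A sketch chaining the loaders above into a DataFrame (assuming a reachable readonly database; belief_dict is assumed to have been loaded separately, e.g. from a belief dump pickle):

from indra_db.util.constructors import get_ro
from indra_db.util.dump_sif import (load_db_content, load_res_pos,
                                    get_source_counts, make_dataframe)

ro = get_ro('primary')
content = load_db_content(['HGNC', 'FPLX', 'CHEBI'], ro=ro, reload=True)
res_pos = load_res_pos(ro=ro)
src_counts = get_source_counts(ro=ro)
belief_dict = {}  # assumed loaded elsewhere: {<stmt_hash>: <belief score>}
sif_df = make_dataframe(reconvert=True,
                        db_content=content,
                        res_pos_dict=res_pos,
                        src_count_dict=src_counts,
                        belief_dict=belief_dict,
                        pkl_filename='sif.pkl')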

General Helper Functions (indra_db.util.helpers)

Functions with broad utility throughout the repository, but otherwise miscellaneous.

indra_db.util.helpers.get_raw_stmts_frm_db_list(db, db_stmt_objs, fix_refs=True, with_sids=True)[source]

Convert table objects of raw statements into INDRA Statement objects.

indra_db.util.helpers.get_statement_object(db_stmt)[source]

Get an INDRA Statement object from a db_stmt.

Routines for Inserting Statements and Content (indra_db.util.insert)

Inserting content into the database can be a rather involved process, but here are defined high-level utilities to uniformly accomplish the task.

indra_db.util.insert.extract_agent_data(stmt, stmt_id)[source]

Create the tuples for copying agents into the database.

indra_db.util.insert.insert_db_stmts(db, stmts, db_ref_id, verbose=False, batch_id=None)[source]

Insert statements, their database references, and any affiliated agents.

Note that this method is for uploading statements that came from another database into our database, not for inserting arbitrary statements into the database.

Parameters
  • db (DatabaseManager) – The manager for the database into which you are loading statements.

  • stmts (list [indra.statements.Statement]) – (Cannot be a generator) A list of un-assembled indra statements, each with EXACTLY one evidence and no exact duplicates, to be uploaded to the database.

  • db_ref_id (int) – The id of the db_ref entry corresponding to these statements.

  • verbose (bool) – If True, print extra information and a status bar while compiling statements for insert. Default False.

  • batch_id (int or None) – Select a batch id to use for this upload. It can be used to trace what content has been added.
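
A sketch of the expected call (db, kb_stmts, and db_ref_id are assumed: a DatabaseManager, a list of one-evidence Statements from a knowledge base, and the id of that knowledge base's db_ref entry):

from indra_db.util.insert import insert_db_stmts

insert_db_stmts(db, kb_stmts, db_ref_id, verbose=True, batch_id=1)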

indra_db.util.insert.insert_pa_stmts(db, stmts, verbose=False, do_copy=True, ignore_agents=False, commit=True)[source]

Insert pre-assembled statements, and any affiliated agents.

Parameters
  • db (DatabaseManager) – The manager for the database into which you are loading pre-assembled statements.

  • stmts (iterable [indra.statements.Statement]) – A list of pre-assembled indra statements to be uploaded to the database.

  • verbose (bool) – If True, print extra information and a status bar while compiling statements for insert. Default False.

  • do_copy (bool) – If True (default), use pgcopy to quickly insert the agents.

  • ignore_agents (bool) – If False (default), add agents to the database. If True, then agent insertion is skipped.

  • commit (bool) – If True (default), commit the result immediately. Otherwise the results are not committed (thus allowing multiple related insertions to be neatly rolled back upon failure).
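
A sketch of the expected call (db and pa_stmts are assumed: a DatabaseManager and a list of pre-assembled Statements):

from indra_db.util.insert import insert_pa_stmts

insert_pa_stmts(db, pa_stmts, verbose=True, do_copy=True, commit=True)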

indra_db.util.insert.insert_raw_agents(db, batch_id, stmts=None, verbose=False, num_per_yield=100, commit=True)[source]

Insert agents for statements that don’t have any agents.

Parameters
  • db (DatabaseManager) – The manager for the database into which you are adding agents.

  • batch_id (int) – Every set of new raw statements must be given an id unique to that copy. That id is used to get the set of statements that need agents added.

  • stmts (list[indra.statements.Statement]) – The list of statements that include those whose agents are being uploaded.

  • verbose (bool) – If True, print extra information and a status bar while compiling agents for insert from statements. Default False.

  • num_per_yield (int) – To conserve memory, statements are loaded in batches of num_per_yield using the yield_per feature of sqlalchemy queries.

  • commit (bool) – Optionally do not commit at the end. Default is True, meaning a commit will be executed.

indra_db.util.insert.regularize_agent_id(id_val, id_ns)[source]

Change agent ids for better searchability and indexability.
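
For illustration only (the exact regularization applied depends on the namespace; this call simply follows the signature above):

from indra_db.util.insert import regularize_agent_id

clean_id = regularize_agent_id('CHEBI:15996', 'CHEBI')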