Some Miscellaneous Modules

Here are some modules and files that live on their own, and don’t fit neatly into other categories.

Low Level Database Interface (indra_db.databases)

The Database Manager classes are the lowest-level interface to the database. They are implemented with SQLAlchemy, providing useful shortcuts while still allowing full access to SQLAlchemy's API.

class indra_db.databases.DatabaseManager(url, label=None, protected=False)[source]

An object used to access INDRA’s database.

This object can be used to access and manage INDRA's database. It includes both basic methods and some useful, more high-level methods. It is designed to be used with PostgreSQL or SQLite.

This object is primarily built around SQLAlchemy, which is a required package for its use. It also optionally makes use of the pgcopy package for large data transfers.

If you wish to access the primary database, you can simply use the get_db function to get an instance of this object using the default settings.

Parameters
  • url (str) – The database to which you want to interface.

  • label (Optional[str]) – A short string to indicate the purpose of the db instance. Set as db_label when initialized with get_db(db_label).

Example

If you wish to access the primary database and find the metadata for a particular pmid, 1234567:

from indra.db import get_db
db = get_db('primary')
res = db.select_all(db.TextRef, db.TextRef.pmid == '1234567')

You will get a list of objects whose attributes give the metadata contained in the columns of the table.

For more sophisticated examples, several use cases can be found in indra.tests.test_db.

classmethod create_instance(instance_name, size, tag_dict=None)[source]

Allocate the resources on RDS for a database, and return the handle.

get_config_string()[source]

Print a config entry for this handle.

This is useful after using create_instance.

get_env_string()[source]

Generate the string for an environment variable.

This is useful after using create_instance.

grab_session()[source]

Get an active session with the database.

get_tables()[source]

Get a list of available tables.

show_tables(active_only=False, schema=None)[source]

Print a list of all the available tables.

get_active_tables(schema=None)[source]

Get the tables currently active in the database.

Parameters

schema (None or str) – The name of the schema whose tables you wish to see. The default is public.

get_schemas()[source]

Return the list of schema names currently in the database.

create_schema(schema_name)[source]

Create a schema with the given name.

drop_schema(schema_name, cascade=True)[source]

Drop a schema (rather forcefully by default).

get_column_names(table)[source]

Get a list of the column labels for a table.

Note that if the table involves a schema, the schema name must be prepended to the table name.

get_column_objects(table)[source]

Get a list of the column objects for the given table.

Note that if the table involves a schema, the schema name must be prepended to the table name.

commit(err_msg)[source]

Commit, and give useful info if there is an exception.

Get the joining clause between two tables, if one exists.

If no link exists, an exception will be raised. Note that this only works for directly linked tables.

get_values(entry_list, col_names=None, keyed=False)[source]

Get the column values from the entries in entry_list.

insert(table, ret_info=None, **input_dict)[source]

Insert an entry into the specified table, and return its id.

insert_many(table, input_data_list, ret_info=None, cols=None)[source]

Insert many records into the table given by table_name.

delete_all(entry_list)[source]

Remove the given records from the given table.

get_copy_cursor()[source]

Execute SQL queries in the context of a copy operation.

make_copy_batch_id()[source]

Generate a random batch id for copying into the database.

This allows for easy retrieval of the assigned ids immediately after copying in. At this time, only Reading and RawStatements use the feature.
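As a hypothetical sketch of the idea (the real implementation may differ in range and mechanism), a batch id is just a random integer stamped on the copied rows so they can be queried back afterwards:

```python
import random

def make_copy_batch_id(max_id=2**30):
    # Hypothetical sketch: the real method lives on DatabaseManager and
    # may use a different range or source of randomness.
    return random.randint(0, max_id)

batch_id = make_copy_batch_id()

# The id would then be stamped on each copied row, so the new rows (and
# their database-assigned primary keys) can be selected immediately
# after the copy completes.
```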

copy_report_lazy(tbl_name, data, cols=None, commit=True, constraint=None, return_cols=None, order_by=None)[source]

Copy lazily, and report what rows were skipped.

copy_detailed_report_lazy(tbl_name, data, inp_cols=None, ret_cols=None, commit=True, constraint=None, skipped_cols=None, order_by=None)[source]

Copy lazily, returning data from some of the columns such as IDs.

copy_lazy(tbl_name, data, cols=None, commit=True, constraint=None)[source]

Copy lazily, skip any rows that violate constraints.

copy_push(tbl_name, data, cols=None, commit=True, constraint=None)[source]

Copy, pushing any changes to constraint violating rows.

copy_report_push(tbl_name, data, cols=None, commit=True, constraint=None, return_cols=None, order_by=None)[source]

Report on the rows skipped when pushing and copying.

copy(tbl_name, data, cols=None, commit=True)[source]

Use pg_copy to copy over a large amount of data.
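The copy variants above differ in how rows that violate constraints are handled: copy_lazy skips them, copy_push overwrites the conflicting rows, and the report variants additionally return information about what was skipped. As a rough, runnable analogy using SQLite's conflict clauses (the real methods use pgcopy against PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE text_ref (id INTEGER PRIMARY KEY, pmid TEXT UNIQUE)")

rows = [("1234567",), ("7654321",), ("1234567",)]  # third row duplicates the first

# copy_lazy-style: silently skip constraint-violating rows.
conn.executemany("INSERT OR IGNORE INTO text_ref (pmid) VALUES (?)", rows)
n_rows = conn.execute("SELECT COUNT(*) FROM text_ref").fetchone()[0]  # 2

# copy_push-style: push the new value over the conflicting row instead.
conn.execute(
    "INSERT INTO text_ref (pmid) VALUES (?) "
    "ON CONFLICT(pmid) DO UPDATE SET pmid = excluded.pmid",
    ("1234567",),
)
```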

filter_query(tbls, *args)[source]

Query a table and filter results.

count(tbl, *args)[source]

Get a count of the results to a query.

get_primary_key(tbl)[source]

Get an instance for the primary key column of a given table.

select_one(tbls, *args)[source]

Select the first value that matches requirements.

Requirements are given in *args as clauses on the table(s) indicated by tbls. See select_all.

Note that if your specification yields multiple results, this method will simply return the first result, without raising an exception.

select_all(tbls, *args, **kwargs)[source]

Select any and all entries from the table(s) given by tbls.

The results will be filtered by your keyword arguments. For example if you want to get a text ref with pmid ‘10532205’, you would call:

db.select_all('text_ref', db.TextRef.pmid == '10532205')

Note that double equals are required, not a single equal. Equivalently you could call:

db.select_all(db.TextRef, db.TextRef.pmid == '10532205')

For a more complicated example, suppose you want to get all text refs that have full text from pmc oa, you could select:

db.select_all(
    [db.TextRef, db.TextContent],
    db.TextContent.text_ref_id == db.TextRef.id,
    db.TextContent.source == 'pmc_oa',
    db.TextContent.text_type == 'fulltext'
)

Parameters
  • tbls – See above for usage.

  • *args – See above for usage.

  • **kwargs – yield_per (int or None): if the result of your query is expected to be large, you can choose to load only yield_per items at a time, using the eponymous feature of SQLAlchemy queries. The default is None, meaning all results will be loaded at once.
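The filter clauses passed as *args are ordinary SQLAlchemy expressions, which is why double equals are required. A self-contained sketch of the same pattern, using a stand-in TextRef model and an in-memory SQLite database (not the real table definition):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class TextRef(Base):
    # Stand-in for db.TextRef; the real table has more columns.
    __tablename__ = 'text_ref'
    id = Column(Integer, primary_key=True)
    pmid = Column(String)

engine = create_engine('sqlite://')  # in-memory database
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add_all([TextRef(pmid='10532205'), TextRef(pmid='1234567')])
session.commit()

# select_all(tbls, *args) is essentially query(tbls).filter(*args).all():
res = session.query(TextRef).filter(TextRef.pmid == '10532205').all()
```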

select_all_batched(batch_size, tbls, *args, skip_idx=None, order_by=None)[source]

Load the results of a query in batches of size batch_size.

Note that this differs from using yield_per in that the results are not returned as a single iterable, but as an iterator of iterables.

Note also that the order of results, and thus the contents of offsets, may vary for large queries unless an explicit order_by clause is added to the query.
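The resulting shape, an iterator of (offset, batch) groups rather than one flat iterable, can be pictured with a plain-Python sketch (hypothetical; the real method pages through a database query rather than a list):

```python
def batched(items, batch_size):
    # Hypothetical sketch: yield (offset, batch) groups, one list per batch.
    # The real method issues limit/offset queries against the database.
    for offset in range(0, len(items), batch_size):
        yield offset, items[offset:offset + batch_size]

groups = list(batched(list(range(7)), batch_size=3))
# → [(0, [0, 1, 2]), (3, [3, 4, 5]), (6, [6])]
```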

select_sample_from_table(number, table, *args, **kwargs)[source]

Select a number of random samples from the given table.

Parameters
  • number (int) – The number of samples to return

  • table (str, table class, or column attribute of table class) – The table or table column to be sampled.

  • *args – All other arguments are passed to select_all, including any and all filtering clauses.

  • **kwargs – All other arguments are passed to select_all, including any and all filtering clauses.

Return type

A list of sqlalchemy orm objects
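In SQL terms, a random sample of this kind boils down to ordering by a random value and limiting the count; a runnable illustration with SQLite (the real method goes through select_all and accepts any of its table arguments and filters):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reading (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO reading (id) VALUES (?)",
                 [(i,) for i in range(100)])

# Five random rows, roughly what select_sample_from_table(5, ...) amounts to.
sample = conn.execute(
    "SELECT id FROM reading ORDER BY RANDOM() LIMIT 5"
).fetchall()
```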

has_entry(tbls, *args)[source]

Check whether an entry/entries matching given specs live in the db.

pg_dump(dump_file, **options)[source]

Use the pg_dump command to dump part of the database onto s3.

The pg_dump tool must be installed, and must be a compatible version with the database(s) being used.

All keyword arguments are converted into flags/arguments of pg_dump. For documentation, run pg_dump --help; this will also confirm that you have pg_dump installed.

By default, the “General” and “Connection” options are already set. The most likely specification you will want to use is --table or --schema, specifying either a particular table or schema to dump.

Parameters

dump_file (S3Path or str) – The location on s3 where the content should be dumped.

pg_restore(dump_file, **options)[source]

Load content into the database from a dump file on s3.

exception indra_db.databases.IndraDbException[source]
indra_db.databases.readers = {'EIDOS': 5, 'ISI': 4, 'MTI': 6, 'REACH': 1, 'SPARSER': 2, 'TRIPS': 3}

A dict mapping each reader to a unique integer ID.

These IDs are used in creating the reading primary ID hashes. Thus, for a new reader to be fully integrated, it must be added to the above dictionary.

indra_db.databases.reader_versions = {'eidos': ['0.2.3-SNAPSHOT', '1.7.1-SNAPSHOT'], 'isi': ['20180503'], 'mti': ['1.0'], 'reach': ['61059a-biores-e9ee36', '1.3.3-61059a-biores-', '1.6.1', '1.6.3-e48717'], 'sparser': ['sept14-linux\n', 'sept14-linux', 'June2018-linux', 'October2018-linux', 'February2020-linux', 'April2020-linux'], 'trips': ['STATIC', '2019Nov14', '2021Jan26']}

A dict of list values keyed by reader name, tracking reader versions.

The oldest versions are to the left, and the newest to the right. We keep track of all past versions as it is often not practical nor necessary to re-run a reading on all content. Even in cases where it is, it is often useful to be able to compare results.

As with the readers variable above, this is used in the creation of the unique hash for a reading entry. For a new reader version to work, it must be added to the appropriate list.
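For example, using the two dictionaries above (values copied from this module, reader_versions abbreviated), a reader's integer ID and its newest version can be looked up as follows; how these feed into the reading hash itself is not shown here:

```python
readers = {'EIDOS': 5, 'ISI': 4, 'MTI': 6, 'REACH': 1, 'SPARSER': 2, 'TRIPS': 3}
reader_versions = {
    'reach': ['61059a-biores-e9ee36', '1.3.3-61059a-biores-',
              '1.6.1', '1.6.3-e48717'],
    'trips': ['STATIC', '2019Nov14', '2021Jan26'],
}  # subset of the full module-level dict

reader_id = readers['REACH']           # the unique integer ID for REACH
latest = reader_versions['reach'][-1]  # newest version is rightmost
```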

class indra_db.databases.PrincipalDatabaseManager(host, label=None, protected=False)[source]

This class represents the methods special to the principal database.

generate_readonly(belief_dict, allow_continue=True)[source]

Manage the materialized views.

Parameters
  • belief_dict (dict) – The dictionary, keyed by hash, of belief calculated for Statements.

  • allow_continue (bool) – If True (default), continue to build the schema if it already exists. If False, give up if the schema already exists.

dump_readonly(dump_file=None)[source]

Dump the readonly schema to s3.

create_tables(tbl_list=None)[source]

Create the public tables for INDRA database.

drop_tables(tbl_list=None, force=False)[source]

Drop the tables for INDRA database given in tbl_list.

If tbl_list is None, all tables will be dropped. Note that if force is False, a warning prompt will be raised asking for confirmation, as this action will remove all data from the table(s).

class indra_db.databases.ReadonlyDatabaseManager(host, label=None, protected=True)[source]

This class represents the readonly database.

get_config_string()[source]

Print a config entry for this handle.

This is useful after using create_instance.

get_source_names() → set[source]

Get a list of the source names as they appear in SourceMeta cols.

get_active_tables(schema='readonly')[source]

Get the tables currently active in the database.

Parameters

schema (None or st) – The name of the schema whose tables you wish to see. The default is readonly.

ensure_indices()[source]

Iterates over all the tables and builds indices if they are missing.

When restoring a readonly dump into an instance, some indices may be missing. This function rebuilds missing indices while skipping any existing ones.

load_dump(dump_file, force_clear=True)[source]

Load from a dump of the readonly schema on s3.

Belief Calculator (indra_db.belief)

The belief in the knowledge of a Statement is a measure of our confidence that the Statement is an accurate representation of the text, _NOT_ our confidence in the validity of what was in that text. Given the size of the content in the database, some special care is needed when calculating this value, which depends heavily on the support relations between pre-assembled Statements.

This file contains tools to calculate belief scores for the database.

Scores are calculated using INDRA’s belief engine, with MockStatements and MockEvidence derived from shallow metadata on the database, allowing the entire corpus to be processed locally in RAM, in very little time.

exception indra_db.belief.LoadError[source]
class indra_db.belief.MockEvidence(source_api, **annotations)[source]

A class to imitate real INDRA Evidence for calculating belief.

class indra_db.belief.MockStatement(mk_hash, evidence=None, supports=None, supported_by=None)[source]

A class to imitate real INDRA Statements for calculating belief.

indra_db.belief.load_mock_statements(db, hashes=None, sup_links=None)[source]

Generate a list of mock statements from the pa statement table.

indra_db.belief.populate_support(stmts, links)[source]

Populate the supports and supported_by lists of statements, given links.

Parameters
  • stmts (list[MockStatement/Statement]) – A list of objects with supports and supported_by attributes which are lists or equivalent.

  • links (list[tuple]) – A list of pairs of hashes or matches_keys, where the first supports the second.
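Putting the parameter descriptions together, a minimal sketch of the wiring (hypothetical stand-ins, not the library classes) might look like:

```python
class MockStatement:
    # Minimal stand-in: just a hash plus the two support lists.
    def __init__(self, mk_hash):
        self.mk_hash = mk_hash
        self.supports = []
        self.supported_by = []

def populate_support(stmts, links):
    # For each (supporting, supported) pair of hashes, wire up both sides.
    by_hash = {s.mk_hash: s for s in stmts}
    for supporting_hash, supported_hash in links:
        by_hash[supporting_hash].supports.append(by_hash[supported_hash])
        by_hash[supported_hash].supported_by.append(by_hash[supporting_hash])

stmts = [MockStatement(111), MockStatement(222)]
populate_support(stmts, [(111, 222)])  # 111 supports 222
```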