ComptoxAI Databases

ComptoxAI relies mainly on two databases:

  1. A graph database, implemented in Neo4j

  2. A feature database, implemented in MongoDB

Briefly, the graph database stores the relationships between entities that are relevant to ComptoxAI, which together form a large, complex network. The feature database contains quantitative data tied to the entities in the graph database. We keep these in two databases largely for performance reasons: graph databases aren’t especially good at storing large quantities of numerical data for each entity, while relational and NoSQL databases (like MongoDB) don’t provide easy interaction with complex network structures.

The comptox_ai.db.GraphDB class aims to make interacting with the two relatively painless. For example, you can extract a graph from the graph database, and in a single command, fetch the feature data corresponding to the entities in that graph.
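To make the division of labor concrete, here is a minimal sketch of the join that the two-database design implies: graph entities carry identifiers, and feature records are looked up by those identifiers. Plain Python dicts stand in for both databases, and the names used (`attach_features`, `node_id`, `features`, `mol_weight`) are illustrative assumptions, not ComptoxAI's actual schema or API.

```python
def attach_features(nodes, feature_store, id_key="node_id"):
    """Return copies of `nodes` with a 'features' entry merged in.

    `nodes` stands in for entities extracted from the graph database and
    `feature_store` for the feature database, keyed by entity ID. Both are
    hypothetical in-memory stand-ins, not ComptoxAI objects.
    """
    merged = []
    for node in nodes:
        record = dict(node)  # shallow copy so the input is left untouched
        # Entities without stored features get an empty feature dict.
        record["features"] = feature_store.get(node[id_key], {})
        merged.append(record)
    return merged


# Hypothetical example data.
nodes = [
    {"node_id": "c1", "label": "Chemical"},
    {"node_id": "g1", "label": "Gene"},
]
feature_store = {"c1": {"mol_weight": 180.16}}
enriched = attach_features(nodes, feature_store)
```

Keeping the lookup keyed by a shared identifier is what lets the bulky numerical data live outside the graph while remaining one call away from any extracted subgraph.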

Graph database software is changing constantly. We’re more than willing to consider migrating to a single database solution when a good option is available. If you think you know of a good alternative, let us know on GitHub.

class comptox_ai.db.GraphDB(username=None, password=None, hostname=None, verbose=False)

A Neo4j graph database containing ComptoxAI graph data.

Parameters:
verbose: bool, default False

Turns verbose output on or off. If True, status information will occasionally be reported to the user.

Methods

build_graph_cypher_projection(graph_name, ...)

Create a new graph in the Neo4j Graph Catalog via a Cypher projection.

build_graph_native_projection(graph_name, ...)

Create a new graph in the Neo4j Graph Catalog via a native projection.

convert_ids(node_type, from_id, to_id, ids)

Produce a mapping of IDs for a given node type from one terminology / database to another.

drop_all_existing_graphs()

Delete all graphs currently stored in the GDS graph catalog.

drop_existing_graph(graph_name)

Delete a single graph from the GDS graph catalog by graph name.

export_graph(graph_name[, to])

Export a graph stored in the GDS graph catalog to a set of CSV files.

fetch(field, operator, value[, what, ...])

Create and execute a query to retrieve nodes, edges, or both.

fetch_chemical_list(list_name)

Fetch all chemicals that are members of a chemical list.

fetch_node_type(node_label)

Fetch an entire class of nodes from the Neo4j graph database.

fetch_nodes(node_type, property, values)

Fetch nodes by node property value.

fetch_relationships(relationship_type, ...)

Fetch edges (relationships) from the Neo4j graph database.

find_node([name, properties])

Find a single node either by name or by property filter(s).

find_nodes([properties, node_types])

Find multiple nodes by node properties and/or labels.

find_relationships()

Find relationships by subject/object nodes and/or relationship type.

find_shortest_paths(node1, node2[, cleaned])

Find the shortest paths between two nodes.

get_graph_statistics()

Fetch statistics for the connected graph database.

get_metagraph()

Examine the graph and construct a metagraph, which describes all of the node types and relationship types in the overall graph database.

list_existing_graphs()

Fetch a list of projected subgraphs stored in the GDS graph catalog.

run_cypher(qry_str[, verbose])

Execute a Cypher query on the Neo4j graph database.

stream_named_graph(graph_name)

Stream a named GDS graph into Python for further processing.

build_graph_cypher_projection(graph_name, node_query, relationship_query, config_dict=None)

Create a new graph in the Neo4j Graph Catalog via a Cypher projection.

Examples

>>> from comptox_ai.db import GraphDB
>>> g = GraphDB()
>>> g.build_graph_cypher_projection(...)

build_graph_native_projection(graph_name, node_types, relationship_types='all', config_dict=None)

Create a new graph in the Neo4j Graph Catalog via a native projection.

Parameters:
graph_name: str
    A (string) name for identifying the new graph. If a graph already exists with this name, a ValueError will be raised.
node_types: str, list of str, or dict
    Node projection for the new graph. This can be a single node label, a list of node labels, or a node projection dict (see Notes).

Notes

ComptoxAI is meant to hide the implementation and usage details of graph databases from the user, but some advanced features do expose the syntax used in the Neo4j and MongoDB internals. This is especially true when building graph projections in the graph catalog. The components of a projection are described below.

NODE PROJECTIONS:

(corresponding argument: `node_types`)

Node projections take the following format:

    {
        <node-label-1>: {
            label: <neo4j-label>,
            properties: <node-property-mappings>
        },
        <node-label-2>: {
            label: <neo4j-label>,
            properties: <node-property-mappings>
        },
        // …
        <node-label-n>: {
            label: <neo4j-label>,
            properties: <node-property-mappings>
        }
    }

where node-label-i is a name for a node label in the projected graph (it can be the same as or different from the label already in Neo4j), neo4j-label is a node label to match against in the graph database, and node-property-mappings are filters against Neo4j node properties, as defined below.
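As a rough illustration of the format above, the following sketch assembles a node projection dict from a label, a list of labels, or a mapping of projected labels to Neo4j labels. The helper name `make_node_projection` is hypothetical and not part of ComptoxAI, and it assumes the property mappings are supplied as a dict.

```python
def make_node_projection(labels, properties=None):
    """Build a node projection dict in the format described above.

    `labels` may be a single label, a list of labels (each projected onto
    itself), or a dict mapping projected labels to Neo4j labels.
    `properties` is assumed to be a dict of node property mappings and is
    copied into every entry. Hypothetical helper, not ComptoxAI API.
    """
    if isinstance(labels, str):
        labels = [labels]
    if not isinstance(labels, dict):
        labels = {label: label for label in labels}
    props = {} if properties is None else properties
    return {
        projected: {"label": neo4j_label, "properties": dict(props)}
        for projected, neo4j_label in labels.items()
    }


projection = make_node_projection(["Gene", "Chemical"])
renamed = make_node_projection({"Compound": "Chemical"})
```

The `renamed` case shows the situation mentioned above, where the label in the projected graph differs from the label stored in Neo4j.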

NODE PROPERTY MAPPINGS:

RELATIONSHIP PROJECTIONS:

Examples

>>> from comptox_ai.db import GraphDB
>>> g = GraphDB()
>>> g.build_graph_native_projection(
...     graph_name="g1",
...     node_types=['Gene', 'StructuralEntity'],
...     relationship_types="all"
... )

convert_ids(node_type, from_id, to_id, ids)

Produce a mapping of IDs for a given node type from one terminology / database to another.

Parameters:
node_type: str
    Node type of the entities.
from_id: str
to_id: str
ids: list of str

drop_all_existing_graphs()

Delete all graphs currently stored in the GDS graph catalog.

Returns:
list
    A list of dicts describing the graphs that were dropped as a result of calling this method. Each dict follows the same format as the list elements returned by list_existing_graphs().

drop_existing_graph(graph_name)

Delete a single graph from the GDS graph catalog by graph name.

Parameters:
graph_name: str
    The name of a graph, corresponding to the 'graphName' field in the graph’s entry within the GDS graph catalog.
Returns:
dict
    A dict describing the graph that was dropped as a result of calling this method. The dict follows the same format as the list elements returned by list_existing_graphs().

export_graph(graph_name, to='db')

Export a graph stored in the GDS graph catalog to a set of CSV files.

Parameters:
graph_name: str
    The name of a graph, corresponding to the 'graphName' field in the graph’s entry within the GDS graph catalog.

fetch(field, operator, value, what='both', register_graph=True, negate=False, query_type='cypher', **kwargs)

Create and execute a query to retrieve nodes, edges, or both.

Parameters:
field: str

A property label.

what: {‘both’, ‘nodes’, ‘edges’}

The type of objects to fetch from the graph database. Note that this setting is independent of any subgraph registered in Neo4j during query execution: if register_graph is True, an induced subgraph will be registered in the database, but this method may still return only the nodes or only the edges contained in that subgraph.

filter: str

‘Cypher-like’ filter statement, equivalent to a WHERE clause used in a Neo4j Cypher query (analogous to SQL WHERE clauses).

query_type: {‘cypher’, ‘native’}

Whether to create a graph using a Cypher projection or a native projection. The ‘standard’ approach is to use a Cypher projection, but native projections can be (a) faster and (b) easier to use when creating very large subgraphs (e.g., all nodes of several or more types that exist in all of ComptoxAI). See “Notes”, below, for more information, as well as https://neo4j.com/docs/graph-data-science/current/management-ops/graph-catalog-ops/#catalog-graph-create.

Warning

This function is incomplete and should not be used until we can fix its behavior. Specifically, Neo4j’s GDS library does not support non-numeric node or edge properties in any of its graph catalog-related subroutines.
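For readers unfamiliar with Cypher filters, the sketch below shows how a `field`, `operator`, `value`, and `negate` combination could be turned into a WHERE clause. It is only an illustration under stated assumptions: `build_where_clause`, the operator whitelist, the node variable `n`, and the `commonName` property are hypothetical, and this does not reproduce how `fetch()` builds its query internally.

```python
# Comparison operators accepted by this sketch (a subset of Cypher's).
ALLOWED_OPERATORS = {
    "=", "<>", "<", ">", "<=", ">=",
    "CONTAINS", "STARTS WITH", "ENDS WITH",
}


def build_where_clause(field, operator, value, negate=False, var="n"):
    """Return a Cypher WHERE clause and its parameter dict.

    The value is passed as a query parameter ($value) rather than being
    interpolated into the string, which avoids quoting problems.
    Hypothetical helper, not part of ComptoxAI.
    """
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"Unsupported operator: {operator!r}")
    condition = f"{var}.{field} {operator} $value"
    if negate:
        condition = f"NOT ({condition})"
    return f"WHERE {condition}", {"value": value}


# 'commonName' is an assumed property name used only for illustration.
clause, params = build_where_clause("commonName", "CONTAINS", "acid", negate=True)
```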

fetch_chemical_list(list_name)

Fetch all chemicals that are members of a chemical list.

Parameters:
list_name: str
    Name (or acronym) corresponding to a Chemical List in ComptoxAI’s graph database.
Returns:
list_data: dict
    Metadata corresponding to the matched list.
chemicals: list of dict
    Chemical nodes that are members of the chemical list.

fetch_node_type(node_label)

Fetch an entire class of nodes from the Neo4j graph database.

Parameters:
node_label: str
    Node label corresponding to a class of entities in the database.
Returns:
generator of dict

Warning

Since many entities may be members of a single class, users are cautioned that this method may take a very long time to run and/or be very demanding on computing resources.

fetch_nodes(node_type, property, values)

Fetch nodes by node property value.

Allows users to filter by a single node type (i.e., ontology class).

Parameters:
node_type: str
    Node type on which to filter all results. Can speed up queries significantly.
property: str
    Node property to match against.
values: str or list
    Value or list of values on which to match `property`.
Returns:
list of dict
    Each element in the list corresponds to a single node. If no matches are found in the database, an empty list will be returned.
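
Because `values` may be either a single string or a list, a caller-side helper has to normalize it before matching. The sketch below shows one plausible way to express such a lookup as a parameterized Cypher query. The function name `build_fetch_nodes_query` and the `geneSymbol` property are assumptions for illustration; the actual query issued by `fetch_nodes()` may differ.

```python
def build_fetch_nodes_query(node_type, prop, values):
    """Return a parameterized Cypher query matching nodes by property value.

    A single string is wrapped in a list so that both input forms can be
    handled by the same `IN $values` test. Hypothetical helper.
    """
    if isinstance(values, str):
        values = [values]
    query = f"MATCH (n:{node_type}) WHERE n.{prop} IN $values RETURN n;"
    return query, {"values": list(values)}


# 'geneSymbol' is an assumed property name used only for illustration.
query, params = build_fetch_nodes_query("Gene", "geneSymbol", "TP53")
```
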
fetch_relationships(relationship_type, from_label, to_label)

Fetch edges (relationships) from the Neo4j graph database.

find_node(name=None, properties=None)

Find a single node either by name or by property filter(s).

find_nodes(properties={}, node_types=[])

Find multiple nodes by node properties and/or labels.

Parameters:
properties: dict
    Dict of property values to match in the database query. Each key of `properties` should be a (case-sensitive) node property, and each value should be the value of that property (case- and type-sensitive).
node_types: list of str
    Case-sensitive list of strings representing node labels (node types) to include in the results. Two or more node types in a single query may significantly increase runtime. When multiple node labels are given, the results will be the union of the property queries applied to each label.
Returns:
generator of dict
    A generator containing dict representations of nodes matching the given query.

Notes

The value returned in the event of a successful query can be extremely large. To improve performance, the results are returned as a generator rather than a list.
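Since the results arrive as a generator, they can be consumed in fixed-size chunks without ever materializing the full result set. The following is a minimal sketch of that pattern; `in_batches` is a hypothetical helper and the `fake_results` generator merely stands in for the value returned by `find_nodes()`.

```python
from itertools import islice


def in_batches(results, batch_size):
    """Yield lists of up to `batch_size` items from any iterable, lazily."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(results)
    while True:
        # islice pulls at most `batch_size` items and stops early at the end.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# A stand-in generator playing the role of the find_nodes() return value.
fake_results = ({"id": i} for i in range(5))
batches = list(in_batches(fake_results, 2))
```

Note that a generator can only be iterated once; converting it with `list()` restores random access at the cost of holding every node in memory.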

find_relationships()

Find relationships by subject/object nodes and/or relationship type.

find_shortest_paths(node1, node2, cleaned=True)
Parameters:
node1comptox
get_graph_statistics()

Fetch statistics for the connected graph database.

This method essentially calls the APOC procedure apoc.meta.stats() and formats the output.

Returns:
dict
    Dict of statistics describing the graph database.
Raises:
RuntimeError
    If not currently connected to a graph database or the APOC.meta procedures are not installed/available.

get_metagraph()

Examine the graph and construct a metagraph, which describes all of the node types and relationship types in the overall graph database.

Notes

We currently don’t run this upon GraphDB instantiation, but it may be prudent to start doing that at some point in the future. It’s not an extremely quick operation, but it’s also not prohibitively slow.
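To illustrate what a metagraph captures, the sketch below collapses a list of typed relationships into the distinct node types and relationship types they imply. The triple input format, the `build_metagraph` name, the output layout, and the labels used are all assumptions for this example; they are not the structure returned by `get_metagraph()`.

```python
def build_metagraph(relationships):
    """Summarize typed relationships into node types and typed edges.

    `relationships` is an iterable of (source_label, relationship_type,
    target_label) triples. Duplicates collapse into a single entry.
    Hypothetical helper, not part of ComptoxAI.
    """
    node_types = set()
    edge_types = set()
    for source, rel_type, target in relationships:
        node_types.update((source, target))
        edge_types.add((source, rel_type, target))
    return {
        "node_types": sorted(node_types),
        "edge_types": sorted(edge_types),
    }


# Hypothetical labels and relationship types.
triples = [
    ("Chemical", "CHEMICAL_BINDS_GENE", "Gene"),
    ("Chemical", "CHEMICAL_BINDS_GENE", "Gene"),
    ("Gene", "GENE_ASSOCIATES_WITH_DISEASE", "Disease"),
]
metagraph = build_metagraph(triples)
```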

list_existing_graphs()

Fetch a list of projected subgraphs stored in the GDS graph catalog.

Returns:
list
    A list of graphs in the GDS graph catalog. If no graphs exist, this will be the empty list [].
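
A common use of this list is checking whether a graph name is already taken before building a new projection. The sketch below assumes each catalog entry is a dict with a `'graphName'` key, following the field name mentioned in the `drop_existing_graph()` documentation; the helper `graph_exists` and the sample catalog are hypothetical.

```python
def graph_exists(catalog, graph_name):
    """Return True if `graph_name` appears in a list of catalog entries.

    Each entry is assumed to be a dict with a 'graphName' key; entries
    lacking that key are ignored.
    """
    return any(entry.get("graphName") == graph_name for entry in catalog)


# Stand-in for the list that list_existing_graphs() would return.
catalog = [{"graphName": "g1"}, {"graphName": "g2"}]
```
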
run_cypher(qry_str, verbose=True)

Execute a Cypher query on the Neo4j graph database.

Parameters:
qry_str: str
    A string containing the Cypher query to run on the graph database server.
Returns:
list
    The data returned in response to the Cypher query.

Examples

>>> from comptox_ai.db import GraphDB
>>> g = GraphDB()
>>> g.run_cypher("MATCH (c:Chemical) RETURN COUNT(c) AS num_chems;")
[{'num_chems': 719599}]
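
As the example above shows, the result is a list of dicts keyed by the aliases in the RETURN clause. For aggregate queries that yield exactly one row, a small helper can unpack the value. `single_value` is a hypothetical convenience function, not part of ComptoxAI, and it operates on the returned list rather than on the database itself.

```python
def single_value(rows, key):
    """Extract one value from a single-row query result (a list of dicts)."""
    if len(rows) != 1:
        raise ValueError(f"Expected exactly one row, got {len(rows)}")
    return rows[0][key]


# The result shown in the run_cypher example above.
rows = [{'num_chems': 719599}]
num_chems = single_value(rows, 'num_chems')
```
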
stream_named_graph(graph_name)

Stream a named GDS graph into Python for further processing.

Parameters:
graph_name: str
    The name of a graph in the GDS catalog.

class comptox_ai.db.GraphExporter(db: GraphDB, verbose=True)

Exporter for serializing graphs to graph-like formats, meant for consumption by graph-based libraries (like DGL, PyTorch Geometric, networkx, etc.).

Notes

Users shouldn’t usually try to interact with this class directly. They should instead call the appropriate GraphDB method (e.g., db.export_graph()).

Methods

stream_subgraph(node_types[, relationship_types])

Extract a subgraph from the graph database and return it as a Python dictionary object.

stream_subgraph(node_types, relationship_types='all')

Extract a subgraph from the graph database and return it as a Python dictionary object.

Parameters:
node_types: list of str

A list of node types to include in the subgraph. Node type names are case-sensitive.

relationship_types: list or str

A list of relationship labels (edge types) to include in the subgraph. Alternatively, the string ‘all’ may be used to denote that all relationships in the induced subgraph should be included. Relationship type names are case-sensitive.

Returns:
dict

class comptox_ai.db.Node(db, node_type, search_params, return_first_match=False)

A node in ComptoxAI’s graph database.

This class is essentially an immutable dict that is populated at initialization time via a connection to ComptoxAI’s graph database.

Parameters:
db: comptox_ai.db.GraphDB

ComptoxAI graph database in which to perform the node search.

node_type: str

Node label for the entity type being searched.

search_params: dict

Dict of parameters for querying nodes in the database. Exact string matching will be used to query these node properties (i.e., all must be an exact match for the node to be identified). Dict keys are node property names, and dict values are node property values.

return_first_match: bool

If False, searches that match multiple nodes will raise an exception. Otherwise, the first matching node in query results will be returned (all subsequent matches will be discarded).
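
The phrase "essentially an immutable dict" can be illustrated with a small dict subclass that is populated once and then rejects every mutating operation. This is a sketch of the general technique only; `FrozenDict` and the property names below are hypothetical, and the real Node class may enforce immutability differently.

```python
class FrozenDict(dict):
    """A dict that rejects modification after construction."""

    def _readonly(self, *args, **kwargs):
        raise TypeError(f"{type(self).__name__} is immutable")

    # Route every mutating method to the same error.
    __setitem__ = _readonly
    __delitem__ = _readonly
    __ior__ = _readonly
    clear = _readonly
    pop = _readonly
    popitem = _readonly
    setdefault = _readonly
    update = _readonly


# Hypothetical node properties; reads work as with any dict.
node = FrozenDict({"commonName": "Aspirin", "xrefCasRN": "50-78-2"})
```

Reading with `node["commonName"]`, `node.get(...)`, or iteration behaves as usual, while assignments and deletions raise TypeError.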

Methods

clear()

copy()

fromkeys(iterable[, value])

Create a new dictionary with keys from iterable and values set to value.

get(key[, default])

Return the value for key if key is in the dictionary, else default.

items()

keys()

pop(key[, default])

If the key is not found, return the default if given; otherwise, raise a KeyError.

popitem(/)

Remove and return a (key, value) pair as a 2-tuple.

setdefault(key[, default])

Insert key with a value of default if key is not in the dictionary.

update([E, ]**F)

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v
In either case, this is followed by: for k in F: D[k] = F[k]

values()

clear() -> None.  Remove all items from D.

copy() -> a shallow copy of D

fromkeys(iterable, value=None, /)

Create a new dictionary with keys from iterable and values set to value.

get(key, default=None, /)

Return the value for key if key is in the dictionary, else default.

items() -> a set-like object providing a view on D's items

keys() -> a set-like object providing a view on D's keys

pop(key, default=<unrepresentable>, /)

If the key is not found, return the default if given; otherwise, raise a KeyError.

popitem(/)

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault(key, default=None, /)

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update([E, ]**F) -> None.  Update D from dict/iterable E and F.

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v
In either case, this is followed by: for k in F: D[k] = F[k]

values() -> an object providing a view on D's values