ComptoxAI Databases
ComptoxAI relies mainly on two databases:
A graph database, implemented in Neo4j
A feature database, implemented in MongoDB
Briefly, the graph database is designed to show the relationships between entities that are relevant to ComptoxAI, comprising a large, complex network structure. The feature database contains quantitative data tied to the entities that make up the graph database. We separate these into two databases largely for performance reasons - graph databases aren’t especially good at storing large quantities of numerical data for each entity, while relational and NoSQL databases (like MongoDB) don’t provide easy interaction with complex network structures.
The comptox_ai.db.GraphDB class aims to make interacting with the two relatively painless. For example, you can extract a graph from the graph database and, in a single command, fetch the feature data corresponding to the entities in that graph.
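The division of labor between the two databases can be illustrated with a minimal in-memory sketch. It does not use ComptoxAI's API: the function `attach_features`, the `id` key, and the sample records are hypothetical stand-ins for a graph query result and a feature store.

```python
from typing import Any, Dict, List


def attach_features(
    nodes: List[Dict[str, Any]],
    feature_store: Dict[str, Dict[str, float]],
    id_key: str = "id",
) -> List[Dict[str, Any]]:
    """Join graph nodes with their feature records by a shared identifier.

    Nodes without an entry in the feature store get an empty feature dict.
    The input nodes are not modified; new dicts are returned.
    """
    return [
        {**node, "features": feature_store.get(node[id_key], {})}
        for node in nodes
    ]


# Hypothetical data: structure lives in one place, numbers in another.
graph_nodes = [
    {"id": "chem-1", "label": "Chemical"},
    {"id": "gene-7", "label": "Gene"},
]
features = {"chem-1": {"mol_weight": 180.16}}

enriched = attach_features(graph_nodes, features)
```

Keeping the join keyed on a single identifier is what lets the graph store stay small while the bulky numerical data lives elsewhere.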
Graph database software is changing constantly. We’re more than willing to consider migrating to a single database solution when a good option is available. If you think you know of a good alternative, let us know on GitHub.
- class comptox_ai.db.GraphDB(username=None, password=None, hostname=None, verbose=False)
A Neo4j graph database containing ComptoxAI graph data.
- Parameters:
- verbose: bool, default False
Sets verbosity to on or off. If True, status information will be returned to the user occasionally.
Methods
- build_graph_cypher_projection(graph_name, ...): Create a new graph in the Neo4j Graph Catalog via a Cypher projection.
- build_graph_native_projection(graph_name, ...): Create a new graph in the Neo4j Graph Catalog via a native projection.
- convert_ids(node_type, from_id, to_id, ids): Produce a mapping of IDs for a given node type from one terminology / database to another.
- drop_all_existing_graphs(): Delete all graphs currently stored in the GDS graph catalog.
- drop_existing_graph(graph_name): Delete a single graph from the GDS graph catalog by graph name.
- export_graph(graph_name[, to]): Export a graph stored in the GDS graph catalog to a set of CSV files.
- fetch(field, operator, value[, what, ...]): Create and execute a query to retrieve nodes, edges, or both.
- fetch_chemical_list(list_name): Fetch all chemicals that are members of a chemical list.
- fetch_node_type(node_label): Fetch an entire class of nodes from the Neo4j graph database.
- fetch_nodes(node_type, property, values): Fetch nodes by node property value.
- fetch_relationships(relationship_type, ...): Fetch edges (relationships) from the Neo4j graph database.
- find_node([name, properties]): Find a single node either by name or by property filter(s).
- find_nodes([properties, node_types]): Find multiple nodes by node properties and/or labels.
- find_relationships(): Find relationships by subject/object nodes and/or relationship type.
- find_shortest_paths(node1, node2[, cleaned])
- get_graph_statistics(): Fetch statistics for the connected graph database.
- get_metagraph(): Examine the graph and construct a metagraph, which describes all of the node types and relationship types in the overall graph database.
- list_existing_graphs(): Fetch a list of projected subgraphs stored in the GDS graph catalog.
- run_cypher(qry_str[, verbose]): Execute a Cypher query on the Neo4j graph database.
- stream_named_graph(graph_name): Stream a named GDS graph into Python for further processing.
- build_graph_cypher_projection(graph_name, node_query, relationship_query, config_dict=None)
Create a new graph in the Neo4j Graph Catalog via a Cypher projection.
Examples
>>> g = GraphDB()
>>> g.build_graph_cypher_projection(...)
- build_graph_native_projection(graph_name, node_types, relationship_types='all', config_dict=None)
Create a new graph in the Neo4j Graph Catalog via a native projection.
- Parameters:
- graph_name: str
A (string) name for identifying the new graph. If a graph already exists with this name, a ValueError will be raised.
- node_proj: str, list of str, or dict of
Node projection for the new graph. This can be either a single node label, a list of node labels, or a node projection
Notes
ComptoxAI is meant to hide the implementation and usage details of graph databases from the user, but some advanced features do expose the syntax used in the Neo4j and MongoDB internals. This is especially true when building graph projections in the graph catalog. The relevant components are outlined below.
NODE PROJECTIONS:
(corresponding argument: `node_proj`)
Node projections take the following format:
{
    <node-label-1>: {
        label: <neo4j-label>,
        properties: <node-property-mappings>
    },
    <node-label-2>: {
        label: <neo4j-label>,
        properties: <node-property-mappings>
    },
    // …
    <node-label-n>: {
        label: <neo4j-label>,
        properties: <node-property-mappings>
    }
}
where
- node-label-i is a name for a node label in the projected graph (it can be the same as or different from the label already in Neo4j),
- neo4j-label is a node label to match against in the graph database, and
- node-property-mappings are filters against Neo4j node properties, as defined below.
NODE PROPERTY MAPPINGS:
RELATIONSHIP PROJECTIONS:
Examples
>>> g = GraphDB()
>>> g.build_graph_native_projection(
...     graph_name="g1",
...     node_proj=['Gene', 'StructuralEntity'],
...     relationship_proj="*"
... )
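For the dict form of a node projection described in the Notes, a small helper can make the nesting less error-prone. The following is only a sketch: `make_node_projection` is a hypothetical name, and because the property-mapping format is not spelled out here, mappings are passed through unchanged and default to an empty dict.

```python
from typing import Any, Dict, Optional


def make_node_projection(
    labels: Dict[str, str],
    property_mappings: Optional[Dict[str, Any]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Build a dict shaped like the node projection format in the Notes.

    `labels` maps each projected label name to the Neo4j label to match.
    `property_mappings` optionally maps a projected label name to its
    property mappings; their format is an assumption left to the caller.
    """
    property_mappings = property_mappings or {}
    return {
        projected: {
            "label": neo4j_label,
            "properties": property_mappings.get(projected, {}),
        }
        for projected, neo4j_label in labels.items()
    }


# Project 'StructuralEntity' nodes under the shorter name 'Structure'.
node_proj = make_node_projection(
    {"Gene": "Gene", "Structure": "StructuralEntity"}
)
```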
- convert_ids(node_type, from_id, to_id, ids)
Produce a mapping of IDs for a given node type from one terminology / database to another.
- Parameters:
- node_type: str
Node type of the entities
- from_id: str
- to_id: str
- ids: list of str
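To make the idea of an ID mapping concrete, here is a sketch over plain dicts rather than a live database. The return type of `convert_ids` is not documented here, so the dict result is an assumption, and `map_ids`, `id_a`, and `id_b` are hypothetical names.

```python
from typing import Dict, Iterable, List


def map_ids(
    nodes: Iterable[Dict[str, str]],
    from_id: str,
    to_id: str,
    ids: List[str],
) -> Dict[str, str]:
    """Map each requested source ID to its counterpart in another ID system.

    IDs with no matching node, or whose node lacks `to_id`, are skipped.
    """
    wanted = set(ids)
    return {
        node[from_id]: node[to_id]
        for node in nodes
        if node.get(from_id) in wanted and to_id in node
    }


# Made-up nodes carrying identifiers from two hypothetical terminologies.
example_nodes = [
    {"id_a": "A-001", "id_b": "B-901"},
    {"id_a": "A-002", "id_b": "B-902"},
    {"id_a": "A-003"},  # no identifier in the target terminology
]
mapping = map_ids(example_nodes, "id_a", "id_b", ["A-001", "A-003"])
```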
- drop_all_existing_graphs()
Delete all graphs currently stored in the GDS graph catalog.
- Returns:
- list
A list of dicts describing the graphs that were dropped as a result of calling this method. The dicts follow the same format as one of the list elements returned by calling list_existing_graphs().
- drop_existing_graph(graph_name)
Delete a single graph from the GDS graph catalog by graph name.
- Parameters:
- graph_name: str
A name of a graph, corresponding to the `'graphName'` field in the graph’s entry within the GDS graph catalog.
- Returns:
- dict
A dict object describing the graph that was dropped as a result of calling this method. The dict follows the same format as one of the list elements returned by calling list_existing_graphs().
- export_graph(graph_name, to='db')
Export a graph stored in the GDS graph catalog to a set of CSV files.
- Parameters:
- graph_name: str
A name of a graph, corresponding to the `'graphName'` field in the graph’s entry within the GDS graph catalog.
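As an illustration of what a CSV export of a graph can look like, the sketch below serializes node and edge records with Python's standard `csv` module. The file layout that `export_graph` actually produces is not documented here, so the column names, the `rows_to_csv` helper, and the `INTERACTS_WITH` relationship type are assumptions.

```python
import csv
import io
from typing import Dict, List


def rows_to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize a list of flat dicts to CSV text with a header row."""
    if not rows:
        return ""
    # Use the union of keys (in first-seen order) so ragged rows still fit.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# One table for nodes and one for edges, mirroring a "set of CSV files".
example_nodes = [{"id": 1, "label": "Gene"}, {"id": 2, "label": "Chemical"}]
example_edges = [{"source": 2, "target": 1, "type": "INTERACTS_WITH"}]

nodes_csv = rows_to_csv(example_nodes)
edges_csv = rows_to_csv(example_edges)
```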
- fetch(field, operator, value, what='both', register_graph=True, negate=False, query_type='cypher', **kwargs)
Create and execute a query to retrieve nodes, edges, or both.
- Parameters:
- field: str
A property label.
- what: {'both', 'nodes', 'edges'}
The type of objects to fetch from the graph database. Note that this functions independently from any subgraph registered in Neo4j during query execution: if register_graph is True, an induced subgraph will be registered in the database, but the components returned by this method call may be only the nodes or edges contained in that subgraph.
- filter: str
‘Cypher-like’ filter statement, equivalent to a WHERE clause used in a Neo4j Cypher query (analogous to SQL WHERE clauses).
- query_type: {'cypher', 'native'}
Whether to create a graph using a Cypher projection or a native projection. The ‘standard’ approach is to use a Cypher projection, but native projections can be (a) more performant and (b) easier to use when creating very large subgraphs (e.g., all nodes of several or more types that exist in all of ComptoxAI). See “Notes”, below, for more information, as well as https://neo4j.com/docs/graph-data-science/current/management-ops/graph-catalog-ops/#catalog-graph-create.
Warning
This function is incomplete and should not be used until we can fix its behavior. Specifically, Neo4j’s GDS library does not support non-numeric node or edge properties in any of its graph catalog-related subroutines.
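Independently of the method's current limitations, the way `field`, `operator`, `value`, and `negate` could combine into a Cypher WHERE clause can be sketched as follows. This is not ComptoxAI's implementation: `build_where_clause`, the operator whitelist, the `alias` argument, and the `commonName` property are assumptions made for illustration.

```python
from typing import Any, Dict, Tuple

# Comparison operators accepted by this sketch (all are valid Cypher).
ALLOWED_OPERATORS = {
    "=", "<>", "<", "<=", ">", ">=",
    "CONTAINS", "STARTS WITH", "ENDS WITH",
}


def build_where_clause(
    field: str,
    operator: str,
    value: Any,
    negate: bool = False,
    alias: str = "n",
) -> Tuple[str, Dict[str, Any]]:
    """Return a WHERE clause and the parameters to send along with it."""
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"Unsupported operator: {operator!r}")
    # Property names cannot be parameterized, so validate before interpolating.
    if not field.isidentifier():
        raise ValueError(f"Invalid property name: {field!r}")
    condition = f"{alias}.{field} {operator} $value"
    if negate:
        condition = f"NOT ({condition})"
    return f"WHERE {condition}", {"value": value}


clause, params = build_where_clause("commonName", "CONTAINS", "benz", negate=True)
```

Passing the value as a query parameter rather than formatting it into the string avoids quoting problems and injection through user-supplied values.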
- fetch_chemical_list(list_name)
Fetch all chemicals that are members of a chemical list.
- Parameters:
- list_name: str
Name (or acronym) corresponding to a Chemical List in ComptoxAI’s graph database.
- Returns:
- list_data: dict
Metadata corresponding to the matched list
- chemicals: list of dict
Chemical nodes that are members of the chemical list
- fetch_node_type(node_label)
Fetch an entire class of nodes from the Neo4j graph database.
- Parameters:
- node_label: str
Node label corresponding to a class of entities in the database.
- Returns:
- generator of dict
Warning
Since many entities may be members of a single class, users are cautioned that this method may take a very long time to run and/or be very demanding on computing resources.
- fetch_nodes(node_type, property, values)
Fetch nodes by node property value.
Allows users to filter by a single node type (i.e., ontology class).
- Parameters:
- node_type: str
Node type on which to filter all results. Can speed up queries significantly.
- property: str
Node property to match against.
- values: str or list
Value or list of values on which to match `property`.
- Returns:
- list of dict
Each element in the list corresponds to a single node. If no matches are found in the database, an empty list will be returned.
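One plausible way to express a "node type plus property values" lookup in Cypher is a label match combined with an `IN` test. The sketch below only builds the query text and parameters; `build_fetch_nodes_query` and the `geneSymbol` property are hypothetical, and the query ComptoxAI actually issues may differ.

```python
from typing import Any, Dict, List, Tuple, Union


def build_fetch_nodes_query(
    node_type: str, prop: str, values: Union[str, List[Any]]
) -> Tuple[str, Dict[str, Any]]:
    """Build a parameterized Cypher query matching nodes by property value."""
    # Labels and property names cannot be passed as Cypher parameters,
    # so validate them before interpolating into the query text.
    for name in (node_type, prop):
        if not name.isidentifier():
            raise ValueError(f"Invalid identifier: {name!r}")
    # Accept a single value or a list, mirroring the `values` parameter.
    value_list = [values] if isinstance(values, str) else list(values)
    query = f"MATCH (n:{node_type}) WHERE n.{prop} IN $values RETURN n"
    return query, {"values": value_list}


query, params = build_fetch_nodes_query("Gene", "geneSymbol", "TP53")
```

Restricting the match to a single label is what lets the database avoid scanning every node, which is consistent with the note that filtering by node type can speed up queries.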
- fetch_relationships(relationship_type, from_label, to_label)
Fetch edges (relationships) from the Neo4j graph database.
- find_node(name=None, properties=None)
Find a single node either by name or by property filter(s).
- find_nodes(properties={}, node_types=[])
Find multiple nodes by node properties and/or labels.
- Parameters:
- properties: dict
Dict of property values to match in the database query. Each key of `properties` should be a (case-sensitive) node property, and each value should be the value of that property (case- and type-sensitive).
- node_types: list of str
Case-sensitive list of strings representing node labels (node types) to include in the results. Two or more node types in a single query may significantly increase runtime. When multiple node labels are given, the results will be the union of all property queries when applied
- Returns:
- generator of dict
A generator containing dict representations of nodes matching the given query.
Notes
The value returned in the event of a successful query can be extremely large. To improve performance, the results are returned as a generator rather than a list.
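The practical consequence of getting a generator back is that results are produced one at a time and can only be traversed once. The sketch below uses a stand-in generator, `fake_find_nodes` (hypothetical, not part of ComptoxAI), to show how to take a bounded number of results without materializing the rest.

```python
from itertools import islice
from typing import Dict, Iterator


def fake_find_nodes(n: int) -> Iterator[Dict[str, int]]:
    """Stand-in for a generator-returning query; yields one node at a time."""
    for i in range(n):
        yield {"node_id": i}


results = fake_find_nodes(1_000_000)

# Take only the first three matches; the remaining items are never produced.
first_three = list(islice(results, 3))

# A generator is single-pass: the next item continues where we left off.
fourth = next(results)
```

If the full result set is needed more than once, wrap the call in `list(...)` and accept the memory cost.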
- find_relationships()
Find relationships by subject/object nodes and/or relationship type.
- find_shortest_paths(node1, node2, cleaned=True)
- Parameters:
- node1comptox
- get_graph_statistics()
Fetch statistics for the connected graph database.
This method essentially calls apoc.meta.stats() and formats the output.
- Returns:
- dict
Dict of statistics describing the graph database.
- Raises:
- RuntimeError
If not currently connected to a graph database or the apoc.meta procedures are not installed/available.
- get_metagraph()
Examine the graph and construct a metagraph, which describes all of the node types and relationship types in the overall graph database.
Notes
We currently don’t run this upon GraphDB instantiation, but it may be prudent to start doing that at some point in the future. It’s not an extremely quick operation, but it’s also not prohibitively slow.
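A metagraph in this sense can be thought of as the set of node types together with the set of (source type, relationship type, target type) patterns that occur in the data. The sketch below derives both from small in-memory collections; `build_metagraph`, the returned dict layout, and the sample labels are assumptions, not the structure `get_metagraph()` returns.

```python
from typing import Dict, List, Set, Tuple


def build_metagraph(
    node_types: Dict[str, str],
    edges: List[Tuple[str, str, str]],
) -> Dict[str, object]:
    """Summarize a graph by its node types and typed relationship patterns.

    `node_types` maps node ID -> node type; each edge is
    (source ID, relationship type, target ID).
    """
    patterns: Set[Tuple[str, str, str]] = {
        (node_types[src], rel, node_types[dst]) for src, rel, dst in edges
    }
    return {
        "node_types": sorted(set(node_types.values())),
        "relationship_patterns": sorted(patterns),
    }


example_nodes = {"c1": "Chemical", "c2": "Chemical", "g1": "Gene"}
example_edges = [
    ("c1", "INTERACTS_WITH", "g1"),
    ("c2", "INTERACTS_WITH", "g1"),
]
metagraph = build_metagraph(example_nodes, example_edges)
```

Because every edge has to be visited once, the cost grows with the size of the graph, which fits the remark that the operation is neither instant nor prohibitively slow.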
- list_existing_graphs()
Fetch a list of projected subgraphs stored in the GDS graph catalog.
- Returns:
- list
A list of graphs in the GDS graph catalog. If no graphs exist, this will be the empty list [].
- run_cypher(qry_str, verbose=True)
Execute a Cypher query on the Neo4j graph database.
- Parameters:
- qry_str: str
A string containing the Cypher query to run on the graph database server.
- Returns:
- list
The data returned in response to the Cypher query.
Examples
>>> from comptox_ai.db import GraphDB
>>> g = GraphDB()
>>> g.run_cypher("MATCH (c:Chemical) RETURN COUNT(c) AS num_chems;")
[{'num_chems': 719599}]
- stream_named_graph(graph_name)
Stream a named GDS graph into Python for further processing.
- Parameters:
- graph_name: str
A name of a graph in the GDS catalog.
- class comptox_ai.db.GraphExporter(db: GraphDB, verbose=True)
Exporter for serializing graphs to graph-like formats, meant for consumption by graph-based libraries (like DGL, PyTorch Geometric, networkx, etc.).
Notes
Users shouldn’t usually try to interact with this class directly. They should instead call the appropriate GraphDB method (e.g., db.export_graph()).
Methods
- GraphDB([username, password, hostname, verbose]): A Neo4j graph database containing ComptoxAI graph data.
- stream_subgraph(node_types[, relationship_types]): Extract a subgraph from the graph database and return it as a Python dictionary object.
- stream_subgraph(node_types, relationship_types='all')
Extract a subgraph from the graph database and return it as a Python dictionary object.
- Parameters:
- node_types: list of str
A list of node types to include in the subgraph. Node type names are case-sensitive.
- relationship_types: list or str
A list of relationship labels (edge types) to include in the subgraph. Alternatively, the string 'all' may be used to denote that all relationships in the induced subgraph should be included. Relationship type names are case-sensitive.
- Returns:
- dict
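The selection logic described by the parameters, keeping nodes of the requested types and then the edges between them (optionally restricted by relationship type), can be sketched on in-memory data. The layout of the returned dict is not documented here, so the `nodes`/`edges` keys, the `extract_subgraph` helper, and the sample labels are assumptions.

```python
from typing import Dict, List, Optional, Set, Tuple, Union

Edge = Tuple[str, str, str]  # (source ID, relationship type, target ID)


def extract_subgraph(
    node_types: Dict[str, str],
    edges: List[Edge],
    keep_node_types: List[str],
    relationship_types: Union[str, List[str]] = "all",
) -> Dict[str, object]:
    """Return the subgraph induced by the given node types."""
    # Normalize the relationship filter: None means "keep every type".
    allowed: Optional[Set[str]]
    if relationship_types == "all":
        allowed = None
    elif isinstance(relationship_types, str):
        allowed = {relationship_types}
    else:
        allowed = set(relationship_types)

    keep = set(keep_node_types)
    nodes = {nid: ntype for nid, ntype in node_types.items() if ntype in keep}
    # An edge survives only if both endpoints survive (induced subgraph).
    kept_edges = [
        (src, rel, dst)
        for src, rel, dst in edges
        if src in nodes and dst in nodes and (allowed is None or rel in allowed)
    ]
    return {"nodes": nodes, "edges": kept_edges}


all_nodes = {"c1": "Chemical", "g1": "Gene", "d1": "Disease"}
all_edges = [
    ("c1", "INTERACTS_WITH", "g1"),
    ("g1", "ASSOCIATED_WITH", "d1"),
    ("c1", "TREATS", "d1"),
]
sub = extract_subgraph(all_nodes, all_edges, ["Chemical", "Gene"])
```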
- class comptox_ai.db.Node(db, node_type, search_params, return_first_match=False)
A node in ComptoxAI’s graph database.
This class is essentially an immutable dict that is populated at initialization time via a connection to ComptoxAI’s graph database.
- Parameters:
- db: comptox_ai.db.GraphDB
ComptoxAI graph database in which to perform the node search.
- node_type: str
Node label for the entity type being searched.
- search_params: dict
Dict of parameters for querying nodes in the database. Exact string matching will be used to query these node properties (i.e., all must be an exact match for the node to be identified). Dict keys are node property names, and dict values are node property values.
- return_first_match: bool
If False, searches that match multiple nodes will raise an exception. Otherwise, the first matching node in query results will be returned (all subsequent matches will be discarded).
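The combination of "exactly one match unless `return_first_match` is set" and "read-only after creation" can be sketched with the standard library's `types.MappingProxyType`. This is one possible approach rather than the class's actual implementation: `select_node` is a hypothetical helper, and `LookupError` is an assumed exception type, since the documentation does not say which exception is raised.

```python
from types import MappingProxyType
from typing import Any, Dict, List, Mapping


def select_node(
    matches: List[Dict[str, Any]], return_first_match: bool = False
) -> Mapping[str, Any]:
    """Pick a single node from query matches and expose it read-only."""
    if not matches:
        raise LookupError("No node matched the search parameters.")
    if len(matches) > 1 and not return_first_match:
        raise LookupError(f"{len(matches)} nodes matched; expected exactly one.")
    # Copy, then wrap in a read-only view so callers cannot mutate the result.
    return MappingProxyType(dict(matches[0]))


candidates = [{"name": "node-a"}, {"name": "node-b"}]
node = select_node(candidates, return_first_match=True)
```

Attempting `node["name"] = "other"` on the returned mapping raises `TypeError`, which is the behavior one would expect from an immutable dict.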
Methods
- clear()
- copy()
- fromkeys(iterable[, value]): Create a new dictionary with keys from iterable and values set to value.
- get(key[, default]): Return the value for key if key is in the dictionary, else default.
- items()
- keys()
- pop(key[, default]): If the key is not found, return the default if given; otherwise, raise a KeyError.
- popitem(/): Remove and return a (key, value) pair as a 2-tuple.
- setdefault(key[, default]): Insert key with a value of default if key is not in the dictionary.
- update([E, ]**F): Update D from dict/iterable E and F.
- values()
- clear() -> None. Remove all items from D.
- copy() -> a shallow copy of D
- fromkeys(iterable, value=None, /)
Create a new dictionary with keys from iterable and values set to value.
- get(key, default=None, /)
Return the value for key if key is in the dictionary, else default.
- items() -> a set-like object providing a view on D's items
- keys() -> a set-like object providing a view on D's keys
- pop(key, default=<unrepresentable>, /)
If the key is not found, return the default if given; otherwise, raise a KeyError.
- popitem(/)
Remove and return a (key, value) pair as a 2-tuple.
Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.
- setdefault(key, default=None, /)
Insert key with a value of default if key is not in the dictionary.
Return the value for key if key is in the dictionary, else default.
- update([E, ]**F) -> None. Update D from dict/iterable E and F.
If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v
In either case, this is followed by: for k in F: D[k] = F[k]
- values() -> an object providing a view on D's values