ComptoxAI Databases
ComptoxAI relies mainly on two databases:
A graph database, implemented in Neo4j
A feature database, implemented in MongoDB
Briefly, the graph database stores the relationships between entities that are relevant to ComptoxAI, which together form a large, complex network. The feature database contains quantitative data tied to the entities that make up the graph database. We separate these into two databases largely for performance reasons: graph databases aren’t especially good at storing large quantities of numerical data for each entity, while relational and NoSQL databases (like MongoDB) don’t provide easy interaction with complex network structures.
The comptox_ai.db.GraphDB class aims to
make interacting with the two relatively painless. For example, you can extract
a graph from the graph database, and in a single command, fetch the feature
data corresponding to the entities in that graph.
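The division of labor between the two stores can be pictured with plain Python dictionaries. This is only a toy sketch of the idea, not ComptoxAI code: the node identifiers, relationship names, feature values, and both helper functions are hypothetical stand-ins.

```python
# Toy stand-ins for the two stores; every name and value here is made up.
# "Graph database": typed edges between entity identifiers.
edges = [
    ("chem:1", "BINDS", "gene:7"),
    ("chem:2", "BINDS", "gene:7"),
    ("gene:7", "INTERACTS_WITH", "gene:9"),
]

# "Feature database": numeric vectors keyed by the same identifiers.
features = {
    "chem:1": [0.12, 3.4, 1.0],
    "chem:2": [0.56, 2.1, 0.0],
    "gene:7": [7.7],
}


def extract_subgraph(edge_list, rel_type):
    """Return the nodes and edges belonging to one relationship type."""
    sub_edges = [e for e in edge_list if e[1] == rel_type]
    nodes = sorted({n for src, _, dst in sub_edges for n in (src, dst)})
    return nodes, sub_edges


def features_for_subgraph(nodes, feature_store):
    """Look up the feature vector for each node (None if it has none)."""
    return {n: feature_store.get(n) for n in nodes}


nodes, sub_edges = extract_subgraph(edges, "BINDS")
node_features = features_for_subgraph(nodes, features)
```

The graph side answers "which entities are connected", and the feature side is consulted only for the entities that survive that step.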
Graph database software is changing constantly. We’re more than willing to consider migrating to a single database solution when a good option is available. If you think you know of a good alternative, let us know on GitHub.
- class comptox_ai.db.GraphDB(config_file=None, verbose=False, username=None, password=None, hostname=None)
A Neo4j graph database containing ComptoxAI graph data.
- Parameters:
- config_file : str, default None
Relative path to a config file containing a “NEO4J” block, as described below. If None, ComptoxAI will look in the ComptoxAI root directory for a “CONFIG.cfg” file and then for a “CONFIG-default.cfg” file, in that order. If neither file can be found, an exception will be raised.
- verbose : bool, default False
Sets verbosity to on or off. If True, status information will be reported to the user occasionally.
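The config lookup order described for `config_file` can be sketched with the standard library. This is an illustration under stated assumptions rather than ComptoxAI's actual loader: the `find_config` helper, the key names inside the “NEO4J” block, and the choice of `FileNotFoundError` are all hypothetical.

```python
import configparser
import tempfile
from pathlib import Path


def find_config(root, config_file=None):
    """Return the config path to use, following the lookup order above."""
    if config_file is not None:
        return Path(config_file)
    for name in ("CONFIG.cfg", "CONFIG-default.cfg"):
        candidate = Path(root) / name
        if candidate.is_file():
            return candidate
    # The docs only say "an exception"; the exact type is an assumption.
    raise FileNotFoundError("No config file found in " + str(root))


# Demonstrate with a throwaway directory that holds only the default file.
with tempfile.TemporaryDirectory() as root:
    default_cfg = Path(root) / "CONFIG-default.cfg"
    # Key names in the NEO4J block are assumptions for illustration.
    default_cfg.write_text(
        "[NEO4J]\nUsername = neo4j\nPassword = secret\nHostname = localhost\n"
    )
    chosen = find_config(root)
    parser = configparser.ConfigParser()
    parser.read(chosen)
    neo4j_settings = dict(parser["NEO4J"])
    chosen_name = chosen.name
```

Note that `configparser` lowercases option names by default, so the parsed keys come back as `username`, `password`, and `hostname`.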
Methods
build_graph_cypher_projection(graph_name, ...): Create a new graph in the Neo4j Graph Catalog via a Cypher projection.
build_graph_native_projection(graph_name, ...): Create a new graph in the Neo4j Graph Catalog via a native projection.
drop_all_existing_graphs(): Delete all graphs currently stored in the GDS graph catalog.
drop_existing_graph(graph_name): Delete a single graph from the GDS graph catalog by graph name.
export_graph(graph_name[, to]): Export a graph stored in the GDS graph catalog to a set of CSV files.
fetch(field, operator, value[, what, ...]): Create and execute a query to retrieve nodes, edges, or both.
fetch_chemical_list(list_name): Fetch all chemicals that are members of a chemical list.
fetch_node_type(node_label): Fetch an entire class of nodes from the Neo4j graph database.
fetch_nodes(node_type, property, values): Fetch nodes by node property value.
fetch_relationships(relationship_type, ...): Fetch edges (relationships) from the Neo4j graph database.
find_node([name, properties]): Find a single node either by name or by property filter(s).
find_nodes([properties, node_types]): Find multiple nodes by node properties and/or labels.
find_relationships(): Find relationships by subject/object nodes and/or relationship type.
get_graph_statistics(): Fetch statistics for the connected graph database.
get_metagraph(): Examine the graph and construct a metagraph, which describes all of the node types and relationship types in the overall graph database.
list_existing_graphs(): Fetch a list of projected subgraphs stored in the GDS graph catalog.
run_cypher(qry_str[, verbose]): Execute a Cypher query on the Neo4j graph database.
stream_named_graph(graph_name): Stream a named GDS graph into Python for further processing.
- build_graph_cypher_projection(graph_name, node_query, relationship_query, config_dict=None)
Create a new graph in the Neo4j Graph Catalog via a Cypher projection.
Examples
>>> g = GraphDB()
>>> g.build_graph_cypher_projection(...)
- build_graph_native_projection(graph_name, node_types, relationship_types='all', config_dict=None)
Create a new graph in the Neo4j Graph Catalog via a native projection.
- Parameters:
- graph_name : str
A name for identifying the new graph. If a graph with this name already exists, a ValueError will be raised.
- node_types : str, list of str, or dict
Node projection for the new graph. This can be a single node label, a list of node labels, or a node projection in the format described under Notes.
Notes
ComptoxAI is meant to hide the implementation and usage details of graph databases from the user, but some advanced features do expose the syntax used in the Neo4j and MongoDB internals. This is especially true when building graph projections in the graph catalog. The components of a projection are outlined below.
NODE PROJECTIONS:
(corresponding argument: `node_types`)
Node projections take the following format:
{
    <node-label-1>: {
        label: <neo4j-label>,
        properties: <node-property-mappings>
    },
    <node-label-2>: {
        label: <neo4j-label>,
        properties: <node-property-mappings>
    },
    // ...
    <node-label-n>: {
        label: <neo4j-label>,
        properties: <node-property-mappings>
    }
}
where
- node-label-i is a name for a node label in the projected graph (it can be the same as or different from the label already in Neo4j),
- neo4j-label is a node label to match against in the graph database, and
- node-property-mappings are filters against Neo4j node properties, as defined below.
NODE PROPERTY MAPPINGS:
RELATIONSHIP PROJECTIONS:
Examples
>>> g = GraphDB()
>>> g.build_graph_native_projection(
...     graph_name="g1",
...     node_types=['Gene', 'StructuralEntity'],
...     relationship_types="all"
... )
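The dict form of a node projection can be assembled programmatically. The helper below is a hypothetical sketch, not part of ComptoxAI; it assumes each projected label reuses its Neo4j label name and that property mappings can be given as a plain list of property names, which the notes above leave unspecified.

```python
def make_node_projection(labels, properties=None):
    """Build a dict-style node projection from a list of Neo4j labels.

    Each projected label reuses the Neo4j label name. `properties` maps a
    label to the node properties to carry over (default: none).
    """
    properties = properties or {}
    return {
        label: {"label": label, "properties": properties.get(label, [])}
        for label in labels
    }


projection = make_node_projection(
    ["Gene", "Chemical"],
    properties={"Chemical": ["molWeight"]},  # hypothetical numeric property
)
```

The resulting dict has one entry per label, each with the `label` and `properties` keys shown in the format above.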
- drop_all_existing_graphs()
Delete all graphs currently stored in the GDS graph catalog.
- Returns:
- list
A list of dicts describing the graphs that were dropped as a result of calling this method. The dicts follow the same format as the list elements returned by list_existing_graphs().
- drop_existing_graph(graph_name)
Delete a single graph from the GDS graph catalog by graph name.
- Parameters:
- graph_name : str
The name of a graph, corresponding to the ‘graphName’ field in the graph’s entry within the GDS graph catalog.
- Returns:
- dict
A dict describing the graph that was dropped as a result of calling this method. The dict follows the same format as the list elements returned by list_existing_graphs().
- export_graph(graph_name, to='db')
Export a graph stored in the GDS graph catalog to a set of CSV files.
- Parameters:
- graph_name : str
The name of a graph, corresponding to the ‘graphName’ field in the graph’s entry within the GDS graph catalog.
- fetch(field, operator, value, what='both', register_graph=True, negate=False, query_type='cypher', **kwargs)
Create and execute a query to retrieve nodes, edges, or both.
- Parameters:
- field : str
A property label.
- what : {‘both’, ‘nodes’, ‘edges’}
The type of objects to fetch from the graph database. Note that this functions independently from any subgraph registered in Neo4j during query execution: if register_graph is True, an induced subgraph will be registered in the database, but the components returned by this method call may be only the nodes or edges contained in that subgraph.
- filter : str
‘Cypher-like’ filter statement, equivalent to a WHERE clause used in a Neo4j Cypher query (analogous to SQL WHERE clauses).
- query_type : {‘cypher’, ‘native’}
Whether to create a graph using a Cypher projection or a native projection. The ‘standard’ approach is to use a Cypher projection, but native projections can be (a) more performant and (b) easier to use for creating very large subgraphs (e.g., all nodes of several or more types that exist in all of ComptoxAI). See “Notes”, below, for more information, as well as https://neo4j.com/docs/graph-data-science/current/management-ops/graph-catalog-ops/#catalog-graph-create.
Warning
This function is incomplete and should not be used until we can fix its behavior. Specifically, Neo4j’s GDS library does not support non-numeric node or edge properties in any of its graph catalog-related subroutines.
- fetch_chemical_list(list_name)
Fetch all chemicals that are members of a chemical list.
- Parameters:
- list_name : str
Name (or acronym) corresponding to a Chemical List in ComptoxAI’s graph database.
- Returns:
- list_data : dict
Metadata corresponding to the matched list.
- chemicals : list of dict
Chemical nodes that are members of the chemical list.
- fetch_node_type(node_label)
Fetch an entire class of nodes from the Neo4j graph database.
- Parameters:
- node_label : str
Node label corresponding to a class of entities in the database.
- Returns:
- generator of dict
Warning
Since many entities may be members of a single class, this method may take a very long time to run and/or be very demanding on computing resources.
- fetch_nodes(node_type, property, values)
Fetch nodes by node property value.
Allows users to filter by a single node type (i.e., ontology class).
- Parameters:
- node_type : str
Node type on which to filter all results. Can speed up queries significantly.
- property : str
Node property to match against.
- values : str or list
Value or list of values on which to match property.
- Returns:
- list of dict
Each element in the list corresponds to a single node. If no matches are found in the database, an empty list will be returned.
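A lookup of this kind maps naturally onto a parameterized Cypher query. The sketch below shows one way such a query could be assembled; `build_fetch_nodes_query` and the `commonName` property are hypothetical, and this is not necessarily how ComptoxAI builds its queries internally.

```python
def build_fetch_nodes_query(node_type, prop, values):
    """Return (cypher, params) matching nodes of one label by property value."""
    if isinstance(values, str):
        values = [values]  # accept a single value as well as a list
    # Labels and property names cannot be passed as Cypher parameters, so
    # they are interpolated; backticks allow names with spaces or punctuation.
    cypher = (
        f"MATCH (n:`{node_type}`) "
        f"WHERE n.`{prop}` IN $values "
        "RETURN n"
    )
    return cypher, {"values": list(values)}


cypher, params = build_fetch_nodes_query("Chemical", "commonName", "Aspirin")
```

Passing the values as a query parameter, rather than formatting them into the string, keeps quoting concerns out of the query text. Restricting the match to one label is what lets the database narrow the search.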
- fetch_relationships(relationship_type, from_label, to_label)
Fetch edges (relationships) from the Neo4j graph database.
- find_node(name=None, properties=None)
Find a single node either by name or by property filter(s).
- find_nodes(properties={}, node_types=[])
Find multiple nodes by node properties and/or labels.
- Parameters:
- properties : dict
Dict of property values to match in the database query. Each key of properties should be a (case-sensitive) node property, and each value should be the value of that property (case- and type-sensitive).
- node_types : list of str
Case-sensitive list of strings representing node labels (node types) to include in the results. Two or more node types in a single query may significantly increase runtime. When multiple node labels are given, the results will be the union of the property queries applied to each label.
- Returns:
- generator of dict
A generator yielding dict representations of nodes matching the given query.
Notes
The value returned in the event of a successful query can be extremely large. To improve performance, the results are returned as a generator rather than a list.
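The generator behavior and the per-label union can be illustrated without a database. The function below is a toy in-memory stand-in, not the real method; the node records, labels, and property names are all made up.

```python
from itertools import islice

# Toy in-memory "database"; labels, properties, and values are made up.
NODES = [
    {"label": "Chemical", "name": "A", "flag": True},
    {"label": "Chemical", "name": "B", "flag": False},
    {"label": "Gene", "name": "C", "flag": True},
    {"label": "Disease", "name": "D", "flag": True},
]


def find_nodes_sketch(properties, node_types):
    """Yield nodes matching all `properties`, one label at a time.

    Running the same property filter per label and chaining the results
    gives the union across labels without building a full list in memory.
    """
    for label in node_types:
        for node in NODES:
            if node["label"] != label:
                continue
            if all(node.get(k) == v for k, v in properties.items()):
                yield node


matches = find_nodes_sketch({"flag": True}, ["Chemical", "Gene"])
first_two = list(islice(matches, 2))  # pull only as many results as needed
```

Because the result is a generator, it can be iterated only once; wrap it in `list(...)` if the matches need to be reused.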
- find_relationships()
Find relationships by subject/object nodes and/or relationship type.
- get_graph_statistics()
Fetch statistics for the connected graph database.
This method essentially calls the APOC procedure apoc.meta.stats() and formats the output.
- Returns:
- dict
Dict of statistics describing the graph database.
- Raises:
- RuntimeError
If not currently connected to a graph database, or if the apoc.meta procedures are not installed or available.
- get_metagraph()
Examine the graph and construct a metagraph, which describes all of the node types and relationship types in the overall graph database.
Notes
We currently don’t run this upon GraphDB instantiation, but it may be prudent to start doing that at some point in the future. It’s not an extremely quick operation, but it’s also not prohibitively slow.
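To make the idea of a metagraph concrete, the sketch below summarizes a tiny in-memory graph by its node types and by the (source type, relationship, target type) patterns that occur in it. The data, the `build_metagraph` helper, and the layout of the returned dict are hypothetical and do not reflect the object ComptoxAI actually returns.

```python
from collections import Counter

# Toy graph: node id -> label, plus typed edges. All names are made up.
node_labels = {1: "Chemical", 2: "Chemical", 3: "Gene", 4: "Disease"}
edges = [(1, "BINDS", 3), (2, "BINDS", 3), (3, "ASSOCIATED_WITH", 4)]


def build_metagraph(node_labels, edges):
    """Summarize a graph as counts of node types and typed connections."""
    node_types = Counter(node_labels.values())
    edge_types = Counter(
        (node_labels[src], rel, node_labels[dst]) for src, rel, dst in edges
    )
    return {"node_types": dict(node_types), "edge_types": dict(edge_types)}


metagraph = build_metagraph(node_labels, edges)
```

Every node and edge has to be visited once to build the summary, which is consistent with the operation being neither instant nor prohibitively slow on a large graph.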
- list_existing_graphs()
Fetch a list of projected subgraphs stored in the GDS graph catalog.
- Returns:
- list
A list of graphs in the GDS graph catalog. If no graphs exist, this will be the empty list [].
- run_cypher(qry_str, verbose=True)
Execute a Cypher query on the Neo4j graph database.
- Parameters:
- qry_str : str
A string containing the Cypher query to run on the graph database server.
- Returns:
- list
The data returned in response to the Cypher query.
Examples
>>> from comptox_ai.db import GraphDB
>>> g = GraphDB()
>>> g.run_cypher("MATCH (c:Chemical) RETURN COUNT(c) AS num_chems;")
[{'num_chems': 719599}]
- stream_named_graph(graph_name)
Stream a named GDS graph into Python for further processing.
- Parameters:
- graph_name : str
The name of a graph in the GDS catalog.
- class comptox_ai.db.GraphExporter(db: GraphDB, verbose=True)
Exporter for serializing graphs to graph-like formats, meant for consumption by graph-based libraries (like DGL, PyTorch Geometric, NetworkX, etc.).
Notes
Users shouldn’t usually try to interact with this class directly. They should instead call the appropriate GraphDB method (e.g., db.export_graph()).
Methods
stream_subgraph(node_types[, relationship_types]): Extract a subgraph from the graph database and return it as a Python dictionary object.
- stream_subgraph(node_types, relationship_types='all')
Extract a subgraph from the graph database and return it as a Python dictionary object.
- Parameters:
- node_types : list of str
A list of node types to include in the subgraph. Node type names are case-sensitive.
- relationship_types : list or str
A list of relationship labels (edge types) to include in the subgraph. Alternatively, the string ‘all’ may be used to denote that all relationships in the induced subgraph should be included. Relationship type names are case-sensitive.
- Returns:
- dict
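The notion of an induced subgraph restricted by node type, with an optional relationship filter, can be sketched in pure Python. The function, the sample data, and the `{"nodes": ..., "edges": ...}` layout are hypothetical illustrations; the structure of the dict returned by the real method is not specified here.

```python
def induced_subgraph(nodes, edges, node_types, relationship_types="all"):
    """Keep nodes of the given types and the edges running between them."""
    kept = {nid: lbl for nid, lbl in nodes.items() if lbl in node_types}
    kept_edges = [
        (src, rel, dst)
        for src, rel, dst in edges
        if src in kept
        and dst in kept
        and (relationship_types == "all" or rel in relationship_types)
    ]
    return {"nodes": kept, "edges": kept_edges}


# Toy data; labels and relationship names are made up.
nodes = {1: "Chemical", 2: "Gene", 3: "Disease"}
edges = [(1, "BINDS", 2), (2, "ASSOCIATED_WITH", 3)]
subgraph = induced_subgraph(nodes, edges, ["Chemical", "Gene"])
```

An edge survives only when both of its endpoints are kept, which is what makes the subgraph "induced" by the selected node types.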