In my last post I outlined some of the reasons why capitalizing on the promise of graph analytics takes thought and planning. Now, I’d like to focus on getting started, appropriately enough, at the beginning: with understanding the data itself.
So let’s discuss methods for gaining an initial understanding of our data, so that we can feed that newfound understanding back into our ingestion process. First, we need to query the database for basic information that tells us not only what is there but also how the relationships are expressed. Ideally, we would also find out whether there is a hierarchy to the relationships we find. By hierarchy, I mean that graphs can use containers, if you will, to organize information about different types of entities (e.g., events, people, places, or things), thereby describing far more useful structure than the basic triple alone allows.
To make sense of a new and unfamiliar dataset, we can run a series of exploratory queries that reveal the data’s macro structure and point us toward where to dig deeper. In short, instead of waving my hands in the air, muttering “presto” under my breath, and skipping over all the steps it took to reach those promised miraculous insights, I’ll walk through some of the techniques I commonly use to understand a new dataset. Bear in mind that few, if any, of these techniques are proprietary: many are taken directly from existing research on graph analytic techniques, and others are adapted from combinations of several techniques.
So, where to start? First, it makes sense to see how big the graph is. This is an easy, common-sense step, so I won’t belabor the point; the query below does the job:
How many triples are there?
SELECT (COUNT(*) AS ?count)
WHERE {
  ?s ?p ?o .
}
How many distinct predicates are there, and how many objects are associated with each?
SELECT ?p (COUNT(?o) AS ?count)
WHERE {
  ?s ?p ?o .
}
GROUP BY ?p
ORDER BY ?p
Counting all records is something of a given, but why count the predicates and the values associated with them? Because it gives us a quick read on how complex our graph is in terms of the kinds of relationships it contains.
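If you want to experiment with this grouping logic without a triple store handy, the same predicate tally can be sketched in plain Python over an in-memory list of triples. All of the triples and names below are made up purely for illustration:

```python
from collections import Counter

# Hypothetical (subject, predicate, object) triples standing in
# for whatever a real triple store would return.
triples = [
    ("alice", "knows", "bob"),
    ("alice", "worksFor", "acme"),
    ("bob", "knows", "carol"),
    ("carol", "worksFor", "acme"),
    ("carol", "knows", "alice"),
]

# Equivalent of SELECT ?p (COUNT(?o) AS ?count) ... GROUP BY ?p:
# tally how many statements use each predicate.
predicate_counts = Counter(p for _, p, _ in triples)

for predicate, count in predicate_counts.most_common():
    print(predicate, count)
# knows 3
# worksFor 2
```

Even on toy data like this, the distribution of predicates is an immediate signal of how many distinct relationship types the graph actually uses.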
Next, we want to explore this further. If an ontology exists, we should be able to see evidence of it in the predicates we just saw. In particular, we look for class and subclass declarations:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?p ?o (COUNT(?o) AS ?count)
WHERE {
  ?s ?p ?o .
  FILTER (?p IN (rdf:type, rdfs:Class, rdfs:subClassOf))
}
GROUP BY ?p ?o
ORDER BY ?count
With that, a bit more structure emerges: we can see what our ontology looks like. Next, it might be useful to check whether our data has any implicit structure based on these findings:
SELECT ?type ?p (COUNT(?o) AS ?count)
WHERE {
  ?s a ?type .
  ?s ?p ?o .
}
GROUP BY ?type ?p
ORDER BY ?type
So now we can see that there is indeed a more complex structure here. Perhaps it would be useful to visualize this as a graph instead?
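As a rough sketch of how that visualization might be prepared, the Python below collects, for each entity type, the predicates attached to it, producing a type-to-predicate edge list you could hand to any graph-drawing tool. The triples and names are hypothetical, with the string "a" standing in for rdf:type:

```python
from collections import defaultdict

# Hypothetical triples, this time including type statements.
triples = [
    ("alice", "a", "Person"),
    ("bob", "a", "Person"),
    ("acme", "a", "Company"),
    ("alice", "knows", "bob"),
    ("alice", "worksFor", "acme"),
]

# Map each subject to its declared type.
types = {s: o for s, p, o in triples if p == "a"}

# For each entity type, gather the predicates hanging off it --
# the same grouping the query above produces, arranged as edges.
edges = defaultdict(set)
for s, p, o in triples:
    if p != "a":
        # Fall back to a placeholder for untyped subjects.
        edges[types.get(s, "UnknownType")].add(p)

for entity_type, predicates in sorted(edges.items()):
    print(entity_type, "->", ", ".join(sorted(predicates)))
# Person -> knows, worksFor
```

From here, each (type, predicate) pair could become an edge in a small diagram of the schema, which is often far easier to reason about than the raw grouped counts.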
Now that we know our dataset a bit better, let’s summarize. We took a dataset we knew nothing about and discovered the following:
- The size of the graph
- Complexity (e.g., how many distinct relationship types)
- The ontology
- Whether a hierarchy exists beyond the basic triple structure
- What properties are associated with each entity type
In our next installment, we can use the information we have gathered to dig deeper into the graph. Now that we know something about the types of relationships that exist, we can use that information to determine the overall structure of the graph from the standpoint of each type of entity.
Please feel free to comment or reach out to me personally if you have any questions about this work or would like to dig a bit deeper into the thought process. If you have come up with your own workflow for accomplishing similar goals, I would love to hear about it!