Making Sense of 50 Billion Triples: No Free Lunch

Grand claims have been made that graph databases allow easy ingest of all manner of disparate data, make sense of it, and uncover hidden relationships and meaning. This is, in fact, possible, but there are a few considerations you need to account for to make your database useful to an analyst charged with making sense of the information. There simply is no free lunch; where time and effort are saved in one place, they must be expended (at least partially) elsewhere.

Let’s take a look at the fundamental difference between graph databases and relational databases from which these claims stem. Rather than store data in rows and columns, graph databases store data in a simpler format that describes a series of simple relationships. Two entities (i.e., nodes or vertices) are connected by a directed edge to form a triple: subject, predicate and object. Because the only requirement for inserting data into a graph database is that each entry contain a subject, a predicate and an object, it stands to reason that ingesting disparate data sources is a much simpler affair than designing a schema, and potentially multiple tables, to store that data.
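To make the triple format concrete, here is a minimal sketch in Python. The `Triple` type, the `record_to_triples` helper and the sample record are all hypothetical names chosen for illustration; a real graph database has its own loader and identifier scheme.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def record_to_triples(record_id, record):
    """Turn one flat record into one triple per non-empty field.

    The record's identifier becomes the subject, each field name becomes
    a predicate, and each field value becomes an object.
    """
    return [
        Triple(record_id, field, str(value))
        for field, value in record.items()
        if value is not None
    ]


# Hypothetical source row; no schema had to be designed to load it.
row = {"name": "Ada Lovelace", "born": 1815, "employer": None}
triples = record_to_triples("person/1", row)
```

Note that nothing here checks what `name` or `born` mean, or whether another source spells them differently. That is exactly the convenience, and the problem, discussed next.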

Herein lies the rub: Though it is true that literally any data can simply be tossed into a graph database without any care, in the famous words of Mr. T, “I pity the fool” who has to actually sit down and make sense of the data that results from such a careless ingestion method. Further, I would hope that, at the very least, the poor soul responsible for analyzing the resulting hodgepodge would be provided with a lifetime supply of aspirin.

To standardize data prior to ingest, one needs to develop a mechanism that maps raw data to a particular ontology or taxonomy. To accomplish this, one can implement a rule-based system or even use machine learning to get the job done. The machine learning approach is in use to some degree for DBpedia, which employs a complex data-processing workflow to map Wikipedia data to the DBpedia ontologies. This approach is certainly feasible, but it takes us right back to where we started: we give up the benefit of easily ingesting data with minimal preprocessing.
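As a rough illustration of the rule-based option, the sketch below maps source field names onto ontology predicates before ingest. The rule table and the `normalize` function are hypothetical; `foaf:name` and `schema:birthDate` are used only as familiar example terms, not as a recommendation for any particular ontology.

```python
# Hypothetical mapping rules: lower-cased source field -> ontology predicate.
PREDICATE_RULES = {
    "name": "foaf:name",
    "full_name": "foaf:name",
    "dob": "schema:birthDate",
    "born": "schema:birthDate",
}


def normalize(triples, rules=PREDICATE_RULES):
    """Split triples into those a rule could map and those it could not."""
    mapped, unmapped = [], []
    for subject, predicate, obj in triples:
        key = predicate.strip().lower()
        if key in rules:
            mapped.append((subject, rules[key], obj))
        else:
            # No rule applies; someone has to decide what this field means.
            unmapped.append((subject, predicate, obj))
    return mapped, unmapped


raw = [
    ("person/1", "name", "Ada Lovelace"),
    ("person/1", "DOB", "1815-12-10"),
    ("person/1", "shoe_size", "6"),
]
mapped, unmapped = normalize(raw)
```

Every entry in the rule table is a decision somebody had to make about the source data in advance, and the `unmapped` list grows with every new source. That is the preprocessing effort the "just toss it in" claim was supposed to spare us.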

Given the goal of living up to the claim that one can simply toss data in, the more contextually appropriate method here is to ingest the data largely as is and then sort it out once it has been ingested. To do this with a very small graph, as seen in many examples, is trivial; to do it at scale, not so much. What about when we are looking at dozens or even hundreds of data sources? If we have enough knowledge and understanding of our source data, this would be feasible, but real datasets usually start off as relative mysteries to us. It is here that a graph database can truly shine.
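One plausible first step in sorting out data after it has been ingested as is (an assumption on my part for illustration, not a prescribed method) is to profile which predicates exist and which sources contributed them. The sketch below does this in plain Python over a small in-memory list; the `profile_predicates` function and the source names are hypothetical, and at billions of triples the same counting would be pushed down into the database's query engine.

```python
from collections import Counter, defaultdict


def profile_predicates(quads):
    """Count how often each predicate occurs and which sources use it.

    quads: iterable of (source, subject, predicate, object) tuples.
    Returns a list of (predicate, count, sorted_sources), most frequent first.
    """
    counts = Counter()
    sources = defaultdict(set)
    for source, _subject, predicate, _obj in quads:
        counts[predicate] += 1
        sources[predicate].add(source)
    return [(p, n, sorted(sources[p])) for p, n in counts.most_common()]


# Hypothetical triples from two sources, each tagged with its origin.
quads = [
    ("crm", "person/1", "name", "Ada Lovelace"),
    ("crm", "person/2", "name", "Alan Turing"),
    ("hr", "emp/7", "full_name", "Ada Lovelace"),
    ("hr", "emp/7", "dob", "1815-12-10"),
]
profile = profile_predicates(quads)
```

A profile like this makes it visible that `name` and `full_name` come from different sources and may describe the same thing, which is the kind of question an analyst has to answer before the graph becomes useful.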

That will be the topic of my next two posts; I’ll provide a few tactics to get you going on any size dataset and then dive deeper into the graph. As a solution architect for Cray, I’m biased toward our customers’ uses of graph analytics, so I encourage you to question and test my assertions and then come to your own conclusions.




    DESi Benz says

    The author is assuming that subjective triple creation for mere ingestion is an accepted practice. It is prudent to have the triples result from an ontology or, at minimum, from a domain-relationship model. Short of this, what one has is a graph of data values; in other words, this is analogous to reverse engineering the “types” of raw materials used in an Asian curry without a handbook of spices as a reference.
    An ontology or a domain model to a large degree takes out the inherent subjectivity of the data curator. For an unknown type of dataset, triple generation is rather intra-record based; inter-record relationships are of course unknown and can hopefully be inferred from the contemplated graphing exercise of classification, clustering, associations, etc. Isn’t the output of such an exercise a relationship model, which can be further distilled into a domain model, which in turn can aid ontology construction? So, there is no getting away from turning the unknown into the known before any consumption.
