One trillion. That’s quite a large number. It’s 132 times the number of people on Earth; five times the number of tweets posted in a year on Twitter; the number of search engine queries Google serves in five months; and now equivalent to the number of triples we’ve loaded, inferenced and queried using our own Cray Graph Engine.
At last week’s Cray User Group (CUG) meeting, a paper submitted by a team of Cray engineers — Chris Rickett, Utz-Uwe Haus, Jim Maltby and Kristi Maschhoff — received the 2018 Best Paper Award for “Loading and Querying a Trillion RDF Triples with Cray Graph Engine on the Cray XC.”
I am a marketing guy, so I think of achievements in terms of “first, only or best.”
We know Cray isn’t your only option for graph database technology capable of loading a trillion-triple database. Take a look at the W3C wiki page on large triplestores, where Oracle and Franz Inc. have posted trillion-triple results; and Cambridge Semantics, where a similar benchmark was announced.
But I can say with pride that we are the best, and I’ll use the rest of this post to support that claim.
First, I’ll come out and admit that absent any widely recognized reporting standard — like those published by the Transaction Processing Performance Council — comparing benchmark results in the semantic graph field can be a bit of an “apples to oranges” comparison; there is no single agreed-upon measure of relative value. And the infrastructures themselves vary widely. Oracle’s results are measured using an integrated database appliance, Cambridge Semantics results are on a Google Cloud infrastructure, and our results are achieved using a combination of a Cray® XC™ series system and Lustre®-based ClusterStor™ storage.
Despite the differences in infrastructure, the results are impressive.
The results reported in the paper were made using the Lehigh University Benchmark (LUBM) for semantic web repositories. The LUBM benchmark has become a de facto standard for graph database benchmarks and describes the structure of a university with departments, courses and students. By using the 5500K scale, our team generated an artificial dataset representing 5,500,000 universities. I can’t help but note that the aforementioned results reported by Oracle and Cambridge Semantics used a smaller 4400K scale (meaning they approximated the data for 4,400,000 universities). Why the difference? Our inferencing engine is more efficient, so to get to the magic 1-trillion mark, we needed to start with a larger initial dataset.
So, let’s look at some of the good stuff.
• Data loading performance: 177 million quads loaded and indexed per second
• Inference performance: 501 million quads inferenced per second
• SPARQL query performance: 96.3 seconds to run all LUBM queries
Astute readers may notice the slower load time we report versus the load time reported by Cambridge Semantics. This is a perfect example of an apples-to-oranges comparison related to the benchmark system setup. In our case, the initial data generated by the lubm-uba data generator was written to a Lustre file system spread out over 96 files of 1.4 TB each. In the case of the Cambridge Semantics benchmark, the LUBM 4400K data was generated locally and stored on SSDs in each of the nodes, with loading time measured as the time to load from the local SSDs to memory simultaneously. We believe that a more realistic load time for Cambridge Semantics should have included the time required to copy the data to the SSDs in the first place.
But seriously … 96 seconds for all 14 LUBM queries is outstanding. This is nearly an order of magnitude faster than the Cambridge Semantics results!
And to put the trillion-triple dataset in perspective, the team also loaded a graph database that linked public datasets commonly used in life sciences and systems biology, where a typical workflow for researchers might be: perform searches in one of the databases, construct queries for another database, and iterate.
The union of these databases dwindles in comparison to a 1T triple dataset — 0.05T triples combined — but highlights the benefit of using large lab-local datasets. Fast CGE load times — highlighted in the paper — enable frequent updates of the working knowledge set many times a day. Furthermore, combining all those datasets (as named graphs) in one CGE instance makes complex cross-database queries feasible, reducing time to insight.
Behind the benchmark effort is a focused effort by the Cray R&D team to improve the overall performance of the Cray Graph Engine. Our next release — CGE 3.2UP01 — will include the enhancements that made these results possible. Learn more about the Cray Graph Engine on our website.
Again, congratulations to all involved for this achievement!