We’ve written a few posts about the Cray Graph Engine (CGE), a robust, scalable graph database solution. CGE is a graph database that uses Resource Description Framework (RDF) triples to represent the data, SPARQL as the query language and extensions to call upon a set of “classical” graph algorithms.
There are two main advantages of CGE. One is that it scales a lot better than most other graph databases — because the other ones weren’t designed by supercomputer wizards. (Of course I would say that.) The other advantage is that not only does CGE scale well, it performs unusually well on complex queries on large, complex graphs.
Typical of a lot of complex graph queries:
Where are all the places where the red pattern matches a pattern in the large graph?
We can use existing benchmarks to demonstrate CGE’s performance, including LUBM (the Lehigh University Benchmark), SP2B (the SPARQL Performance Benchmark) and BSBM (the Berlin SPARQL Benchmark). But the problem is that none of them really shows the power of CGE.
We’ve been working on a new extension to LUBM for RDF triples, called the LUBM Extended Benchmark, or LEBM. That’s right: LUBM and LEBM.
LUBM, published by Lehigh University computer scientists in 2005, synthesizes data for any number of universities specified by the user. It is primarily aimed at measuring ontology/inference capability. Fourteen SPARQL queries are specified for the benchmark. In terms of complex graph queries, though, LUBM doesn’t have many. The two most complicated ones look for a triangular relationship – for example, “show all the students who take a course taught by their faculty advisor.” Some of our competitors get bogged down even on these triangular relationships, and only attempt them on relatively small datasets. We can do a lot better, but we need a benchmark that lets us demonstrate that.
Accordingly, we developed LEBM with complicated graphs and graph queries in mind. Here’s how it works:
- We start by synthesizing the LUBM dataset for some number of universities. LUBM with two universities (which we abbreviate “LUBM2”) has about 240,000 triples in it – which you can think of as a graph with 240,000 edges, the links between vertices. LUBM100 has about 14 million triples, LUBM1000 has about 140 million, and so on.
- Then we run the LEBM synthesizer, which picks out all the students in the LUBM dataset and creates a Facebook-like “friends” network between them.
- This “friends” network connects students on the same campus relatively densely, but also connects them more sparsely with students at other universities. Buddies from high school, maybe.
- The number of “friends” each student has fits an exponential distribution. Intuitively, this means that most students have a number of friends that is close to the average, but a few “popular” students have a much larger number of friends. This is a trait common to many real-life “social networks” – graphs of people and the relationships between them: I have about 50 Facebook friends; Beyoncé has 65 million.
What we end up with, superimposed on top of the LUBM graph database, is a large, irregular, lumpy, intra- and inter-university graph of student friendships that is challenging to search and match against. We can ask more complicated questions against this database. One of my favorites is one we call the “Blackmail Query”: Show me the students and the professor where one student is taking a course from the professor and the student has a friend at another university, at which that professor did his/her undergraduate studies. This ends up defining a hexagonal relationship in the SPARQL query.
We’re finishing rewriting the LEBM synthesizer in Java (our first version of LEBM was written in a non-portable Cray C++ dialect), and I hope to make it publicly available in the near future. Along with the LEBM synthesizer, we plan to publish some benchmark queries – Blackmail and others – and CGE times on them.