Let’s talk about graph databases. Some industry watchers claim that they are the fastest-growing type of database. If so, maybe it’s useful to know more about them.
Starting with the basics: What is a graph database, and what is it useful for?
Here’s the short answer. Graph databases store data in vertices and edges rather than in the tables found in relational databases. They are the most efficient way of looking for relationships between data items, patterns of relationships or interactions between multiple data items. Traditional relational databases, by contrast, shine at queries looking for information about some item, or sums or averages of many items of the same type of information.
Now let’s review what a graph database isn’t. The standard type of database is the relational database – the kind built with database management systems sold by Oracle, IBM, Microsoft and others. You can think of a relational database as made up of several tables, rectangular grids of information, each one looking much like a spreadsheet. Each table can have a different number of rows and columns, and hold different types of information. For example, a snippet of a company’s employee database might hold data like this:
Another table in the same database might hold information about managers:
A graph database, at least conceptually, stores its data in a different structure: a directed graph. Directed graphs are made up of bubbles and arrows, as in this diagram:
The bubbles are called “vertices” and the arrows are called “edges.”
Data items stored in one of the fields of a relational table are, in a graph database, stored in a vertex of the graph. Data descriptors, for example “Department managed” or “Reports to” in the table just above, are stored with edges in the graph. For example, if we took our management table above and represented it in a graph database, it might look like this:
Each data item occurs only once in the graph. There is a unique “Brenda Roberts” vertex, for example. In the type of graph database Cray uses, also called a “semantic” database, each field of the relational database corresponds to a simple subject-verb-object triple in the graph: “Jack Jones” “Reports to” “Brenda Roberts.” I threw in a little additional information that’s not in the relational table (that Brenda Roberts manages the Accounting department) just to show that each vertex may be the “subject” of some triples (“Brenda Roberts” “Department managed” “Accounting”) and the “object” of others (“Jack Jones” “Reports to” “Brenda Roberts”).
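To make the triple idea concrete, here is a minimal Python sketch. It is not how Cray's database (or any real graph database) stores data; it just holds the two triples mentioned above in a set and shows the same vertex acting as subject in one and object in the other. The helper names `as_subject` and `as_object` are made up for illustration.

```python
# Each fact is one (subject, predicate, object) triple, i.e. one edge in the graph.
# The two triples come from the example above; the storage scheme is an assumption.
triples = {
    ("Jack Jones", "Reports to", "Brenda Roberts"),
    ("Brenda Roberts", "Department managed", "Accounting"),
}


def as_subject(vertex):
    """Return the triples in which the vertex is the subject (outgoing edges)."""
    return sorted(t for t in triples if t[0] == vertex)


def as_object(vertex):
    """Return the triples in which the vertex is the object (incoming edges)."""
    return sorted(t for t in triples if t[2] == vertex)


# The single "Brenda Roberts" vertex appears on both ends of edges.
print(as_subject("Brenda Roberts"))
print(as_object("Brenda Roberts"))
```

Because the vertex is identified by its value, there is no second copy of “Brenda Roberts” to keep in sync.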
So the big difference between relational databases and graph databases is how they represent data. Interestingly, the query languages used for each aren’t all that different. SPARQL, a standard query language for graph databases, looks a lot like SQL, the established standard for relational databases. Now, on to the important question: What is each of them good for?
Relational databases are great. After 40-some years of development and refinement, they are reliable, powerful and capable. They can hold huge amounts of data. Some of them can be updated thousands of times per second. If you want to query a relational database about sales, sales per product or sales per product per region, you’re in good shape. Any time you’re looking for information about some item, or sums or averages of many items of the same type of information, you’ll get the answer back quickly.
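For a feel of the kind of query a relational database handles comfortably, here is a small sketch using Python's built-in `sqlite3` module. The `sales` table, its columns and the sample rows are all invented for this example; only the "sales per product per region" question comes from the paragraph above.

```python
import sqlite3

# In-memory database with a hypothetical sales table (schema and rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Widget", "East", 100.0),
        ("Widget", "East", 50.0),
        ("Widget", "West", 75.0),
        ("Gadget", "East", 20.0),
    ],
)


def sales_per_product_per_region(connection):
    """Sum the sale amounts for every (product, region) pair."""
    rows = connection.execute(
        "SELECT product, region, SUM(amount) FROM sales "
        "GROUP BY product, region ORDER BY product, region"
    )
    return rows.fetchall()


for product, region, total in sales_per_product_per_region(conn):
    print(product, region, total)
```

Every row is the same type of information, and the answer is a sum over many of them, which is exactly the shape of question tables are built for.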
What are relational databases not good at? They fall down when you’re looking for relationships between data items, patterns of relationships or interactions between multiple data items. Let’s contrast two queries: first, suppose you wrote a query that amounted to
show all the employees who work in our Houston store
A relational database would scan through the employees table, looking for matches to “Houston” in the “location” field in the table. You’d get the answer back in milliseconds. Now, instead, suppose you asked
show all of the employees whose management chain includes the person who taught their new-hire-orientation class
The employee’s date-of-hire might be a field in the same employees table, but the new-hire-orientation class rosters are probably in another table. So the query has to take every employee and search against the class rosters to find which new-hire-orientation class they attended. Let’s assume that this points us to the instructor of that class. Now we have to do a really elaborate search. Is that orientation instructor the employee’s supervisor? No? Well, is the orientation instructor that supervisor’s supervisor? And on and on we crawl up the employee’s management chain. Either of two things will happen: The relational database will take minutes, even hours, to get all the answers – or the database system will run out of resources and fail to return any answers. Graph databases, on the other hand, can return answers to this second query in milliseconds. Because they are built to search through graphs, they can traverse up through a management chain many times faster than a relational database could – if it succeeded at all.
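The crawl up the management chain can be sketched in a few lines of Python. This is only an illustration of the traversal, not a benchmark and not how a real graph database executes it. Apart from Jack Jones and Brenda Roberts, every name here is hypothetical, as are the two dictionaries standing in for “Reports to” edges and orientation-class instructors.

```python
# "Reports to" edges: employee -> direct supervisor (hypothetical data).
reports_to = {
    "Jack Jones": "Brenda Roberts",
    "Brenda Roberts": "Carla Diaz",
    "Sam Lee": "Carla Diaz",
}

# Who taught each employee's new-hire-orientation class (hypothetical data).
orientation_instructor = {
    "Jack Jones": "Carla Diaz",
    "Sam Lee": "Brenda Roberts",
}


def management_chain(employee):
    """Yield each manager above the employee by following 'Reports to' edges."""
    seen = set()  # guards against accidental cycles in the data
    manager = reports_to.get(employee)
    while manager is not None and manager not in seen:
        seen.add(manager)
        yield manager
        manager = reports_to.get(manager)


def taught_by_someone_in_chain():
    """Employees whose orientation instructor appears in their management chain."""
    return sorted(
        employee
        for employee, instructor in orientation_instructor.items()
        if instructor in management_chain(employee)
    )


print(taught_by_someone_in_chain())
```

Each step of the loop is one hop along an edge. In a relational system the same hop is typically another join of the employees table against itself, and the number of hops is not known in advance.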
You might infer that, since SPARQL looks a lot like SQL, it can probably handle a lot of the same types of queries that relational databases are good at. You’d be right. The graph database might not be as fast, but it’ll do OK. But it really shines on those complex queries that the relational database can’t handle well, if at all.
What kinds of applications can make good use of a graph database? Applications where it’s useful to find patterns of relationships between data items. This is not every database application – it’s no accident that relational databases are so popular – but there are many significant graph applications that cannot be analyzed accurately and efficiently via relational databases:
- In many intelligence and law enforcement applications, it’s important to look for a pattern of events. Any one of the events may look innocuous, but the view of all of them together, and how they are directly or indirectly related to each other, is ominous. Chief Bad Guy sends an email to Bad Guy A, makes a phone call to Bad Guy B and sends a courier message to Bad Financier. Bad Guy A takes a train to Berlin. Bad Guy B takes a flight to Berlin. Bad Financier wires money to Bad Guy C. Bad Guy C lives in Berlin. Bad Guys A and B take a flight from Berlin to Atlanta.
- Similarly, investment banks guarding against insider trading have to look for a suspicious pattern of actions, not necessarily any single action. Investment Banker A gets insider information about Stock S. Banker A sends email to Joe IT, an information technologist at the same bank. Joe IT phones Banker A. At close of business, Banker A and Joe IT badge out within seconds of each other. That night, Joe IT makes a purchase of Stock S.
- The bioinformatics research community has largely adopted graph databases and the SPARQL query language. They are a natural fit for the huge network of relationships between all the chemicals present in the human body. One of our bioinformatics customers traced a chain of “this interacts with this” relationships from a drug designed to combat AIDS, via various proteins, human cells and other molecules, out to a discovery that the same drug might be effective against breast cancer.
- Many vendors of consumer products have become interested in social network analysis (SNA). This has to do with constructing a graph of relationships between people. Facebook is an example of a social network, with its links between people and their friends. Graph searches might reveal which person is probably influential over his/her friends, which groups of friends share a common interest, and so on – which could lead to some very sophisticated, highly targeted marketing strategies.
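As a rough sketch of what “looking for a pattern” means, the insider-trading example from the list above can be written as a handful of triples plus a search for one specific shape: someone with insider information on a stock emailed a person who then purchased that stock. The predicate names, the extra “Banker B” row and the omission of timestamps are simplifying assumptions made for this illustration.

```python
# Events as (subject, predicate, object) triples. "Banker B" is a made-up,
# innocuous extra so the search has something to ignore.
events = [
    ("Banker A", "has insider information on", "Stock S"),
    ("Banker A", "emailed", "Joe IT"),
    ("Joe IT", "phoned", "Banker A"),
    ("Joe IT", "purchased", "Stock S"),
    ("Banker B", "emailed", "Joe IT"),
]


def objects(subject, predicate):
    """All vertices reached from the subject along edges with this predicate."""
    return {o for s, p, o in events if s == subject and p == predicate}


def suspicious_patterns():
    """Find (insider, contact, stock) where the insider emailed a buyer of the stock."""
    matches = []
    for insider, predicate, stock in events:
        if predicate != "has insider information on":
            continue
        for contact in objects(insider, "emailed"):
            if stock in objects(contact, "purchased"):
                matches.append((insider, contact, stock))
    return sorted(matches)


print(suspicious_patterns())
```

No single triple is suspicious on its own; the match only appears when three edges line up around the same vertices.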
Okay, that’s a quick look at why graph databases are worth considering. It comes down to this: graph databases can answer complex questions about relationships that relational databases can’t handle well, if at all. A truly sophisticated, effective analytics environment will include both relational and graph databases.