During the holidays, those of us who are interested in data may have found ourselves spending an inordinate amount of time thinking about the staggering number of packages that get delivered all over the place every day, and just how they get moved around. It turns out that one of the things the shipping industry has done to make everything much more efficient is containerization: transporting goods in special containers of standardized dimensions called “intermodal containers,” or simply “shipping containers.” These containers can be loaded, stacked, transported over long distances using various transportation modes (ships, trains, semitrailer trucks, etc.), and unloaded efficiently, with a high degree of automation and without ever being opened.
The standard dimensions of shipping containers greatly decreased the cost and complexity of shipping goods, because shippers no longer had to concern themselves with the challenges of “break bulk cargo,” those goods that must be loaded and transported individually in special ways according to their types (bags, bales, barrels, boxes, etc.) and sizes. Before containerization, a ship bringing goods to a port for trucking to their destination would typically be offloaded onto the dock and then moved into a warehouse, after which the goods would be loaded from the warehouse onto trucks. With containerization, however, containers can be moved directly from the ship to a truck without ever being opened, allowing the ship to spend less time in port and reducing the movement of goods at the port. Shippers can stop worrying about the containers and focus on what’s in those containers.
I was at the Big DiP USA 2014 in September when my friend Tom Plasterer of AstraZeneca remarked to me, “I’m tired of hearing people talking about containers for data.” Tom was absolutely right: when it comes to data, we are still very much in the “pre-containerization” days. We haven’t standardized the way we represent data for transport, integration, and use on computers, so we’re forced to think and talk about our data containers every time we interact with our data. Every unique relational database schema is a container for data, one that is essentially incompatible with every other container for data, and so every relational schema requires special handling from start to finish. That makes the data in the databases defined by those schemas “break bulk” data, similar to “break bulk” cargo, with all of the inefficiencies that go along with that way of working with data. We need to containerize our data by setting and using standards for knowledge representation, so that we can stop thinking about data containers and focus on what those data are and how we can use them to accomplish our goals.
As in the early days of containerization in the shipping industry, today we use multiple knowledge representation standards. One of these is expressive, powerful and built for the World Wide Web — the Resource Description Framework (RDF). RDF provides a uniform way to represent knowledge so applications and their users can focus on the data, not the container for the data. RDF models data in the form of a directed, labeled graph.
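To make the graph model concrete, here is a minimal sketch in plain Python (no RDF library; the identifiers and data are invented for illustration) of the core idea: every RDF statement is a subject–predicate–object triple, i.e., a labeled edge in a directed graph, and any collection of triples can be stored and queried the same uniform way:

```python
# Each RDF statement is a triple: (subject, predicate, object).
# Subjects and predicates are identifiers (IRIs in real RDF);
# objects are identifiers or literal values. Names below are hypothetical.
triples = {
    ("ex:container42", "rdf:type",     "ex:ShippingContainer"),
    ("ex:container42", "ex:lengthFt",  "40"),
    ("ex:container42", "ex:carriedBy", "ex:ship7"),
    ("ex:ship7",       "rdf:type",     "ex:ContainerShip"),
}

def objects(subject, predicate):
    """Follow every edge labeled `predicate` out of `subject`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("ex:container42", "ex:carriedBy"))  # {'ex:ship7'}
```

The point of the uniformity: the query function never changes when the data changes, because there is only one container shape — the triple.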
Graphs are a powerful way to represent knowledge, but because they are highly connected data structures and it is hard to predict which part of a graph any particular search step will touch, they often cause performance problems on conventional computing architectures. Cray has built a supercomputer for graph analytics at scale to help users take advantage of the flexibility and efficiency of containerized data. The Urika-GD™ graph discovery appliance makes graph computation much faster by pairing a very large, globally shared memory with an array of massively multithreaded processors, allowing users to easily query graphs with tens of billions of edges.
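Even a toy traversal shows why graph queries stress conventional memory hierarchies: each hop is a dependent lookup whose target cannot be known until the previous lookup completes. The sketch below (a hypothetical adjacency-set breadth-first search, not the Urika-GD's actual execution model) is exactly this kind of data-dependent pointer chasing:

```python
from collections import deque

# Directed edges of a small graph (invented data).
edges = {
    "a": {"b", "c"},
    "b": {"d"},
    "c": {"d", "e"},
    "d": set(),
    "e": {"a"},
}

def reachable(start):
    """Breadth-first search from `start`. Each step reads a
    neighbor set whose location is data-dependent -- the access
    pattern that defeats caches and prefetchers on large graphs."""
    seen = {start}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

print(sorted(reachable("a")))  # ['a', 'b', 'c', 'd', 'e']
```

On a graph with billions of edges, these irregular reads scatter across the whole structure, which is why a large globally shared memory and many concurrent threads help so much.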
Let’s stop spending so much energy on containers for data. Let’s choose the right knowledge representation so we can load, unload and transport our data without worrying about its containers, so that we can focus on the data itself.