Machine learning refers to a broad spectrum of analysis techniques used to gain value from large data sets. It is used for projections (in sports, politics, and investment) as well as for business decisions like search engine optimization (SEO).
Machine learning algorithms, and the processors that implement them, typically work on tabular data, finding correlations and patterns in the tables. But a lot of the data in the real world is more naturally represented as a graph, in which one entity is related to other entities in various ways. For example, social network data records relationships between people: they can be related by being members of the same club, liking the same music, or having purchased the same item. A graph, which represents relationships between things directly, captures this kind of data naturally.
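As a concrete illustration, here is a minimal sketch of social-network data held as a graph of edges rather than a table. All of the names and relationship types are invented for the example:

```python
# A made-up social graph: each edge is (person, relationship, entity).
from collections import defaultdict

edges = [
    ("alice", "member_of", "chess_club"),
    ("bob",   "member_of", "chess_club"),
    ("alice", "likes",     "jazz"),
    ("carol", "likes",     "jazz"),
    ("bob",   "purchased", "headphones"),
    ("carol", "purchased", "headphones"),
]

# Index the graph by person so we can ask how two people are related.
adjacency = defaultdict(set)
for subject, relation, obj in edges:
    adjacency[subject].add((relation, obj))

def shared_connections(a, b):
    """Return the (relationship, entity) pairs two people have in common."""
    return adjacency[a] & adjacency[b]

print(shared_connections("alice", "bob"))   # {('member_of', 'chess_club')}
print(shared_connections("bob", "carol"))   # {('purchased', 'headphones')}
```

Note that the same query works no matter which kind of relationship connects two people; a fixed table schema would need a column for every possible connection in advance.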
The flexibility of a graph database means that we can relate entities to one another in a huge number of ways—even combinatorial ways. In the social network example, if a person belongs to five clubs, that is five different ways they can be related to other people. This richness of data is a great boon to SEO and forecasting; we can take advantage of subtle nuances in the relationships between entities.
But this power and flexibility come at a price from a machine learning point of view. A simplified way to think about machine learning is that you are looking at a large number of example data points, and there is some aspect of them that you want to predict: in SEO, whether a person is interested in some product; in political forecasting, whether a candidate or issue will be popular with the electorate; in sports, whether an individual or team will perform well in some circumstance. We can view this simplified notion of data as a plot where the data points are displayed as pluses or minuses; the pluses are data points that had a positive outcome, the minuses those that had a negative outcome.
The goal of a machine learning algorithm is to draw a line that does a good job of discriminating between the pluses and the minuses; the various algorithms differ in the sorts of lines they can draw and in their measure of “goodness,” usually based on how many points fall on the wrong side of the line and how far they sit from it. More data points are better for this sort of machine learning; the denser the points, the more information the algorithm has to work from in finding a good fit for the discrimination. It is pretty easy to saturate a 2-dimensional diagram like figure 1 with enough points to drive a good machine learning algorithm.
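The line-drawing idea can be made concrete with one of the simplest such algorithms, a perceptron, trained on a tiny invented 2-dimensional data set. This is a sketch for illustration only; real algorithms and data sets are far richer:

```python
# Points are (x, y, label): +1 for a plus, -1 for a minus (made-up data).
points = [
    (2.0, 3.0, +1), (3.0, 3.5, +1), (2.5, 4.0, +1),
    (0.5, 0.5, -1), (1.0, 0.2, -1), (0.2, 1.0, -1),
]

# The line is w0 + w1*x + w2*y = 0; start with all-zero weights.
w = [0.0, 0.0, 0.0]
for _ in range(20):                      # a few passes over the data
    for x, y, label in points:
        score = w[0] + w[1] * x + w[2] * y
        if score * label <= 0:           # misclassified: nudge the line
            w[0] += label
            w[1] += label * x
            w[2] += label * y

# Every point should now fall on its own side of the line.
errors = sum((w[0] + w[1]*x + w[2]*y) * label <= 0 for x, y, label in points)
print("misclassified:", errors)          # misclassified: 0
```

Because this toy data is linearly separable, the perceptron finds a perfect discriminating line in a couple of passes; the “goodness” measures mentioned above generalize this to data no single line can separate cleanly.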
Now imagine figure 1 if it were in three dimensions. The same number of points would be spread out in space. It would take a lot more points to determine a good boundary between the pluses and the minuses. Increasing dimensionality even a little increases the amount of data needed by a lot. In some sense, even very large data sets are too small to provide high density, when the dimensionality is high, i.e., when there are a lot of different attributes.
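The arithmetic behind this is stark. If we divide each axis into just ten bins and ask how many data points land in each cell, a fixed data budget thins out exponentially as dimensions are added (the numbers here are arbitrary, chosen only to show the trend):

```python
# How a fixed data budget spreads out as dimensionality grows.
points = 1_000_000          # a "large" data set, by everyday standards
bins_per_axis = 10          # a coarse grid: ten bins along each axis

for dims in (2, 3, 6, 10):
    cells = bins_per_axis ** dims
    print(f"{dims:>2} dims: {cells:>14,} cells, "
          f"~{points / cells:g} points per cell")
```

With two dimensions there are ten thousand points per cell; by ten dimensions there are ten billion cells and the average cell is empty. This is why even very large data sets are “too small” in high dimensions.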
In graph data, each new attribute you measure about an entity increases the dimensionality of the learning space. Even if we can’t imagine what it looks like, machine learning algorithms are capable of finding discriminants in spaces with many more than three dimensions—dozens or hundreds of dimensions are commonplace. If you consider a rich graph data set, you could be dealing with hundreds or thousands of dimensions.
How can we deal with this situation? One solution would be to ignore the richness of the data and do our learning on a set with reduced dimensionality. But this would leave behind many of the insights that are available in the data. A better solution is to recognize that the machine learning algorithms are only part of the story: finding a low-dimensionality, high-density subset of a rich dataset is a necessary precursor to running a machine learning algorithm successfully.
This is where a graph database comes into play. A graph database has no trouble representing all the subtlety of how entities can relate to one another. Hidden in that graph—which would have thousands of dimensions if viewed as a simple table—is some smaller graph, a projection onto a manageable number of dimensions, that can provide the predictive power that machine learning algorithms exploit.
There isn’t a magic bullet here; there is still a lot of analysis and luck involved in finding the right projection, but a graph data appliance provides the tool that is needed at this stage. You can see an example of this principle at work in a recent application of the Urika appliance, in which highly multi-dimensional baseball data was distilled down to a manageable size to provide advice about recruitment and line-ups.
These observations suggest a broader view of machine learning, in which data is collected from the real world in a natural, graph-based way. This data is managed in a graph data store, from which low-dimensional, tabular data sets can be extracted. Conventional machine learning algorithms are applied to these data sets, providing projections and classifications that are of interest to the business. The key to this cycle is iteration; you won’t know the real value of a low-dimensional projection until you actually run the machine learning and test the outcome. This means that you need to be able to generate a projection quickly, test it, and try again. This is exactly what a graph data appliance provides: a way to manage complex queries against very large data sets. This does not replace conventional machine learning; it enhances it, giving us an opportunity to exploit all the nuances of the complex data in the real world.
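The extraction step of that cycle can be sketched in a few lines. Here a small invented graph is projected down to a two-column table (club count and purchase count per person) that a conventional learner could consume; in practice, choosing which columns to project onto is exactly the part that takes iteration:

```python
# Projecting a graph of (person, relationship, entity) edges down to a
# low-dimensional table. Relationships and feature columns are made up.
from collections import defaultdict

edges = [
    ("alice", "member_of", "chess_club"),
    ("alice", "member_of", "book_club"),
    ("alice", "purchased", "headphones"),
    ("bob",   "member_of", "chess_club"),
    ("bob",   "purchased", "headphones"),
    ("bob",   "purchased", "keyboard"),
    ("carol", "likes",     "jazz"),
]

# Two chosen feature columns per person: club count and purchase count.
counts = defaultdict(lambda: {"clubs": 0, "purchases": 0})
for person, relation, _ in edges:
    row = counts[person]            # ensure every person gets a row
    if relation == "member_of":
        row["clubs"] += 1
    elif relation == "purchased":
        row["purchases"] += 1

table = [(p, c["clubs"], c["purchases"]) for p, c in sorted(counts.items())]
for row in table:
    print(row)
# ('alice', 2, 1)
# ('bob', 1, 2)
# ('carol', 0, 0)
```

Every relationship not chosen for a column (here, “likes”) is left behind by the projection, which is why testing the result and trying a different projection is essential.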
Dean Allemang, co-author of the bestselling book, Semantic Web for the Working Ontologist, is a consultant, thought-leader, and entrepreneur focusing on industrial applications of distributed data technology. He served nearly a decade as Chief Scientist at TopQuadrant, the world’s leading provider of Semantic Web development tools, producing enterprise solutions for a variety of industries. As part of his drive to see the Semantic Web become an industrial success, he is particularly interested in innovations that move forward the state of the art in distributed data technology. Dean’s current work is concentrated on the life sciences and finance industries, where he currently sees the most promising industrial interest in this technology.