Facing the Challenges of Big Data

Big Data is all the rage, especially since it received a new shot in the arm from a recent Gartner report.  Everyone is interested in it, and Gartner [1] has thrown its hat into the ring with a definition of it.  But even so, many people use these words in different ways.  What is Big Data, and why all the interest in this topic now?

For many people, Big Data is synonymous with search optimization (for many people, all of data science is synonymous with search optimization).  For commercial websites, visitor behavior equals revenue.  Understanding that behavior is essential for growing a web-based business and staying competitive.  So we gather data about our visitors and their habits, and try to glean some understanding of what does and doesn’t work to turn visits into value.

But for others, Big Data comes from outside a single website.  Social networking sites provide access to huge amounts of data about user behavior.  If we could access and utilize this data, we could understand users who haven’t even visited our websites.

You might not know it if you just read articles about SEO, but there actually is a world outside of web commerce, and it is a big world.  Actually, several big worlds: there is a world of Finance, of Science, of Manufacturing, and these worlds are also data intensive.  Detailed financial behavior has a huge impact on financial indicators, and hence provides a huge opportunity for data exploitation [2].

What do all these things have in common?  All of them are spaces that are dominated by Big Data.

While there are some specific technologies that are often associated with Big Data, Big Data itself isn’t a technology; it is a description of a set of challenges in today’s world that are faced in these and many other domains.  What are the recurring aspects of a Big Data challenge?

1) Large amounts of data.  All of the examples I mentioned above—web traffic, social networks, financial transactions, and scientific measurements—are producing data at a faster rate than has ever been imagined before.  This has resulted in an apparent explosion of available data.  This driving force has given Big Data its name.

2) Complexity of data, and of the questions we need to ask.  User behavior, market analysis, scientific conclusions: gaining insight into any of these things requires creativity and a deep understanding of the complex interplay of many factors.  Simple data analysis was sufficient in the olden days.  Now we need complex answers to complex questions to understand the world and be competitive.  In many cases, we need those answers fast, sometimes lightning fast [2].

3) Heterogeneity of data.  In the case of gathering data from a single website, the data might come from a single source.  But in most of the cases I mentioned before, data will come from many sources.  Social networks track behavior across multiple sites and applications.  Financial data includes market intelligence, as well as massive transaction data.  Contributions to the world’s collection of scientific data come from all over the globe.

These challenges have caused us to move our thinking about how to manage data away from the tried-and-true methods that have held sway in enterprises for the past 30 years.  During that time, data management was done largely within a single organization (mostly homogeneous) and the data structure was largely understood (going all the way back to the “Master Data Record” that many businesses had for a long time).

Ironically, Challenge #1 (large amounts of data) is the one from which the Big Data movement gets its name, but is also the only one that has been addressed systematically by data management technology for decades.  The scale of relational database systems has improved steadily over the years; the ability of these systems to handle complex and heterogeneous data has not been as much a focus of development.

Most new technologies that have been developed to address the Big Data challenges have begun from a vantage point of Challenge #1. From there, they have taken different approaches to addressing challenges #2 and #3.

The World Wide Web Consortium has come at this problem from another angle. Starting naturally enough (for the W3C) with distribution of data as their point of focus, they developed RDF, a framework for managing data resources distributed over the web, and SPARQL, a powerful query language for RDF.

RDF achieves its data distribution goals by representing data as a graph; SPARQL provides a powerful way to manage the complexity that results from combining distributed data sets. Many critics have doubted whether such an approach, driven primarily as it is by data distribution and complexity concerns, can be further developed to address the scale issues of Challenge #1.
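To make the graph idea concrete, here is a minimal sketch in plain Python.  It is not an RDF library and not real SPARQL; the `ex:` identifiers, the two sample data sets, and the helper functions `match` and `query` are all hypothetical, invented only for illustration.  It shows why graph-shaped data from separate sources is easy to combine (merging is just a set union of triples) and how a query can join patterns across the combined graph, roughly in the spirit of a SPARQL graph pattern.

```python
# Hypothetical data from two independent sources.
# Each fact is an RDF-style triple: (subject, predicate, object).
site_data = {
    ("ex:alice", "ex:visited", "ex:productPage"),
    ("ex:bob", "ex:visited", "ex:homePage"),
}
social_data = {
    ("ex:alice", "ex:follows", "ex:bob"),
    ("ex:bob", "ex:follows", "ex:carol"),
}

# Both sources are graphs over shared identifiers, so merging is a set union.
merged = site_data | social_data


def match(graph, pattern, binding=None):
    """Yield bindings for one triple pattern; terms starting with '?' are variables."""
    binding = binding or {}
    for triple in graph:
        candidate = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield candidate


def query(graph, patterns):
    """Join several triple patterns, extending the bindings one pattern at a time."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(graph, pattern, b)]
    return bindings


# Who follows someone, and which page did that follower visit?
results = query(merged, [
    ("?person", "ex:follows", "?friend"),
    ("?person", "ex:visited", "?page"),
])
for row in results:
    print(row["?person"], "follows", row["?friend"], "and visited", row["?page"])
```

Neither source alone can answer this question; the answer only appears once the two graphs are combined.  A real RDF database and SPARQL engine would add global identifiers (IRIs), indexing, and a far richer query language on top of this basic idea.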

Just because an RDF database focuses on the complexity and diversity issues of the Big Data challenge doesn’t mean it can’t deal with large data sets as well.  Many RDF databases and SPARQL engines today are able to scale to very large sizes.

Cray’s Urika™ technology is a good example.  Urika™ directly addresses all three challenges of Big Data.  As an RDF database, it excels in diversity and distribution of data.  Its highly parallelized architecture lets it achieve fast response times even for complex queries over large data sets.

With the advent of high-performance RDF databases, these W3C technologies have become key players in Big Data technology.


[1] http://www.gartner.com/newsroom/id/2359715
[2] http://www.thedailybeast.com/newsweek/2013/01/04/eunuchs-of-the-universe-tom-wolfe-on-wall-street-today.html

Dean Allemang, co-author of the bestselling book, Semantic Web for the Working Ontologist, is a consultant, thought-leader, and entrepreneur focusing on industrial applications of distributed data technology. He served nearly a decade as Chief Scientist at TopQuadrant, the world’s leading provider of Semantic Web development tools, producing enterprise solutions for a variety of industries. As part of his drive to see the Semantic Web become an industrial success, he is particularly interested in innovations that move forward the state of the art in distributed data technology. Dean’s current work is concentrated on the life sciences and finance industries, where he currently sees the most promising industrial interest in this technology.

Dean Allemang
