Practical SPARQL Benchmarking

There is a persistent but misguided belief in the market that Semantic Web technologies simply aren't performant enough for the needs of a business, and I often hear this presented as a reason for choosing a traditional RDBMS or another technology such as a NoSQL solution instead.

While there is some historical truth to these claims, since these technologies are still relatively new, a slew of performant, production-ready systems are now arriving on the market from both commercial vendors and open source projects, targeted at a variety of levels of scale.  We ourselves at Cray are building the uRiKA graph appliance, which seriously pushes the boundaries of performance and scalability.

With an increasing number of products to choose from, how does a business decide which one is appropriate for its problem?

Typically people evaluate their options based on vendor-published benchmarks, but as I highlighted in my recent SemTechBiz [1] talk [2], there are several issues with this approach.  Firstly, the standard benchmarks each exercise stores in very different ways, which may bear little or no resemblance to how you will actually use the product to solve a business problem.  Secondly, vendors can be somewhat less than transparent about their methodologies and test environments.  Thirdly, most benchmarks focus purely on speed and throughput.

Often a user is not interested in how fast a system answers their query but rather in whether it can answer the query at all.  From a user's perspective, a slower system that answers a query is likely preferable to a faster system that fails on it.  Ultimately a user must judge a system on whether it solves their business problem, not on some benchmark that bears no resemblance to that problem.

To try to address these problems, I presented a tool at SemTechBiz called SPARQL Query Benchmarker [3], which was developed internally here at Cray for the purpose of running standardized, repeatable benchmarks for performance and regression testing.  We found the tool so useful that we've made it open source and available to the community, in the hope of promoting more transparency and repeatability in benchmarking.

The key features of this tool are a command-line interface and an API that allow you to run any set of queries against any SPARQL endpoint.  This gives users the means to gauge how a system performs on their data, with their queries, on their hardware, and allows them to make an informed decision about which system is performant enough to solve their problem.
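To illustrate the kind of measurement the tool automates, here is a minimal Python sketch that times a small set of queries against an endpoint over its standard HTTP protocol.  The endpoint URL and queries below are placeholders, and this is purely illustrative of the idea, not the benchmarker's own API; the real tool does considerably more than this.

    # Minimal sketch: time a set of SPARQL queries against an endpoint.
    # The endpoint URL and queries are placeholders for your own.
    import time
    import urllib.parse
    import urllib.request

    ENDPOINT = "http://localhost:3030/dataset/sparql"  # hypothetical endpoint
    QUERIES = [
        "SELECT * WHERE { ?s ?p ?o } LIMIT 100",
        "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }",
    ]
    RUNS = 5  # repeat each query to smooth out run-to-run variance

    for query in QUERIES:
        url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
        request = urllib.request.Request(
            url, headers={"Accept": "application/sparql-results+json"})
        timings = []
        for _ in range(RUNS):
            start = time.perf_counter()
            with urllib.request.urlopen(request) as response:
                response.read()  # count full response transfer in the timing
            timings.append(time.perf_counter() - start)
        print(f"{query[:40]!r}: avg {sum(timings) / RUNS:.3f}s over {RUNS} runs")

Even a crude loop like this captures the point: what matters is how your queries behave on your data and your hardware, measured consistently enough that runs can be compared over time.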

References

[1] http://semtechbizsf2012.semanticweb.com/
[2] Presentation slides: Practical SPARQL Benchmarking
[3] https://sourceforge.net/projects/sparql-query-bm/

Rob Vesse
