Tuning SPARQL Queries for Performance

There is often a perception that RDF/SPARQL-based systems do not perform well compared to traditional RDBMS-based systems. Personally, I think this is a spurious and inaccurate portrayal of the technology for a couple of reasons:

1. Often people try to compare performance of the two styles of system for the same task; this is essentially an apples-to-oranges comparison. The two systems are suited to answering very different styles of questions. Do you compare your document database or key-value store directly to an RDBMS? So why do people try to do the same for RDF/SPARQL?

2. RDF/SPARQL systems are by definition less mature; RDBMS have had 40 years of R&D to get to where they are today while the oldest RDF/SPARQL systems have around 10 years of R&D.

With regard to #2, I should make it clear that less mature doesn’t mean less performant. There are RDF/SPARQL systems out there today (both free and commercial) which deliver excellent performance on SPARQL.

However, one SPARQL performance problem I frequently see is that, because the technology is relatively new, users don’t necessarily understand how to write good SPARQL queries in the way that an experienced SQL developer understands how to write good SQL. In lots of cases this doesn’t matter, because SPARQL optimisers and engines tend to handle whatever the user throws at them relatively well, but there are a few cases where users could potentially make their queries run much faster, and I’d like to share some of these.

Value Equality vs. Term Equality

Now a lot of users of SPARQL probably don’t even know what I mean by the above, but I would give you good odds that most people are using the former without realizing it or knowing that the latter could actually be more performant in many scenarios.

So to illustrate this, let’s show you the difference in syntax. First, let’s see value equality:

FILTER(?x = 1)

And now let’s look at the corresponding term equality form:

FILTER(SAMETERM(?x, 1))

The syntactic difference is obvious: value equality uses the = operator while term equality uses the SAMETERM function. The latter is likely more performant, but it has a drawback: it tests only term equality, which returns true only if the RDF terms are identical. So if the RDF term in the database was encoded as “001”^^xsd:integer, term equality would give false whereas value equality would return true, because the values of the terms are equal.
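As a rough sketch of the difference, the query below evaluates both comparisons over a few inline terms. The VALUES data and the variable names ?valueEqual and ?termEqual are illustrative assumptions, not part of any real dataset. Going by the SPARQL specification, the “001”^^xsd:integer row should pass the = test but fail SAMETERM, though an engine that canonicalises literals while parsing may report otherwise.

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?x ?valueEqual ?termEqual
WHERE
{
  # Three spellings of the integer one (illustrative inline data)
  VALUES ?x { 1 "1"^^xsd:integer "001"^^xsd:integer }
  # Value equality: compares the numeric values
  BIND(?x = 1 AS ?valueEqual)
  # Term equality: compares the RDF terms themselves
  BIND(SAMETERM(?x, 1) AS ?termEqual)
}
```

Running this against your own engine is a quick way to check whether it normalises literals for you.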

As a general rule, use term equality wherever the value of the term is not an issue, such as when you are matching URIs or non-value typed literals (e.g. plain literals, xsd:string, literals with language tags). Also if you know the RDF terms in your database are consistently encoded (or your database does this for you), then always use term equality unless value equality is more appropriate for your SPARQL query.

Note that many SPARQL engines will be clever enough to turn value equality into term equality when possible, but not all will, so if you have a slow-running query that uses value equality, try switching to term equality and see if that improves things.

Using overly broad FILTERs or FILTERs which can be changed into triple patterns

Sometimes I see and cringe at queries like the following:

SELECT *
WHERE
{
?s ?p ?o .
FILTER(?o = <http://example.org/Class>)
}

This filter is overly broad, in that it requires the SPARQL engine to apply it over a large swathe of data, which is in and of itself a bad idea. It can also be trivially rewritten as a simple triple pattern like so:

SELECT *
WHERE
{
?s ?p <http://example.org/Class>
}

Note that many SPARQL engines actually do a rewrite of the query similar to this when they see such queries, but this rewrite is not always possible depending on the RDF term and the complexity of the portion of the query that the FILTER applies over. If you find yourself writing FILTERs like this, consider using the latter form. This can also apply even when you have a more complex filter like so:

SELECT *
WHERE
{
?s ?p ?o
FILTER (?o = <http://example.org/Class> || ?o = <http://example.org/OtherClass>)
}

This SPARQL query could be better rewritten as a UNION like so:

SELECT *
WHERE
{
{ ?s ?p <http://example.org/Class> } UNION { ?s ?p <http://example.org/OtherClass> }
}

Again, this will likely be much more performant because it creates a lot less work for the query engine than the FILTER form of the query.
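If your engine supports SPARQL 1.1, a VALUES block is another way to express the same restriction. This is a different technique from the UNION rewrite above, offered as a sketch; whether it is faster than UNION depends on the engine, so treat it as something to try rather than a guaranteed win.

```sparql
SELECT *
WHERE
{
  # Restrict ?o to the two classes up front instead of filtering afterwards
  VALUES ?o { <http://example.org/Class> <http://example.org/OtherClass> }
  ?s ?p ?o
}
```

One practical difference: this form keeps ?o bound in the results, so you can still see which of the two classes each row matched, whereas the UNION form above does not bind ?o at all.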

Avoid SELECT *

A lot of users blindly use SELECT * for all their queries (even I tend to do this, particularly when writing examples like those earlier in this post), which can hurt query performance for a couple of reasons. Firstly, it means that the database has to transfer more data back to you, so you get your results slower. More importantly, if you only select the variables you are actually interested in, the optimizer and engine may be able to evaluate your query faster, because they don’t need to keep around all the variables in each solution for the whole duration of query evaluation.
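For instance, if all you need from the earlier triple pattern example is the matching subjects, a version that projects only ?s might look like this (the choice of ?s as the variable of interest is an assumption for illustration):

```sparql
# Project only the variable we actually need; ?p is matched but not returned
SELECT ?s
WHERE
{
  ?s ?p <http://example.org/Class>
}
```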

Avoid DISTINCT

DISTINCT is often a very costly operation for a system and depending on implementation may require the full materialization of all results before an engine can start returning results. Unless you really need it, try to avoid DISTINCT; if you aren’t sure whether your query needs DISTINCT, you may want to try using REDUCED instead.

REDUCED allows the engine to choose whether to remove duplicate results. If you still get duplicates when using REDUCED (which may vary by implementation), then you can apply DISTINCT instead to force duplicates to be removed at some cost to performance.
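As a sketch, REDUCED goes in the same position as DISTINCT. The query below reuses the earlier example pattern, where the same ?s can appear more than once if it is linked to the class through several predicates:

```sparql
# The engine may drop duplicate ?s rows here, but is not required to
SELECT REDUCED ?s
WHERE
{
  ?s ?p <http://example.org/Class>
}
```

If duplicates still come back and you cannot tolerate them, swapping REDUCED for DISTINCT is a one-word change.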

Use LIMIT

This one is somewhat obvious: only ask for as many results as you actually need by using LIMIT. Many systems calculate results in a streaming fashion anyway, so if you ask for fewer results, they have less work to do and can complete your query faster.
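A minimal example, again reusing the earlier pattern; the limit of 10 is an arbitrary illustrative value:

```sparql
SELECT ?s
WHERE
{
  ?s ?p <http://example.org/Class>
}
# Stop after the first 10 solutions
LIMIT 10
```

Bear in mind that without ORDER BY, which 10 results you get is up to the engine, and that adding ORDER BY generally means the engine has to sort the full result set before it can apply the limit, which may cancel out much of the benefit.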

Rob Vesse

Comments

    Nitish says

    I was unsure whether a SPARQL query is evaluated last to first or first to last.
    eg: Select ?employee Company{?employee foaf:knows :Tom. ?employee :belongsTo ?employeeCompany}

    Here, is the ?employee :belongsTo ?employeeCompany triple pattern evaluated first and then ?employee foaf:knows :Tom, or vice versa?
