Don’t use a hammer to screw in a nail: Alternatives to REGEX in SPARQL

In the past I’ve talked about some tips for SPARQL implementation and performance, but one interesting type of query that I didn’t touch on comes up time and again as a SPARQL performance problem. In fact, more often than not, it is actually user naiveté that is the real cause.

So what does this horrible query look like? Strangely, it is quite innocuous, and I bet a good number of you have written exactly this at some point in the past:

SELECT *
WHERE
{
  ?s <http://some/predicate> ?o .
  FILTER (REGEX(?o, "search"))
}

While this looks like a perfectly sane query, the fact of the matter is that it is really anything but.  Any kind of FILTER in SPARQL involves iterating over all the possible solutions found at the point where the filter is applied and evaluating the expression.  The problem here is that REGEX is far and away one of the most expensive expressions you can ask your SPARQL engine to evaluate for several reasons:

  1. Your SPARQL engine has to look up the actual string value of the variable in question for every possible solution.
  2. It has to evaluate a regular expression over that string value.

But wait, lots of functions look up the actual value, so why is that such a big issue for REGEX?

Well, as any programmer can tell you, regular expressions are expensive to evaluate regardless of what language you are using them in. If you can avoid using a regular expression in favor of a simpler string computation, then you can likely get much better performance out of your SPARQL engine.

So let us start with my previous example, a simple case of finding values containing a simple search term.  With any SPARQL 1.1 implementation we can use the CONTAINS() function [1] instead like so:

SELECT *
WHERE
{
  ?s <http://some/predicate> ?o .
  FILTER (CONTAINS(?o, "search"))
}

This query will almost certainly run much faster, because it is doing a much simpler string search. Now you might rightly point out that one disadvantage of this approach is that REGEX can do a case insensitive match; however, we can simulate this by using the LCASE() or UCASE() functions [2,3] e.g.

SELECT *
WHERE
{
  ?s <http://some/predicate> ?o .
  FILTER (CONTAINS(LCASE(?o), "search"))
}

Another common style or regular expression query is the following:

SELECT *
WHERE
{
  ?s <http://some/predicate> ?o .
  FILTER (REGEX(?o, "^prefix"))
}

Here the user is looking for strings that start with some prefix; again, we can replace this with the much simpler STRSTARTS() function [4] like so:

SELECT *
WHERE
{
  ?s <http://some/predicate> ?o .
  FILTER (STRSTARTS(?o, "prefix"))
}

Conversely, if we care about strings that end with some prefix we can use the corresponding STRENDS() function [5].

Now sometimes we may have more complex string search requirements that the SPARQL String functions [6] really cannot cope with; however, you can still sometimes avoid regular expressions if your SPARQL engine supports full text extensions. Full text extensions are a whole family of non-standard and typically non-interoperable extensions that allow you to query textual values more directly in your SPARQL queries, often by using the advanced text search syntax of something like Lucene.

For example, the open source Jena Text [7] module from the Apache Jena project [8] allows you to write queries like the following:

PREFIX text: <http://jena.apache.org/text#>

SELECT *
{ ?s text:query (<http://some/predicate> 'search') ;
     <http://some/predicate> ?o .
}

This query achieves the same as our original example, but is likely orders of magnitude faster since it takes advantage of a specialized text index. Note that this inverts the query flow from our original example, rather than selecting all possibilities and then filtering over them. We instead select the relevant solutions directly and then select any further information we wanted.

Many other open source and commercial stores include some full text search extensions, so if you use these kinds of query commonly, then talk to your project/vendor about what they support in this area.

[1] http://www.w3.org/TR/sparql11-query/#func-contains
[2] http://www.w3.org/TR/sparql11-query/#func-lcase
[3] http://www.w3.org/TR/sparql11-query/#func-ucase
[4] http://www.w3.org/TR/sparql11-query/#func-strstarts
[5] http://www.w3.org/TR/sparql11-query/#func-strends
[6] http://www.w3.org/TR/sparql11-query/#func-strings
[7] http://jena.apache.org/documentation/query/text-query.html
[8] http://jena.apache.org

Rob Vesse

Comments

  1. 2

    Konrad says

    Thanks for the suggestions! Did you benchmark them to make sure that they really are faster than the alternative? I am especially interested in the performance of “CONTAINS(LCASE(…” vs REGEX with the “i” flag. Do some query engines optimize such filters internally?

  2. 3

    Dixit Singla says

    Thanks for the article.

    In my case I have used STRSTARTS still I the sparql query is taking a lot of time to run the query.

Speak Your Mind

Your email address will not be published. Required fields are marked *