Open-Source Triplestore Battle

Blazegraph vs. Fuseki

There are many graph databases out there that support the Resource Description Framework (RDF): Virtuoso, GraphDB, Stardog, AnzoGraph, and RDFox, to name just a few popular ones. But if the requirements for your triplestore include being open source, as they do for our CFI-funded LINCS project, then Blazegraph and Apache’s Jena Fuseki are two of your most mature options...

This article compares Blazegraph and Jena Fuseki, two contenders for the LINCS graph database. Thanks to Angus Addlesee for writing an article that compared Blazegraph with commercial triplestores and inspired the testing methodology for this post.

Blazegraph

Blazegraph, previously known as Bigdata, is a great triplestore that scales to billions of triples with thousands of proven use cases. In fact, it was so good that AWS bought the Blazegraph trademark almost five years ago and hired some of its staff, including the CEO. Unfortunately, that meant that most of Blazegraph’s development experience was used to create a competing product: Amazon Neptune. Although the official releases of Blazegraph have slowed down, it still supports SPARQL 1.1 and is by no means outdated.

Fuseki

Apache’s Fuseki, along with the entire Jena project and all its plugins, is still actively developed as of October 2020. It supports SPARQL 1.1, including SPARQL Update, and gets new features and enhancements with each new release, which takes place every quarter or so. We know that Fuseki can scale to loading the entire Wikidata dump. But what is query performance like, and how does it compare to Blazegraph? Let’s find out!

The Setup

Trying to have a fair competition in a matchup like this is very difficult. Different products almost always have different strengths, and selective benchmarking can easily skew results. Getting one-sided results was not the intention here, but I did choose a small set of tests, as an exhaustive test suite would require a book and not an article. My testing involved loading an Olympic sports dataset with ~1.8 million triples and then executing some timed SPARQL queries using the built-in web interface of both triplestores.

The Blazegraph instance is based on a September 2016 build from the 2.2.0 branch as per the Dockerfile. This image has full-text search enabled as well as a geo index.

I used this Dockerfile from the LINCS project to create a Fuseki instance based on the latest v3.16 release. It is a basic TDB2 configuration with a full-text index for all rdfs:label properties.

The tests were executed on an 8-series Core-i5 with SSDs and plenty of RAM. Neither triplestore was “warmed up” and queries were executed in the same order and the same number of times in an effort to keep the playing field as level as possible.

The Tests

The SPARQL queries used these prefixes:

    PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbp: <http://dbpedia.org/property/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

Queries were executed twice and both results were recorded.
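For anyone who would rather script the timings than read them off the web interface, a minimal Python sketch follows. It is not the method used for the numbers in this post. It assumes each triplestore exposes a standard SPARQL 1.1 Protocol endpoint, and the names `run_select` and `time_runs` are made up for illustration.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, List


def run_select(endpoint: str, query: str, timeout_s: float = 600.0) -> dict:
    """POST a SELECT query (SPARQL 1.1 Protocol, URL-encoded form) and parse the JSON results.

    `endpoint` is whatever query URL your own instance exposes (an assumption,
    not a value from the article). `timeout_s` is a socket timeout that only
    loosely mirrors the 10-minute limit used in the tests.
    """
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Accept": "application/sparql-results+json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


def time_runs(action: Callable[[], object], repeats: int = 2) -> List[float]:
    """Call `action` `repeats` times and return the wall-clock seconds of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings
```

A call such as `time_runs(lambda: run_select(endpoint, query))` would then return one timing per run, two by default. Timings taken this way include HTTP and JSON parsing overhead on the client, so they would not necessarily match what the web interfaces report.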

Loading Data

It is pretty important to most projects to know how long it will take to load data into the triplestore. Since our dataset is relatively small (< 2 million triples), I was able to use the web interface of both triplestores to load the TTL file without any issues.

Fuseki: 57s, 30.3s
Blazegraph: 57s, 21.5s

The second run was an update with the same TTL file used in the first run. No actual changes were made to the graph.
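The load in this test went through the web interfaces, but the same upload can be scripted. The sketch below is an assumption-laden illustration rather than what was actually run: it builds an HTTP POST of Turtle data in the style of the SPARQL 1.1 Graph Store HTTP Protocol, which Fuseki implements. The helper name `build_turtle_upload` and any URL you pass to it are hypothetical, and the request is only built here, not sent.

```python
import urllib.request


def build_turtle_upload(graph_store_url: str, turtle: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST request that uploads Turtle data.

    `graph_store_url` is an assumption: it stands for the data endpoint of
    your own dataset, for example a Graph Store Protocol URL on Fuseki.
    """
    return urllib.request.Request(
        graph_store_url,
        data=turtle,
        headers={"Content-Type": "text/turtle"},
        method="POST",
    )


# Tiny inline Turtle document used purely as sample input.
sample = b'<http://example.org/a> <http://example.org/b> "c" .\n'
request = build_turtle_upload("http://localhost:3030/olympics/data?default", sample)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the upload. Blazegraph also accepts RDF over HTTP, but check its REST API documentation for the exact endpoint and media type before relying on this shape.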

Counting Triples

The simplest of queries: count how many triples are in the dataset.

    SELECT (COUNT(*) AS ?triples)
    WHERE {
      ?s ?p ?o .
    }

Fuseki: 0.8s, 0.5s
Blazegraph: 0.02s, 0.01s

It looks like Blazegraph did some pre-aggregation here while loading the data.

Regex Filter

    SELECT DISTINCT ?name ?cityName ?seasonName
    WHERE {
      ?instance walls:games ?games ;
                walls:athlete ?athlete .
      ?games dbp:location ?city ;
             walls:season ?season .
      ?city rdfs:label ?cityName .
      ?season rdfs:label ?seasonName .
      ?athlete rdfs:label ?name .
      FILTER (REGEX(LCASE(?name), "louis.*"))
    }

Fuseki: 7.7s, 5.0s
Blazegraph: 7.0s, 4.2s

Blazegraph was faster on both runs of this typical query, by about 0.7 to 0.8 seconds.

Full-Text Searching

Using the full-text index efficiently required slightly different queries because Fuseki performed very slowly unless the full-text search was the first filter.

Fuseki query:

    PREFIX text: <http://jena.apache.org/text#>
    SELECT DISTINCT ?name ?cityName ?seasonName
    WHERE {
      ?athlete text:query ('louis*') ;
               rdfs:label ?name .
      ?instance walls:games ?games .
      ?games dbp:location ?city ;
             walls:season ?season .
      ?city rdfs:label ?cityName .
      ?season rdfs:label ?seasonName .
    }

Blazegraph query:

    PREFIX bds: <http://www.bigdata.com/rdf/search#>
    SELECT DISTINCT ?name ?cityName ?seasonName
    WHERE {
      ?instance walls:games ?games ;
                walls:athlete ?athlete .
      ?games dbp:location ?city ;
             walls:season ?season .
      ?city rdfs:label ?cityName .
      ?season rdfs:label ?seasonName .
      ?athlete rdfs:label ?name .
      ?name bds:search "'louis*'" .
    }

Fuseki: 0.2s, 0.1s
Blazegraph: 0.3s, 0.1s

Moving the full-text filter to the top of the Blazegraph query as well made it faster than Fuseki: 0.08s for the first run and 0.04s for the second run.
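The reordered Blazegraph query is not reproduced in this post, so the Python sketch below is only a guess at its shape. It assembles the same triple patterns with the `bds:search` pattern either last (as in the query shown above) or first, so that both orderings can be generated and timed. The function name `blazegraph_query` is hypothetical, and the patterns are written out as full triples instead of using the `;` shorthand.

```python
PREFIXES = """\
PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX bds: <http://www.bigdata.com/rdf/search#>
"""

# Full-text search pattern taken from the Blazegraph query in the article.
SEARCH_PATTERN = "?name bds:search \"'louis*'\" ."

# The remaining join patterns, one triple per line.
GRAPH_PATTERNS = [
    "?instance walls:games ?games .",
    "?instance walls:athlete ?athlete .",
    "?games dbp:location ?city .",
    "?games walls:season ?season .",
    "?city rdfs:label ?cityName .",
    "?season rdfs:label ?seasonName .",
    "?athlete rdfs:label ?name .",
]


def blazegraph_query(search_first: bool) -> str:
    """Return the full-text query with the search pattern first or last."""
    if search_first:
        patterns = [SEARCH_PATTERN] + GRAPH_PATTERNS
    else:
        patterns = GRAPH_PATTERNS + [SEARCH_PATTERN]
    body = "\n".join("  " + pattern for pattern in patterns)
    return (
        f"{PREFIXES}"
        "SELECT DISTINCT ?name ?cityName ?seasonName\n"
        f"WHERE {{\n{body}\n}}"
    )


print(blazegraph_query(search_first=True))
```

Whether pattern order matters at all depends on how much the query optimizer reorders joins, which is why measuring both variants is more reliable than reasoning about them.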

Complex Join

    PREFIX noc: <http://wallscope.co.uk/resource/olympics/NOC/>
    SELECT ?genderName (COUNT(?athlete) AS ?count)
    WHERE {
      ?instance walls:games ?games ;
                walls:athlete ?athlete .
      ?games dbp:location ?city .
      ?athlete foaf:gender ?gender .
      ?gender rdfs:label ?genderName .
      {
        SELECT DISTINCT ?city
        WHERE {
          ?instance walls:games ?games ;
                    walls:athlete ?athlete .
          ?athlete dbo:team ?team .
          noc:SCG dbo:ground ?team .
          ?games dbp:location ?city .
        }
      }
    }
    GROUP BY ?genderName

Fuseki: DNF
Blazegraph: 7.0s, 6.0s

Fuseki did not manage to finish this query before the configured timeout of 10 minutes.

Federated Query

This query joins the local data with a remote graph served by dbpedia.org over the internet.

    SELECT ?sport ?sportName ?teamSize
    WHERE {
      {
        SELECT DISTINCT ?sportName
        WHERE {
          ?sport rdf:type dbo:Sport ;
                 rdfs:label ?sportName .
        }
      }
      SERVICE <http://dbpedia.org/sparql>
      {
        ?sport rdfs:label ?sportName ;
               dbo:teamSize ?teamSize .
      }
    }
    ORDER BY DESC (?teamSize)

Fuseki: 7.9s, 7.5s
Blazegraph: 0.5s, 0.4s

Summary

| Test | Fuseki | Blazegraph |
| --- | --- | --- |
| Data Load | 57s | 57s |
| Triple Count | 0.8s | 0.02s |
| Regex | 7.7s | 7.0s |
| Full-Text | 0.2s | 0.1s |
| Complex | DNF | 7.0s |
| Federated | 7.9s | 0.5s |

I must admit that I was somewhat surprised by the results. Blazegraph performed consistently better than Fuseki in this scenario. The complex query that Fuseki just couldn’t finish could possibly be an indexing problem. Be that as it may, Blazegraph ran that same query just fine straight out of the box. Blazegraph also beat Fuseki by more than an order of magnitude with the federated query.
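To put the gaps in perspective, the summary figures can be turned into Fuseki-to-Blazegraph ratios. The short Python sketch below does that arithmetic using the values from the summary table, with `None` standing in for the DNF. The helper name `speedup` is made up for illustration.

```python
from typing import Dict, Optional, Tuple

# Timings in seconds as listed in the summary table: (Fuseki, Blazegraph).
# None marks the query that Fuseki did not finish.
SUMMARY: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "Data Load": (57.0, 57.0),
    "Triple Count": (0.8, 0.02),
    "Regex": (7.7, 7.0),
    "Full-Text": (0.2, 0.1),
    "Complex": (None, 7.0),
    "Federated": (7.9, 0.5),
}


def speedup(fuseki: Optional[float], blazegraph: Optional[float]) -> Optional[float]:
    """Return how many times longer Fuseki took, or None if either run is missing."""
    if fuseki is None or blazegraph is None:
        return None
    return fuseki / blazegraph


for test, (fuseki, blazegraph) in SUMMARY.items():
    ratio = speedup(fuseki, blazegraph)
    label = "n/a (DNF)" if ratio is None else f"{ratio:.1f}x"
    print(f"{test:<14}{label}")
```

The federated query comes out at roughly 15.8x and the triple count at roughly 40x, while the regex query is close to parity at about 1.1x.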

One possible explanation for these one-sided results is that Blazegraph’s indexes are better configured for this dataset and I need to apply more effort to get Fuseki’s indexes optimized. Please feel free to look at the config.ttl I used to configure Fuseki and let me know in the comments if I missed an obvious optimization or if I misconfigured something.

Fuseki Configuration Followup

Although I tried to configure Fuseki with the simplest full-text index possible, I feared that a misconfiguration might have been the cause of the comparatively disappointing performance. To rule out that possibility, and for the sake of completeness, I ran the benchmarks against default Fuseki databases created from the admin portal without any customization.

| Test | In-Memory Store | TDB Store | TDB2 Store |
| --- | --- | --- | --- |
| Data Load | 25s, 27s | 43s, 30s | 54s, 28s |
| Counting | 2.8s, 1.6s | 0.9s, 0.7s | 0.7s, 0.6s |
| Regex | 2.8s, 2.2s | 4.5s, 3.3s | 7.8s, 5.7s |
| Full-Text | Not Enabled | Not Enabled | Not Enabled |
| Complex | DNF | DNF | DNF |
| Federated | 8.7s, 7.4s | 8.5s, 8.3s | 7.7s, 7.7s |

The results were more or less aligned with the original performance figures. So, it seems that vanilla Fuseki is just considerably slower than Blazegraph for this dataset and these queries.