The Enduring Myth of the SPARQL Endpoint

It surprises me that the Semantic Technology industry still talks with great frequency about the ‘SPARQL Endpoint’ (it’s come up a few times already at SemTech 2013). At best, a SPARQL Endpoint is useful as an ephemeral, unstable way to share your data. At worst, it wastes the time and energy of both providers and consumers because of the incompatible aims of scale and availability.

But before I explain this position, let me outline my understanding of what a SPARQL Endpoint is:

A technical definition:

  • An HTTP URL which accepts a SPARQL query and returns the results (sketched below)
  • Can return a variety of serialisations: Turtle, RDF/XML etc.
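
To make that definition concrete, here is a minimal sketch in Python of what ‘an HTTP URL which accepts a SPARQL query’ means in practice. The endpoint URL is illustrative; the SPARQL Protocol lets the query travel as a plain GET parameter, with the Accept header selecting the results serialisation:

import requests

# Hypothetical endpoint URL; any standard SPARQL endpoint works the same way.
ENDPOINT = "http://example.org/sparql"

query = "SELECT * WHERE { ?s ?p ?o . } LIMIT 10"

# The protocol is plain HTTP: the query goes in the 'query' parameter,
# and the Accept header asks for a particular results serialisation.
response = requests.get(
    ENDPOINT,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
print(response.json())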

The intention behind SPARQL endpoints:

  • Give other people and organisations access to your data in a very flexible way
  • Eventually realise the potential of federated SPARQL, whereby several SPARQL Endpoints are combined so that complex queries can be run across a number of datasets (a sketch follows this list)
  • Be open for use by a large and varied audience
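
To illustrate that federated intention, here is a sketch of a SPARQL 1.1 federated query, sent in the same way as any other. The endpoint URLs are hypothetical, and the coordinating endpoint would need to support the SERVICE keyword:

import requests

# A SPARQL 1.1 federated query: each SERVICE clause asks the coordinating
# endpoint to fetch matching data from another (hypothetical) endpoint.
federated_query = """
SELECT ?person ?publication WHERE {
  SERVICE <http://people.example.org/sparql> {
    ?person a <http://xmlns.com/foaf/0.1/Person> .
  }
  SERVICE <http://papers.example.org/sparql> {
    ?publication <http://purl.org/dc/terms/creator> ?person .
  }
}
"""

response = requests.get(
    "http://example.org/sparql",  # the endpoint that coordinates the join
    params={"query": federated_query},
    headers={"Accept": "application/sparql-results+json"},
)
print(response.json())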

But what can SPARQL endpoints be used for? They are brilliant for hackdays, prototypes, experiments, toy projects etc. But I don’t think anything ‘real’ could ever be built on one.

There seems to be a cultural acceptance that SPARQL endpoints can be intermittently available, vulnerable to rudimentary DoS attacks, and prone to extremely long response times. This is no foundation for mass adoption of linked data technologies, and it certainly cannot form the fabric of a web-based data infrastructure.

I want linked data to gain mainstream popularity. It is a great way of expressing meaningful data and of fostering collaboration around data. But to succeed, people need to be able to consume linked data with confidence, so that apps and services can be built on it reliably. Building a business on linked data requires a source of regularly updated, highly available data, and that takes investment, by the provider of the data, in highly available, secure and scalable APIs. This is already happening, of course, but the SPARQL Endpoint endures.

How do SPARQL endpoints perform?

I thought I’d put my criticisms of SPARQL endpoints to the test, so I tried a few, and here’s what happened…

Note: the queries I tried are intended to represent a rudimentary DoS attack, whether intentional or accidental. This is the kind of attack that a robust, open endpoint should be able to protect itself against.

Firstly, only 52% of known SPARQL endpoints were reported as available by http://labs.mondeca.com/sparqlEndpointsStatus/index.html. I don’t know how representative that figure is, but it’s not a good start.

Next, I tried some of the available ones; you’ll have to trust me that I picked these four at random…

(Apologies to the providers of these endpoints; I am not singling you out, I am making a general point.)

http://pubmed.bio2rdf.org/sparql

It took 30 seconds for the query editor to load. I ran the suggested query; it hung for around 2 minutes, and then I got:

Virtuoso S1T00 Error SR171: Transaction timed out

http://sgd.bio2rdf.org/sparql

It took 30 seconds for the query editor to load. The suggested query ran quickly. I changed it to:

SELECT * WHERE { ?s ?p ?o . }

The first results came back quickly, but then the data stopped being streamed back and the query hung for over 5 minutes before I stopped waiting.

I then tried:

SELECT * WHERE { ?s ?p ?o . ?a ?b ?c . ?e ?f ?g . }

And got:

Virtuoso 42000 Error The estimated execution time -1308622848 (sec) exceeds the limit of 1000 (sec).

That’s an ugly error message, but at least there is a protection mechanism in place.

http://sparql.data.southampton.ac.uk

It worked fine for some friendly queries, but then I tried:

SELECT * WHERE { ?s ?p ?o . ?a ?b ?c . ?e ?f ?g . }

and got:

Error: Connection timed out after 30 seconds in ARC2_Reader missing stream in "getFormat" via ARC2_Reader missing stream in "readStream"

http://bnb.data.bl.uk/sparql

I ran this basic query when I started writing this blog post:

SELECT * WHERE { ?s ?p ?o . }

It had still not returned around 10 minutes later.

Update: it was pointed out that the above are all research projects, so I tried data.nature.com/query and http://metis.bbyopen.com/sparql?query= too, and got similar results: connection resets and 60+ second response times.

The incompatible aims of scale and availability

Whilst “premature optimisation is the root of all evil”, it would be reckless to build a software system that is fundamentally incapable of scaling. A SPARQL Endpoint is just such a system.

SPARQL is a rich and expressive query language and, like most query languages, it makes it straightforward to write highly inefficient queries. Various SPARQL engines have mechanisms for protecting against inefficient queries (timeouts, limits on the number of triples returned), but most of these are blunt tools, and applying them gives users a highly inconsistent experience. A SPARQL endpoint is also poorly placed to return previously computed results, because it has no knowledge of the data’s update frequency or of how out-of-date it is acceptable for the data to be.
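
To illustrate that caching point, here is a naive sketch of my own (not any engine’s actual mechanism): a results cache keyed on the raw query text. For a fixed set of API routes this works well, but for an open endpoint every ad-hoc query is a new key, and a safe lifetime for each entry is unknowable:

import time

CACHE = {}
TTL_SECONDS = 3600  # a guess: the endpoint cannot know the data's update frequency

def cached_query(query, run_query):
    # Reuse a previously computed result if it is still 'fresh enough'.
    entry = CACHE.get(query)
    if entry and time.time() - entry[0] < TTL_SECONDS:
        return entry[1]  # cache hit: no work for the SPARQL engine
    result = run_query(query)  # cache miss: pay the full query cost
    CACHE[query] = (time.time(), result)
    return result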

So if a SPARQL endpoint is ever intended to be successful, serving many (1,000+) frequent consumers of data while remaining open to any SPARQL query, it is my opinion that it cannot also offer acceptable response times (< 500 ms) and reasonable availability (99.99%).

There is a reason there are no ‘SQL Endpoints’.

What are the alternatives?

To me, the main alternative is obvious: open RESTful APIs:

  • Open APIs can provide access to data in only the ways that will scale
  • Open APIs can make generous use of caches to reduce the number of queries being run (see the sketch after this list)
  • Open APIs can combine data from various sources in creative ways, and hide that complexity from their users
  • Open APIs can continue to provide legacy data structures even if the underlying data has changed. This is important for maintaining APIs over long periods of time.
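
As a sketch of the first two points (my own illustration, not a reference implementation), a RESTful route exposes one fixed, optimisable access pattern and can declare exactly how long its results may be cached:

from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical helper: in a real service this would run one fixed,
# pre-optimised query against the underlying data store.
def fetch_books_by_author(author_id):
    return [{"id": "b1", "title": "An Example Book"}]

@app.route("/authors/<author_id>/books")
def books_by_author(author_id):
    # Only this one access pattern is exposed, so it can be made to scale.
    response = jsonify(fetch_books_by_author(author_id))
    # The provider knows the update frequency, so it can set an honest
    # cache lifetime; intermediaries and clients then absorb most of the load.
    response.headers["Cache-Control"] = "public, max-age=3600"
    return response

if __name__ == "__main__":
    app.run()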

A second alternative is data dumps. These have limited use on their own, because the data is often not useful until it has been processed or ingested into a SPARQL engine.
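
For completeness, here is a minimal sketch of that ingest step, assuming a downloaded Turtle dump (the filename is hypothetical) and Python’s rdflib:

from rdflib import Graph

# Load the provider's RDF dump into a local, in-memory graph.
g = Graph()
g.parse("dump.ttl", format="turtle")  # hypothetical dump file

# Once ingested, the data can be queried locally, without touching
# anyone else's endpoint.
results = g.query("SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o . }")
for row in results:
    print(row.triples)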

A third alternative is a self-provisioned SPARQL endpoint, an approach that cloud technologies are making ever more viable. It would allow a potential data consumer to ‘spin up’ their own personal SPARQL endpoint, pre-loaded with a periodically updated RDF data dump. This massively reduces the provider’s cost of supplying and maintaining the endpoint, and each consumer takes responsibility for the stability of their own endpoint without affecting any other consumers.
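
As a toy sketch of that idea (using Flask and rdflib, with a hypothetical dump file), a personal endpoint can be little more than a locally loaded dump behind an HTTP route:

from flask import Flask, Response, request
from rdflib import Graph

app = Flask(__name__)

# Load the provider's periodically published dump at start-up
# (filename hypothetical).
g = Graph()
g.parse("dump.ttl", format="turtle")

@app.route("/sparql")
def sparql():
    # A personal endpoint: if my queries overwhelm it, only I suffer.
    results = g.query(request.args["query"])
    return Response(results.serialize(format="json"),
                    mimetype="application/sparql-results+json")

if __name__ == "__main__":
    app.run()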