The Enduring Myth of the SPARQL Endpoint

It surprises me that the Semantic Technology industry still talks with great frequency about the ‘SPARQL Endpoint’ (it’s come up a few times already at SemTech 2013). At best, a SPARQL Endpoint is useful as an ephemeral, unstable method to share your data. At worst, it is wasting the time and energy of providers and consumers of SPARQL endpoints due to the incompatible outcomes of scale and availability.

But before I explain this position, let me outline my understanding of what a SPARQL Endpoint is:

A technical definition:

  • An HTTP URL which accepts a SPARQL query and returns the results
  • Can return a variety of serialisations: Turtle, RDF/XML etc.
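
As a rough illustration of that definition, the sketch below builds (but does not send) an HTTP GET request carrying a SPARQL query in the `query` parameter, which is how the SPARQL Protocol passes queries over GET. The endpoint URL `http://example.org/sparql` and the helper name `build_sparql_request` are assumptions made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(
    endpoint_url: str,
    query: str,
    accept: str = "application/sparql-results+json",
) -> Request:
    """Build (but do not send) an HTTP GET request carrying a SPARQL query."""
    # The query text travels URL-encoded in the "query" parameter.
    url = endpoint_url + "?" + urlencode({"query": query})
    # The Accept header asks the endpoint for a particular serialisation.
    return Request(url, headers={"Accept": accept})


# Hypothetical endpoint, used only to show the shape of the request.
request = build_sparql_request(
    "http://example.org/sparql",
    "SELECT * WHERE { ?s ?p ?o . } LIMIT 10",
)
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it. Whether a given endpoint honours the requested serialisation is up to that endpoint.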

The intention of SPARQL endpoints

  • Give other people and organisations access to your data in a very flexible way
  • Eventually realise the potential of federated SPARQL whereby several SPARQL Endpoints are combined to allow complex queries to be run across a number of datasets
  • They are open for use by a large and varied audience

But what can SPARQL endpoints be used for? They are brilliant for hackdays, prototypes, experiments, toy projects etc. But I don’t think anything ‘real’ could ever be built using one.

There seems to be a cultural acceptance that SPARQL endpoints can be intermittently available, subject to rudimentary DOS attacks and have extremely long response times. This is no foundation for mass adoption of linked data technologies, and it certainly cannot form the fabric of web-based data infrastructure.

I want linked data to gain mainstream popularity. It is a great language for expressing meaningful data and fostering collaboration with data. But to succeed, people need to be able to confidently consume linked data to build apps and services reliably. To build a business on linked data means you need a source of regularly updated and highly available data. This takes investment, by the provider of the data, in highly available, secure and scalable APIs. This is already happening of course, but the SPARQL Endpoint endures.

How do SPARQL endpoints perform?

I thought I’d put my criticisms of SPARQL endpoints to the test, so I tried a few, and here’s what happened…

Note: the queries I have tried are intended to represent an intentional or accidental, rudimentary DOS attack. This is the kind of attack that a robust, open endpoint should be able to protect itself against.

Firstly, http://labs.mondeca.com/sparqlEndpointsStatus/index.html showed only 52% of known SPARQL endpoints as available. I don’t know how representative that is, but it’s not a good start.

Next, I tried some of the available ones. You’ll have to trust me that I picked these four at random…

(Apologies to the providers of these endpoints, I am not singling you out, I am making a general point).

http://pubmed.bio2rdf.org/sparql

It took 30 seconds for the query editor to load. I ran the suggested query, and it hung for around 2 minutes, and then I got:

Virtuoso S1T00 Error SR171: Transaction timed out

http://sgd.bio2rdf.org/sparql

It took 30 seconds for the query editor to load. The suggested query ran quickly. I changed it to:

SELECT * WHERE { ?s ?p ?o . }

The results came back quickly, but then the data stopped being streamed back and hung for over 5 minutes, before I stopped waiting.

I then tried:

SELECT * WHERE { ?s ?p ?o . ?a ?b ?c . ?e ?f ?g . }

And got:

Virtuoso 42000 Error The estimated execution time -1308622848 (sec) exceeds the limit of 1000 (sec).

That’s an ugly error message, but at least there is a protection mechanism in place.
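
The three triple patterns in that query share no variables, so each one matches every triple in the store independently and the result is a Cartesian product. A small sketch of the arithmetic follows; the triple counts are illustrative assumptions, not the actual size of any endpoint mentioned here.

```python
def cross_product_rows(triple_count: int, pattern_count: int) -> int:
    """Solutions produced by `pattern_count` triple patterns that share no
    variables, each matching all `triple_count` triples in the store."""
    return triple_count ** pattern_count


# Illustrative store sizes only.
for triple_count in (1_000, 1_000_000):
    rows = cross_product_rows(triple_count, 3)
    print(f"{triple_count} triples -> {rows} solutions")
```

Even a modest store is pushed into row counts that no endpoint could stream back, which is why a query this short works as a rudimentary DOS attack.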

http://sparql.data.southampton.ac.uk

It worked fine for some friendly queries, but then I tried:

SELECT * WHERE { ?s ?p ?o . ?a ?b ?c . ?e ?f ?g . }

and got:

Error: Connection timed out after 30 seconds in ARC2_Reader missing stream in "getFormat" via ARC2_Reader missing stream in "readStream"

http://bnb.data.bl.uk/sparql

I ran this basic query when I started writing the blogpost:

SELECT * WHERE { ?s ?p ?o . }

It is still failing to load around 10 minutes later.

Update: it was pointed out that the above are all research projects, so I tried data.nature.com/query and http://metis.bbyopen.com/sparql?query= too, and got similar results – connection reset and 60 second+ response times.

The incompatible aims of scale and availability

Whilst “premature optimisation is the root of all evil”, it would be reckless to build a software system that was fundamentally incapable of scaling. A SPARQL Endpoint is just such a system.

SPARQL is a rich and expressive querying language, and like most querying languages, it is straightforward to write highly inefficient queries. Various SPARQL engines have mechanisms for protecting against inefficient queries: timeouts, limits to the number of triples returned, but most of these are blunt tools. Applying them gives the user a highly inconsistent experience. A SPARQL endpoint can also take no advantage of returning previously computed results based on knowledge about the data update frequency, or how out-of-date it is acceptable for the data to be.
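
To show why these mechanisms are blunt, here is a minimal sketch of a row cap and a timeout wrapped around a stream of result rows. The names `guarded` and `QueryAborted` are hypothetical and are not taken from any particular SPARQL engine; real engines enforce their limits internally and differently.

```python
import time
from typing import Iterable, Iterator, Tuple


class QueryAborted(Exception):
    """Raised when a query runs past its time budget."""


def guarded(
    rows: Iterable[Tuple], max_rows: int, timeout_s: float
) -> Iterator[Tuple]:
    """Yield rows until either the row cap or the deadline is reached."""
    deadline = time.monotonic() + timeout_s
    for count, row in enumerate(rows):
        if count >= max_rows:
            # Row cap: the result is cut short without telling the consumer.
            return
        if time.monotonic() > deadline:
            # Timeout: the work done so far has already been paid for.
            raise QueryAborted(f"gave up after {timeout_s} seconds")
        yield row


# A stand-in for a result stream of ten single-column rows.
fake_rows = ((i,) for i in range(10))
print(list(guarded(fake_rows, max_rows=3, timeout_s=60.0)))
```

Neither guard knows anything about the query itself. The cap returns an incomplete answer, and the timeout only fires after server resources have been spent.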

So if a SPARQL endpoint is ever intended to be successful, and have many (1000+) frequent consumers of data, and remain open to any SPARQL query, it is my opinion that it would be impossible to also have acceptable response times (< 500ms) and reasonable availability (99.99%).

There is a reason there are no ‘SQL Endpoints’.

What are the alternatives?

The main alternative to me is obvious: Open RESTful APIs:

  • Open APIs can provide access to data in only the ways that will scale
  • Open APIs can make generous use of caches to reduce the number of queries being run
  • Open APIs can make use of creative additional ways to combine data from various sources, and hide this complexity from their users
  • Open APIs can continue to provide legacy data structures even if the underlying data has changed. This is important to maintaining APIs over long periods of time.
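
A minimal sketch of the first two points: an API that exposes only a fixed set of routes, each backed by one query template and its own cache lifetime. The route names, the example.org vocabulary, the lifetimes and the `CachedApi` class are all assumptions invented for illustration, and `run_query` stands in for whatever actually talks to the triple store.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical routes: (fixed query template, cache lifetime in seconds).
ROUTES: Dict[str, Tuple[str, int]] = {
    "latest-news": (
        "SELECT ?item WHERE { ?item a <http://example.org/NewsItem> } LIMIT 10",
        30,
    ),
    "highest-mountains": (
        "SELECT ?m WHERE { ?m a <http://example.org/Mountain> } LIMIT 10",
        30 * 24 * 3600,
    ),
}


class CachedApi:
    """Serve a fixed set of routes, caching each one for its own lifetime."""

    def __init__(
        self,
        run_query: Callable[[str], list],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._run_query = run_query
        self._clock = clock
        # route -> (expiry time, cached result)
        self._cache: Dict[str, Tuple[float, list]] = {}

    def get(self, route: str) -> list:
        if route not in ROUTES:
            # Anything outside the published routes is refused up front.
            raise KeyError(f"unknown route: {route}")
        query, ttl = ROUTES[route]
        now = self._clock()
        cached = self._cache.get(route)
        if cached is not None and now < cached[0]:
            return cached[1]
        result = self._run_query(query)
        self._cache[route] = (now + ttl, result)
        return result


calls = []


def fake_backend(query: str) -> list:
    """Stand-in for a triple store; records each query it is asked to run."""
    calls.append(query)
    return ["row"]


api = CachedApi(fake_backend)
api.get("latest-news")
api.get("latest-news")
print(len(calls))
```

Because consumers can only name a route, the provider decides which queries ever reach the store and how stale each answer is allowed to be.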

A second alternative is data dumps. These have limited use, because the data is often not useful until it has undergone processing or ingest into a SPARQL engine.

A third alternative is a self-provisioned SPARQL endpoint. Cloud technologies are making this approach more viable. It would allow a potential data consumer to ‘spin-up’ their own, personal SPARQL endpoint which would be pre-loaded with a periodically updated RDF data dump. This approach allows the provider to massively reduce the cost of supplying and maintaining the endpoint, and a consumer takes responsibility for the stability of their own SPARQL endpoint, without affecting any other consumers.


25 thoughts on “The Enduring Myth of the SPARQL Endpoint”

  1. Heresy Dave! Perhaps? Surely if I have created my SPARQL endpoint I have reached the ultimate goal of opening my data, fulfilled my implementation and I can sit back and relax now?

    I’m being foolish of course. But I wouldn’t underestimate the appeal of such a thing.

    The reason there are no SQL endpoints is that relational systems still generally reside in a de facto closed data culture. In order to run SQL queries I have to have detailed knowledge of your highly optimised database, tables, joins and all, and then even if I am allowed it and the post processing overhead is worth the investment, it’s just easier for me to ask for your data. In csv for example.

    SPARQL on the other hand, by association to RDF – Linked Data – Linked Open Data, resides in an aspirationally open data culture where a SPARQL endpoint makes more sense.

    The real reason the endpoints you have specified do not work is that there is no concerted effort to have them fixed. As an aside, whilst the queries you ran are technically valid, they are far less efficient than providing a data dump, ie all of the data. I don’t believe that they are truly representative of a useful query.

    Which leads me to say – expressive SPARQL or not – I need to have an intrinsic knowledge of your data model to write useful queries. But rather more to the point – I actually have to want and use your data.

    So I think your point is valid, but the reasons are more cultural than technical. I worry that there is a critical mass of data need that we must reach for SPARQL endpoints to become sensible. But the problem is not that the critical mass is large, but we don’t know how large it is.

    Tom

    • Hi Tom,

      Heresy indeed! I guess I am being intentionally a bit provocative, as this sometimes helps. I do think SPARQL Endpoints play an important role in the development of ideas. I guess the main point is that too many people don’t realise that they’re going to have to be thrown away if your data becomes remotely popular.

      I take the cultural point with SQL. My analogy was a bit glib!

      And the ‘prior knowledge’ point is a very good one, although high quality documentation and example uses could help mitigate this.

      I would say that the reasons are both cultural and technical. I think the technical reasons stand up.

      Dave

  2. For any kind of production-grade deployment you are looking at either regular data dumps (that’s what the BBC does) or a SPARQL endpoint with an SLA and access control attached to it. Using public SPARQL endpoints for production is an awful idea. But that’s also true for using any kind of public API (REST, SOAP, whatever). You can also implement caching and access control on SPARQL endpoints, so I am not seeing the benefit of moving to another kind of API. In fact, creating an API will limit the usefulness and the number of implementations for your data. It means that someone will sit down and think about what might be useful for other people and then limit the access by setting up the API. With SPARQL there’s really no limit to what you can query.

    • The caching point is an interesting one.

      The reason I contrast is that I can’t see how a cache can be practically implemented on SPARQL endpoint: the expectation is that the data will be fresh, because you are making a query, and the flexibility of SPARQL means that queries to achieve the same results could be written in different ways (and therefore diluting the cache). But the most fundamental difference is the ability to use domain-informed caching timeouts in an API. If I ask for ‘latest, breaking news’ I’d want a cache timeout of around 30 seconds, if I ask for “The top ten highest mountains in the world” I can probably live with a month-long cache.

      I would disagree that using public APIs in production is also an awful idea. There are several highly successful examples of public RESTful APIs. In fact, some companies have built their entire business on the idea (e.g. Mashery).

  3. I have the same problem with SQL that you have with SPARQL. Whenever I do an SQL SELECT statement asking for all of the data in all of the tables in a particular database, it either takes forever or times out. Outrageous!

  4. Pingback: Friday Links | Meta Rabbit

  5. Hi ! some thoughts :

    1. Your test queries are fetching the entire content of the endpoint; the only acceptable answer of an endpoint to such queries is a timeout or an error.
    2. Open APIs are all different: different parameters, different output formats. SPARQL + RDF standardize this to a standardized query language and standardized output/data formats; taking the best of both worlds, it would make perfect sense for me to see “SPARQL Endpoints”/“Open APIs” accepting a limited subset of SPARQL as a query language, and returning RDF. These would fit perfectly with your Open API criteria (scaling, cache, combining data sources, providing legacy data), while still using linked data identifiers and a subset of the SPARQL query language, thus maximizing the reusability of this endpoint/API;
    3. SQL databases require client drivers; SPARQL does not, since the protocol is also standardized;
    4. I agree I would not put a system in production that relies on an external SPARQL datasource without having guarantees that this datasource will scale enough for my system and will answer in acceptable response times; however these technologies also make perfect sense in the context of enterprise, internal, linked data. Using D2RQ server to open and share enterprise application data through SPARQL endpoints is now a common use case.

    • Hi Thomas, thanks for your feedback. I thought I’d respond to some of your points….

      1) I would describe misleading errors and timeouts as generally unacceptable responses for a useful open service. The ideal response would be a meaningful error, indicating that the query is either too complex, or will use up the server resources and affect the experience of other queries being run. However, these types of query are hard to detect efficiently. Most errors seen above are a result of problems executing the query, so the resources have already been used.

      2) Sometimes standardisation is in competition with the resiliency, efficiency and other pragmatic characteristics of a service. One conclusion of my blogpost would be that SPARQL, as a standard, creates the illusion that a web of interoperating services could be built by connecting various datasources together. The reality, in my opinion, is that SPARQL is too open a standard on which to build genuinely scalable web-based services. Sometimes standardisation is helpful, sometimes it is counter-productive.

      Perhaps your suggestion about sub-setting SPARQL could help with this, but I think that due to the complexity of the SPARQL Query Language, creating subsets that are generally useful would be challenging. If the subsets are a set of allowable templated queries, then this is getting close to the type of API I have been advocating.

      3) If, hypothetically, SQL was a consistently implemented, HTTP-based standard, my argument would remain the same. SQL is a sufficiently open and powerful query language for it to be a bad idea to expose access to it publicly. So-called ‘integration in the database’, where multiple software systems connect to the same conceptual database using a general purpose query language, is considered by many to be an architectural anti-pattern. This is because the code of the calling system becomes bound to the structure of the SQL database, adding to the complexity of change.

      4) Following on from my previous point, I personally think that D2RQ or similar SQL-RDF technologies are generally not fit for purpose for agile software systems. You can get from zero to a working system rapidly, but binding your ontological models to the structure of a SQL database is not a good recipe for a longer-term evolving software system.

  6. I think that you’re right in that RESTful data APIs is a much more fruitful direction to pursue in the short term, and in the long term also, I think that most likely, Linked Data applications will mostly use Hypermedia RDF, as I prefer to say.
    Still, let me forward some criticism: First of all, there’s no myth. People are very aware of the shortcomings of SPARQL Endpoints. In fact, database people are coming with their “we tried this 30 years ago, and it didn’t work”, as they always do (don’t they…?) 😉 Mondeca has shown this problem in the blogosphere, and they have a paper at ISWC. Kai-Uwe Sattler’s database group has had papers on it at previous conferences, and yes, the complexity analysis has been done, and we know what will blow up. And like you, everyone who tries this gets pretty much the same result. You can put up your own SPARQL endpoint, don’t make it public, and that works ok. And you can use some auxiliary data from the LOD cloud, but you can’t rely on it for anything critical. Those who don’t know that already haven’t been paying attention. There’s no myth.
    However, I think that SPARQL is still very strong for some use cases where you need to query a lot of data to drill down to a small result. Moreover, this is the web, there are ideological and societal reasons why we want to have it distributed. The reasons why the web is distributed still hold true for the semantic web. Now, I’m not going to try to convince you that you should give SPARQL another chance, as I do agree that hypermedia RDF is far more urgent, but I’ll justify why my main research topic is still SPARQL federation. And I feel good about it, despite what some of the best database people in the world are saying. 🙂
    First of all, yes the current techniques to fend off attacks are blunt. But they don’t need to be, we just haven’t gone very far. In almost every database, there’s a cost estimation phase, and in that phase, one could do some statistics to estimate the chance of committing the error of rejecting a query that one would be able to answer, and the error of accepting a query that one wouldn’t be. And from there, it is normal risk management. It is clear that on a public endpoint, there are queries you will not answer, but they can be declared in the service description, and as such, you will give your users sufficient predictability. There’s a lot of nice research opportunities here. I wouldn’t at all dismiss it as out of hand.
    Moreover, what you said about caching: “The reason I contrast is that I can’t see how a cache can be practically implemented on SPARQL endpoint”, you should be really careful about using such arguments. It is a good example of an “Argument of Personal Incredulity”: http://rationalwiki.org/wiki/Argument_from_incredulity#Personal_incredulity and whenever you make such an argument, you should think: “Perhaps somebody else can?”. And the answer is “yes”. Just google it. 🙂 And that’s one of the things the database people didn’t have 30 years ago. They didn’t have an Internet full of caching proxies that cache HTTP messages. Now we have that, and it may help a lot.
    Finally, while most of the SPARQL endpoints Mondeca/OKFN regularly queries have long response times for really simple queries, they must have done something fundamentally wrong somewhere in their HTTP stack. There’s just no focus on it. It must be really easy to fix. It can’t be that hard. Really. The harder part is to accommodate actual difficult queries.
    I’m pretty sure it can be done, but it’ll take years. Hypermedia can be done today, so you are on the right track!

    • Thank you for this great response, sorry I took so long to approve it.

      Just a couple of niggles:

      I stand by my caching point, but I didn’t make it clear enough. So instead…

      I can’t see how a cache can mitigate denial of service attacks on a SPARQL endpoint when the language remains open-ended and is therefore subject to the creation of infinitely variable queries. (I googled this to no avail, so the argument from incredulity stands, right?)

      Also, I accept that it is not a universally accepted myth, but some of the responses to the post make me think that there’s still a fair number of people who buy it.

      Dave

      • Thanks for the response, Dave! I think your argument has some merit, even though an argument from incredulity cannot ever be valid. 🙂 You could probably prove that an open-ended language, if left open-ended by the endpoint, cannot use a cache efficiently. But there are too many assumptions in there when it comes to the real world. One is that we probably wouldn’t let it be open-ended. Another is that you can cache parts of the query (4store does this with great effect, I’ve seen a factor of 50 improved responses on a SPARQL query with an OPTIONAL clause compared to a previous one without). Without having done it, I could imagine utilizing ETags to do the same on the open Web. Then comes the factor that many are likely to query the same things, leading to further opportunities for caching. There is some stuff you can do. You probably can’t defeat a DoS attack, like you can by just putting Varnish in front of a small, seldom-changing hypermedia API; you’d have to limit what the endpoint can return to do that.
        However, if you get to some of the use cases I work on, I think a hypermedia system will soon run into worse problems than SPARQL endpoints. For example: Find me a romantic weekend getaway for a couple somewhere in Europe (well, the query has to be made more explicit than that, but the basic idea is there). If, for some reason, the system started by interpreting “romantic” and Europe, you’d be lucky, there are probably just a handful that have been named so in some resources that may act as your starting point. But say that you’re not so fortunate, your system starts at your airport, finds all flights that go to certain cities within a short time frame, and within these cities, looks into the hotels that have a good rating for couples on TripAdvisor or something. Already there, you’d have retrieved so many resources that SPARQL would probably have been better at it, even though caching could still play a decent role. But it only gets worse when the system starts to inquire about availability and time schedules. Then you get frequently updated information, so caching won’t help a lot.
        Actually, http://yr.no/ has something like this problem. They publish weather forecasts several times a day for 8 million places around the world (using for every place in Geonames, etc.). They get a lot of traffic on single resources, so caching helps them very little. So, in such cases, you have pretty much the same problem as with a SPARQL query: So much variability that it really helps little.
        At this point, you have arrived at one of the main problems that databases are good at: Selectivity estimation so that you can limit the number of GET requests very early. Basically, a hypermedia app must look up Romantic first, then the time schedules for flights to constrain the number of possible cities within your given time constraints, then look up some hotels within the relatively small number of hotels. This is a pretty difficult problem. There’s a lot of good and implemented theory for it when it comes to single databases. It is much more difficult for federated queries (which is what I’m working on) but for hypermedia systems, it is really, really hard. If you do it wrong, I would guess you’d end up with millions of GET requests for a simple question like the above.
        I think hypermedia systems should focus on cases where the scope of the query is pretty constrained from the start. Like, if you can start with a human readable page about Romantic getaways, you’d have much more luck. It might just work to power that by a small number of GET requests to RDF resources to help plan that.
        But there are plenty of cases where your questions are open-ended, or rather open-beginned, and that’s where I think SPARQL is a much more promising direction, because then you have the selectivity problems up front and you can deal with them. Though, I fear that the problem is actually so difficult it’ll take a long time to get there. Like decades… And then, you could argue that narrow-start hypermedia systems would have been a better use of one’s time on earth… 😉

  7. Hi Dave,

    We successfully use SPARQL to back stable APIs over large datasets (see http://dev.openphacts.org). SPARQL and triple stores have made it simpler for us to implement a flexible data integration system. However, we use it in the backend because it’s difficult to a) keep a system endpoint up with arbitrary queries and b) get developers to write performant queries because of the flexibility of the language. As one database expert told me, she loves SPARQL for writing queries – she hates it for optimization.

    I think SPARQL has its place. One needs to pick and choose the right set of components from any technology stack (including the semantic web) to do the job.

    Finally, one thing to say for SPARQL as a standard is it does allow easy comparison and swapping out of graph stores. Such vendor independence is useful.

  8. It seems to me this post is arguing about the wrong problem.

    Two observations are mentioned: there is an observation that many (actually the majority) of the SPARQL services simply aren’t available, and the other observation is that the ones that are fail to deliver for a particular kind of query.

    From these observations, the conclusion is drawn that SPARQL endpoints are not feasible in general (at least not under the -reasonable- requirements of availability and response time).

    It seems to me this is a false conclusion, or at best, premature.

    From a practical point of view I agree that these observations do not paint a favorable picture of SPARQL endpoints in general, but in order to draw this conclusion, more data is needed indicating why these endpoints aren’t available or fail to deliver. Alternatively, one could attempt to show that SPARQL and SPARQL endpoints have intrinsic flaws that explain lack of availability and bad response times. However, the article does no such thing.

    Instead, an alternative way to access linked data is proposed using “Open RESTful APIs”. To me this suggestion seems to generate more questions than answers.

    The article seems to assume that “Open RESTful APIs” are intrinsically more available and have better response times than SPARQL endpoints in general. However, no argument is given as to why that should be the case. Secondly, “Open RESTful API” sounds like a protocol, whereas SPARQL sounds like a language. I would argue that many of the benefits of “SPARQL endpoints” have to do with the language SPARQL. If you’re not specifying an alternative language that these “Open RESTful APIs” would provide, it is impossible to choose against SPARQL.

    • Hi Roland,

      I would argue that SPARQL endpoints won’t scale precisely because they use SPARQL, a language. Languages are inherently flexible tools; they are expressive. It is this expressivity that gives the user of a language the power to perform a huge variety of denial of service attacks on a SPARQL endpoint. REST APIs are a bit more like a protocol, although I would describe them as a service. Services are typically limited to offering only that which the service provider is capable of offering in a sustainable manner.

      Dave

  9. Hi Dave, thanks for the reply!

    You make a fair point about how the expressiveness of the language can pose challenges; however, this problem has been solved already by paging the results. I don’t think SPARQL defines that within the language itself, but that could be a property of the service through which you execute the SPARQL requests.

    With regard to REST APIs being immune to that: I don’t agree – or at least, I think it is a matter of implementation. A typical REST example includes a “list” operation. It seems to me that such an operation, without any restrictions or paging, leads to exactly the same issues. (Which can be solved by exactly the same solutions too).

    Anyway – I don’t want to drag this on just for the sake of discussion. I think you’re right in the sense that it is much more complex to build an efficient and performant query engine like one would need for SPARQL as compared to a more low level data access API. In that sense there is a practical limit, and perhaps your proposal to use REST APIs instead will solve that or at least offer less of a limitation. But I do not think it is correct to categorically state that this is the fault of SPARQL as a language.

    kind regards,

    Roland

    • Thanks again Roland, I agree with your closing comment, as in the post, I intended to make a distinction between SPARQL the query language, and SPARQL endpoints (the full HTTP specification of SPARQL).

      Also, do consider that pagination does not solve the problem of query complexity. With enough data, it is straightforward to write inefficient or complex SPARQL that returns only a few results.

      Dave

  10. Pingback: Vidéo parodique sur la disponibilité de DBPedia – et les services SPARQL | Thomas Francart

  11. Very good point!
    We actually arrived at a very similar solution at Cognitum and our Semantic Server database (Ontorion) is wrapping a SPARQL endpoint (and much more) in an API. So you can make SPARQL queries but also other types of queries only through the API. This gives us the possibility to speed up response time and also to manage the possible queries that are executed.

  12. While sometimes you might experience problems with the SPARQL endpoints, the real frustration is the issue with the long-term availability of the default graph/LOD dataset behind the endpoints. If you check the LOD Cloud Diagram and the Datahub.io registrations of the LOD Cloud datasets, a large share of them is not available anymore.

  13. Pingback: Technical features of a register | Data in government

  14. OK – agreed. The other point about SPARQL is the lack of available client-side tooling.

    However, SPARQL does give a degree of standardisation that is not present within the emerging API infrastructure which I think the above objections have likely alluded to.

    Whilst the establishment of standards for *data* is becoming understood within the community, I feel that we are generally stuck with a blind spot regarding the *APIs* themselves. Generally the strategy seems to be “roll your own” when it comes to RESTful APIs and this approach is not surprising when many big commercial players themselves also do the same thing (e.g. https://dev.twitter.com/rest/public and https://data.police.uk/docs/ )

    This means that you need to be a developer before you get started with the data…not bad if you are creating jobs for developers, but not so good if you are interfacing directly with data scientists.

    The Geo community seems to be more aware of this and there are both standards and client tools ready to exploit geographic data endpoints without developer capability needed. They seem to be ahead of the general non-geographic data community in their understanding?

    As a result, here in Wales we adopted OData as the API service for http://statswales.gov.wales (open a cube view, go down to the bottom and link to the data using the open data tab).

    Also have a look here: https://en.wikipedia.org/wiki/Open_Data_Protocol

    The standard originated from Microsoft (so can be used by an awful lot of users already) but has now been given to OASIS to hopefully remove some of the concerns regarding vendor lock in etc.
