The Enduring Myth of the SPARQL Endpoint

It surprises me that the Semantic Technology industry still talks with great frequency about the ‘SPARQL Endpoint’ (it’s come up a few times already at SemTech 2013). At best, a SPARQL Endpoint is useful as an ephemeral, unstable way to share your data. At worst, it wastes the time and energy of both the providers and the consumers of SPARQL endpoints, because scale and availability are incompatible aims for an open endpoint.

But before I explain this position, let me outline my understanding of what a SPARQL Endpoint is:

A technical definition:

  • An HTTP URL which accepts a SPARQL query and returns the results
  • Can return a variety of serialisations: Turtle, RDF/XML, etc.
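
As a minimal sketch of such a request (the host here is hypothetical), a consumer sends the query as a URL parameter and negotiates the result format:

GET /sparql?query=SELECT%20%2A%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20LIMIT%2010 HTTP/1.1
Host: example.org
Accept: application/sparql-results+json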

The intention of SPARQL endpoints

  • Give other people and organisations access to your data in a very flexible way
  • Eventually realise the potential of federated SPARQL whereby several SPARQL Endpoints are combined to allow complex queries to be run across a number of datasets
  • Be open for use by a large and varied audience

But what can SPARQL endpoints be used for? They are brilliant for hackdays, prototypes, experiments, toy projects etc. But I don’t think anything ‘real’ could ever be built using one.

There seems to be a cultural acceptance that SPARQL endpoints can be intermittently available, vulnerable to rudimentary DoS attacks, and prone to extremely long response times. This is no foundation for mass adoption of linked data technologies, and it certainly cannot form the fabric of web-based data infrastructure.

I want linked data to gain mainstream popularity. It is a great way to express meaningful data and to foster collaboration around data. But to succeed, people need to be able to consume linked data with confidence, so that apps and services can be built reliably. To build a business on linked data, you need a source of regularly updated and highly available data. This takes investment, by the provider of the data, in highly available, secure and scalable APIs. This is already happening, of course, but the SPARQL Endpoint endures.

How do SPARQL endpoints perform?

I thought I’d put my criticisms of SPARQL endpoints to the test, so I tried a few, and here’s what happened…

Note: the queries I have tried are intended to represent a rudimentary DoS attack, whether intentional or accidental. This is the kind of attack that a robust, open endpoint should be able to protect itself against.

Firstly, only 52% of the known SPARQL endpoints listed at http://labs.mondeca.com/sparqlEndpointsStatus/index.html were available. I don’t know how representative that is, but it’s not a good start.

Next, I tried some of the available ones; you’ll have to trust me that I picked these four at random…

(Apologies to the providers of these endpoints, I am not singling you out, I am making a general point).

http://pubmed.bio2rdf.org/sparql

It took 30 seconds for the query editor to load. I ran the suggested query; it hung for around 2 minutes, and then I got:

Virtuoso S1T00 Error SR171: Transaction timed out

http://sgd.bio2rdf.org/sparql

It took 30 seconds for the query editor to load. The suggested query ran quickly. I changed it to:

SELECT * WHERE { ?s ?p ?o . }

The first results came back quickly, but then the stream stalled; after waiting more than 5 minutes, I gave up.

I then tried:

SELECT * WHERE { ?s ?p ?o . ?a ?b ?c . ?e ?f ?g . }

And got:

Virtuoso 42000 Error The estimated execution time -1308622848 (sec) exceeds the limit of 1000 (sec).

That’s an ugly error message, but at least there is a protection mechanism in place.

http://sparql.data.southampton.ac.uk

It worked fine for some friendly queries, but then I tried:

SELECT * WHERE { ?s ?p ?o . ?a ?b ?c . ?e ?f ?g . }

and got:

Error: Connection timed out after 30 seconds in ARC2_Reader missing stream in "getFormat" via ARC2_Reader missing stream in "readStream"

http://bnb.data.bl.uk/sparql

I ran this basic query when I started writing the blogpost:

SELECT * WHERE { ?s ?p ?o . }

It is still failing to load around 10 minutes later.

Update: it was pointed out that the above are all research projects, so I tried data.nature.com/query and http://metis.bbyopen.com/sparql?query= too, and got similar results: connection resets and 60+ second response times.

The incompatible aims of scale and availability

Whilst “premature optimisation is the root of all evil”, it would be reckless to build a software system that was fundamentally incapable of scaling. A SPARQL Endpoint is just such a system.

SPARQL is a rich and expressive querying language, and like most querying languages, it makes it straightforward to write highly inefficient queries. Various SPARQL engines have mechanisms for protecting against inefficient queries (timeouts, limits on the number of triples returned), but most of these are blunt tools, and applying them gives users a highly inconsistent experience. A SPARQL endpoint can also take no advantage of returning previously computed results, because it has no knowledge of the data’s update frequency, or of how out-of-date it is acceptable for the data to be.

So if a SPARQL endpoint is ever intended to be successful, serving many (1,000+) frequent consumers of data while remaining open to any SPARQL query, it is my opinion that it would be impossible to also have acceptable response times (under 500ms) and reasonable availability (99.99%).

There is a reason there are no ‘SQL Endpoints’.

What are the alternatives?

The main alternative to me is obvious: Open RESTful APIs:

  • Open APIs can provide access to data in only the ways that will scale
  • Open APIs can make generous use of caches to reduce the number of queries being run (see the sketch after this list)
  • Open APIs can use creative additional ways to combine data from various sources, and hide this complexity from their users
  • Open APIs can continue to provide legacy data structures even if the underlying data has changed. This is important for maintaining APIs over long periods of time.
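
As a sketch of the caching point above (the host and values are illustrative), an API can declare exactly how long each response may be cached — something an endpoint answering arbitrary queries cannot safely decide:

GET /teams HTTP/1.1
Host: api.example.org

HTTP/1.1 200 OK
Cache-Control: public, max-age=300
Content-Type: application/json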

A second alternative is data dumps. These have limited use, because the data is often not useful until it has undergone processing or ingest into a SPARQL engine.

A third alternative is a self-provisioned SPARQL endpoint. Cloud technologies are making this approach more viable. It would allow a potential data consumer to ‘spin-up’ their own, personal SPARQL endpoint which would be pre-loaded with a periodically updated RDF data dump. This approach allows the provider to massively reduce the cost of supplying and maintaining the endpoint, and a consumer takes responsibility for the stability of their own SPARQL endpoint, without affecting any other consumers.

Ontologies in software: A conflict of interest?

I build software using semantic technologies. As such, ontologies are at the heart of what I do. For me, ontologies serve two very different roles, and these can often be in conflict. The two roles that I observe are as follows:

  1. Collaborative ontologies: Initially for domain modelling discussion, with the eventual aim of standardisation, either across or within organisations
  2. Embedded ontologies: Form an integral part of a working software system, modelling both the domain, and other application-specific data structures

Below I have described some further characteristics I have observed of these two roles:

Collaborative ontologies

  • An attempt to model an entire domain
  • Discussed by a wide community, who can present edge-cases and suggested changes
  • Change slowly (or never) by necessity, allowing people to publish data without the model becoming out-of-date
  • A tendency to over-model, so that a majority of potential users’ requirements are met
  • Designed in advance of use
  • Aspires to be a truthful representation of the domain
  • An academic approach, whereby ontologies are published by authors
  • Equate to the ‘vision’ in agile software development terms

Embedded ontologies

  • An attempt to model the parts of a domain used within a software system
  • Discussed by the community using the system
  • Change frequently, in response to the software design and the requirements of the system
  • An MVO (minimal viable ontology), to match how the ontologies are used in the system
  • Designed in response to requirements of the system
  • Required to balance the demands of a software system, and the desire to be a truthful representation of the domain. For example, performance considerations
  • An engineering approach, whereby ontologies are emergent
  • Equate to the actual software in agile software development terms

In short, I would describe the most fundamental difference as follows:

  • Collaborative ontologies: designed to change as little as possible
  • Embedded ontologies: designed to be changed as easily as possible

The idea of an embedded ontology might be less familiar, as they are certainly less common. For me, they are simply the data model of a software system, expressed using linked data. This is obviously very much the case where linked data technologies, such as triple stores or RDF, are used to power the software.

I think both roles are important and necessary; however, the rest of this post is focused on demonstrating the need for embedded ontologies, because these are less widely used and understood.

Relationship between collaborative & embedded ontologies

Most commonly, the distinction is not made between collaborative and embedded ontologies, and the conflict takes place over a single ontology. I believe a more productive approach would be to separate the embedded and collaborative ontologies, and to allow each to develop in response to its own pressures. The differences that emerge can then inform a dialogue between abstract modelling and software delivery.

The use-case I will now present should demonstrate how this separate embedded ontology works in practice.

Embedded ontology use-case: A Sport app

To explain how embedded ontologies work, I will give an example: a software application in the sport domain.

First, to contrast, a collaborative sport ontology would be an attempt to model the domain of sport, comprising teams, sportspeople, disciplines, and perhaps sponsorship, venues, and fixtures.

Let us now assume I wanted to build a software system based around the domain of sport. My first question would be ‘what’s the MVP (minimal viable product)?’. I would hope to get an answer such as “A mobile app showing a list of teams in each English football division”. How would I go about building this? Well, good domain-driven design would lead me to use a good model and language to describe the parts of my software. I might have a /teams endpoint in an API. I might build a TeamList component in the JavaScript code. What I would not do is attempt to model sportspeople, disciplines, sponsorship, venues and fixtures. I don’t need these yet. What I need is a Minimal Viable Ontology.
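
To make this concrete, a minimal viable ontology for that MVP might contain little more than the following (a sketch, using a hypothetical namespace):

@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix sport: <http://example.org/ontologies/sport/> .

sport:Team a owl:Class ;
    rdfs:label "Team" .

sport:Division a owl:Class ;
    rdfs:label "Division" .

sport:memberOf a owl:ObjectProperty ;
    rdfs:domain sport:Team ;
    rdfs:range  sport:Division .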

This used to be a problem with SQL databases, where schemas were designed in advance of writing any software. Then agile came along, along with Lean, TDD, BDD and so on. These practices showed that upfront design usually resulted in the wrong architectural design, badly modelled data structures, features nobody wants and unnecessary complexity.

So, whilst I believe that collaborative ontologies serve an important role in fostering collaboration in linked data, I also believe that we need strategies to make these approaches compatible with embedded ontologies, and therefore with software engineering best practice.

Using ontologies in software

I want to now talk about some approaches to building embedded ontologies, used within software. As you will see, these contrast in a number of significant ways with approaches to building ontologies within academic communities.

These approaches are not intended to supersede collaborative ontologies, but instead, help reduce the friction between collaborative and embedded ontologies. And, therefore, between those building software using linked data, and those building models in academia.

Separate embedded & collaborative ontologies

A good first step is to get a distinct understanding of what constitutes the embedded ontology in use when compared with the collaborative ontology. This could be done in a number of ways:

  • No collaborative ontology until necessary: For me, this is the most important. I believe the preemptive publishing of collaborative ontologies can be counter-productive; if an ontology can remain private and embedded until the software has been proven to work, then it will be a higher-quality representation of the domain. This adheres to the software principle that the best design is the design that emerges in response to iterative requirements. From my software perspective, a published and shared ontology is like an Open API: the benefits of sharing can be huge, but you have potentially lost one of the most important capabilities in building software, the ability to change frequently and easily.
  • Modularisation: I would suggest this as a key second step. Breaking ontologies down into smaller parts, particularly in response to growth. This will reduce the rate at which these modules change individually when compared with the whole. This technique can avoid the necessity for mappings (see below).
  • Mapped: Where a clear mapping between the ontologies is expressed, perhaps using semantic equivalence (see the sketch below). This comes with the overhead of maintaining two ontologies and a mapping.
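
As a sketch of the mapped option, the correspondence could be expressed with OWL equivalence axioms (both namespaces here are hypothetical):

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix app: <http://example.org/ontologies/app-sport/> .     # embedded
@prefix col: <http://example.org/ontologies/shared-sport/> .  # collaborative

app:Team     owl:equivalentClass    col:SportsTeam .
app:memberOf owl:equivalentProperty col:isTeamInDivision .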

Modularised ontologies

One issue with ontologies is the common practice of attempting to model a single ‘domain’ with a single ontology. A selling point for this is the ability to “work within a single namespace”. I think that this practice is counter-productive, particularly with regards to aligning collaborative and embedded ontologies.

If a modularised approach is taken, a number of options become available. If an ontology can be broken into a number of parts, with references between them, the parts can be assigned metadata to indicate the following (a sketch follows the list):

  • Stability: For example, it should be possible to add a new ‘ontology module’ which is entirely experimental. In sport, this could be a ‘sponsorship’ module which is up for discussion in the community, but forms no part of the working software.
  • Module version: By separately versioning ontology modules, stability can be achieved within some modules, whilst others regularly change.
  • Equivalence/alternatives: Alternative or equivalent modules could exist if different perspectives on the same domain exist.
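
A sketch of what such module metadata could look like, using owl:versionInfo and a hypothetical ex:stability annotation property:

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <http://example.org/ontologies/> .

ex:sport-core a owl:Ontology ;
    owl:versionInfo "2.1" ;
    ex:stability "stable" .

ex:sport-sponsorship a owl:Ontology ;
    owl:versionInfo "0.1" ;
    ex:stability "experimental" ;   # discussed by the community, unused by software
    owl:imports ex:sport-core .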

Dependency management between modules

Whilst I am aware that work has been done in this area, it has not matured to the level seen with library dependency management in software (Maven, Ivy, etc.). It is straightforward to build dependency graphs between ontologies, but the more subtle version-specific dependency management is not readily available. More efficient tooling and a consensus on meta-ontologies in this area would lower the barrier to fine-grained modularisation of ontologies.
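
OWL 2 does offer a primitive here: owl:versionIRI allows an import to be pinned to a specific published version, though nothing resolves version ranges or conflicts the way Maven does. A sketch (IRIs hypothetical):

@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://example.org/ontologies/sport-fixtures> a owl:Ontology ;
    owl:versionIRI <http://example.org/ontologies/sport-fixtures/1.3> ;
    owl:imports <http://example.org/ontologies/sport-core/2.1> .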

Finally

It should now be clear that embedded ontologies are a by-product of software delivery. This, in my view, is exactly how it should work (as long as the embedded ontologies are curated and crafted as the software grows, using the same principles applied to collaborative ontology design). Using this approach, I would suggest that a higher-quality, more robust ontology may emerge. It will be an ontology that has been road-tested by the necessity to deliver software that serves a particular audience. Perhaps this particular audience will skew the perspective of the ontology, but any model is, after all, just a perspective.

Once this ontology has undergone this growth and subsequent road-testing, the dialogue between a fit-for-purpose embedded ontology and a collaborative ontology can get really interesting. Both will have much to offer, but I feel the benefits will flow in both directions.

I would be keen to hear if any of this reflects your own experiences, and, in particular, to get the perspective of individuals working exclusively on collaborative ontologies.

RDF Tree revisited: Developer-friendly JSON or XML from RDF graphs

In my previous post I talked about RDF Tree, an approach to building JSON or XML data from RDF graphs. Having received a number of useful comments, particularly from those involved with JSON-LD, I have revisited the approach and would like to present a revised version.

What is RDF Tree?

RDF Tree is an approach (and an in-development Java library implementing it) to producing developer-friendly serialisations of RDF graphs. It is not a serialisation format in itself like JSON-LD, but simply an approach to building predictable, stable JSON and XML representations of graph data.

The aims of this approach are as follows:

  • RDF Tree serialisations are non-semantic
    • Designed to power data-driven visual representation of data such as HTML
    • Designed to be lossy: the RDF graph cannot be recovered from the data
      • It is best practice to offer the data as RDF also for clients that require semantic data
  • RDF Tree is designed to be flexible
    • Whilst there are core principles, different rules, syntax and algorithms can be used to tailor the approach to a specific domain or use-case
  • RDF Trees are either single trees or multiple trees in an ordered list
    • Tree root(s) are indicated in the RDF using the tree ontology (see previous post)
    • For single trees, a specific root resource is known
    • For multiple trees, an ordered list of root resources is known (duplicates allowed)
    • RDF Trees can be built according to different rules
  • The four general rules for constructing the abstract tree from a graph structure are outlined in the previous blogpost.
  • As the rules can vary, there is no one canonical RDF Tree for a given graph input
  • Given a fixed set of rules, RDF Trees are produced as a function of a graph input
    • Rules include:
      • When to stop traversing the graph when building the tree
      • How to ‘canonicalise’ the resulting RDF Tree (e.g. deterministic property ordering)
  • The JSON or XML produced using this approach is largely indistinguishable from ‘vanilla’ JSON or XML
    • No superfluous meta or reference data is provided for extracting the original graph or understanding the specific semantics of the data
    • Designed for use with generic JSON or XML parsing libraries
  • Where naming conflicts exist, stable prefixes are used to distinguish between properties
  • Assumptions are made to optimise the approach
    • All data is considered single-language; different languages can be requested using the Accept-Language header
    • Datatype handling is minimal – datatypes are expected to be predictable
      • No datatypes in XML
      • JSON value types are respected
  • Where possible, the JSON syntax is aligned with JSON-LD, with the principal difference being the absence of the “@context” metadata
  • Inverse properties are included with the "^" prefix in JSON and an inverse="true" attribute in XML

What does RDF Tree look like?

For the given RDF Turtle input:

@prefix par:     <http://purl.org/vocab/participation/schema#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix geo:     <http://www.bbc.co.uk/ontologies/geopolitical/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix domain:  <http://www.bbc.co.uk/ontologies/domain/> .
@prefix oly:     <http://www.bbc.co.uk/ontologies/2012olympics/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sport: <http://www.bbc.co.uk/ontologies/sport/> .
@prefix tree:  <http://purl.org/rdf-tree/> .

tree:tree tree:root <http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id> .

<http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id> a sport:CompetitiveSportingOrganisation ;
      oly:territory <http://www.bbc.co.uk/things/territories/gb#id> ;
      domain:document <http://www.bbc.co.uk/sport/olympics/2012/countries/great-britain> ;
        domain:shortName "Great Britain & N. Ireland"^^xsd:string ;
      domain:name "Team GB"^^xsd:string .

<http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id> a sport:Person ;
      par:role_at <http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id> ;
      oly:dateOfBirth "1976-10-24"^^xsd:date ;
      oly:gender "M"^^xsd:string ;
      oly:height "172.0"^^xsd:float ;
      oly:weight "72.0"^^xsd:float ;
      domain:name "Ben Ainslie"^^xsd:string ;
      sport:competesIn <http://www.bbc.co.uk/things/2012/sam002#id>, <http://www.bbc.co.uk/things/2012/sam005#id> ;
      sport:discipline <http://www.bbc.co.uk/things/d65c5dce-f5e4-4340-931b-16ca1848d092#id> ;
      domain:document <http://www.bbc.co.uk/sport/olympics/2012/athletes/4e40ce40-b632-4a42-98d7-cf97067f7bf9>, <http://www.facebook.com/pages/Ben-Ainslie/108182689201922> ;
      foaf:familyName "Ainslie"^^xsd:string ;
      foaf:givenName "Ben"^^xsd:string .

<http://www.facebook.com/pages/Ben-Ainslie/108182689201922> a domain:Document ; 
   domain:documentType <http://www.bbc.co.uk/things/document-types/external> , <http://www.bbc.co.uk/things/document-types/facebook> .

<http://www.bbc.co.uk/sport/olympics/2012/athletes/4e40ce40-b632-4a42-98d7-cf97067f7bf9> a domain:Document ;
   domain:domain <http://www.bbc.co.uk/things/domains/olympics2012> ;
   domain:documentType <http://www.bbc.co.uk/things/document-types/bbc-document> .

<http://www.bbc.co.uk/things/2012/sam002#id> a sport:MedalCompetition ;
        domain:name "Sailing - Men's Finn"^^xsd:string ;
        domain:shortName "Men's Finn"^^xsd:string ;
      domain:externalId <urn:ioc2012:SAM002000> ;
      domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-finn> .

<http://www.bbc.co.uk/things/2012/sam005#id> a sport:MedalCompetition ;
      domain:name "Sailing - Men's 470"^^xsd:string ;
        domain:shortName "Men's 470"^^xsd:string ;
      domain:externalId <urn:ioc2012:SAM005000> ;
        domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-470> .

<http://www.bbc.co.uk/things/d65c5dce-f5e4-4340-931b-16ca1848d092#id> a sport:SportsDiscipline ;
      domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/sailing> ;
      domain:name "Sailing"^^xsd:string .

<http://www.bbc.co.uk/things/territories/gb#id> a geo:Territory ;
      domain:name "the United Kingdom of Great Britain and Northern Ireland"^^xsd:string ;
      geo:isInGroup <http://www.bbc.co.uk/things/81b14df8-f9d2-4dff-a676-43a1a9a5c0a5#id> .

<http://www.bbc.co.uk/things/81b14df8-f9d2-4dff-a676-43a1a9a5c0a5#id> a geo:Group ;
      domain:name "Europe"^^xsd:string ;
      geo:groupType <http://www.bbc.co.uk/things/group-types/bbc-news-geo-regions> .

The following JSON is produced:

{
  "@id": "http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id",
  "@type": "http://www.bbc.co.uk/ontologies/sport/Person",
  "dateOfBirth": "1976-10-24",
  "familyName": "Ainslie",
  "gender": "M",
  "givenName": "Ben",
  "height": 172.0,
  "name": "Ben Ainslie",
  "weight": 72.0,
  "competesIn": [
    {
      "@id": "http://www.bbc.co.uk/things/2012/sam002#id",
      "@type": "http://www.bbc.co.uk/ontologies/sport/MedalCompetition",
      "name": "Sailing - Men\u0027s Finn",
      "shortName": "Men\u0027s Finn",
      "document": "http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-finn",
      "externalId": "urn:ioc2012:SAM002000"
    },
    {
      "@id": "http://www.bbc.co.uk/things/2012/sam005#id",
      "@type": "http://www.bbc.co.uk/ontologies/sport/MedalCompetition",
      "name": "Sailing - Men\u0027s 470",
      "shortName": "Men\u0027s 470",
      "document": "http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-470",
      "externalId": "urn:ioc2012:SAM005000"
    }
  ],
  "discipline": {
    "@id": "http://www.bbc.co.uk/things/d65c5dce-f5e4-4340-931b-16ca1848d092#id",
    "@type": "http://www.bbc.co.uk/ontologies/sport/SportsDiscipline",
    "name": "Sailing",
    "document": "http://www.bbc.co.uk/sport/olympics/2012/sports/sailing"
  },
  "document": [
    {
      "@id": "http://www.facebook.com/pages/Ben-Ainslie/108182689201922",
      "@type": "http://www.bbc.co.uk/ontologies/domain/Document",
      "documentType": [
        "http://www.bbc.co.uk/things/document-types/facebook",
        "http://www.bbc.co.uk/things/document-types/external"
      ]
    },
    {
      "@id": "http://www.bbc.co.uk/sport/olympics/2012/athletes/4e40ce40-b632-4a42-98d7-cf97067f7bf9",
      "@type": "http://www.bbc.co.uk/ontologies/domain/Document",
      "documentType": "http://www.bbc.co.uk/things/document-types/bbc-document",
      "domain": "http://www.bbc.co.uk/things/domains/olympics2012"
    }
  ],
  "role_at": {
    "@id": "http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id",
    "@type": "http://www.bbc.co.uk/ontologies/sport/CompetitiveSportingOrganisation",
    "name": "Team GB",
    "shortName": "Great Britain \u0026 N. Ireland",
    "document": "http://www.bbc.co.uk/sport/olympics/2012/countries/great-britain",
    "territory": {
      "@id": "http://www.bbc.co.uk/things/territories/gb#id",
      "@type": "http://www.bbc.co.uk/ontologies/geopolitical/Territory",
      "name": "the United Kingdom of Great Britain and Northern Ireland",
      "isInGroup": {
        "@id": "http://www.bbc.co.uk/things/81b14df8-f9d2-4dff-a676-43a1a9a5c0a5#id",
        "@type": "http://www.bbc.co.uk/ontologies/geopolitical/Group",
        "name": "Europe",
        "groupType": "http://www.bbc.co.uk/things/group-types/bbc-news-geo-regions"
      }
    }
  }
}

And the following XML is produced:

<Person id="http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id">
  <dateOfBirth>1976-10-24</dateOfBirth>
  <familyName>Ainslie</familyName>
  <gender>M</gender>
  <givenName>Ben</givenName>
  <height>172.0</height>
  <name>Ben Ainslie</name>
  <weight>72.0</weight>
  <competesIn>
    <MedalCompetition id="http://www.bbc.co.uk/things/2012/sam002#id">
      <name>Sailing - Men's Finn</name>
      <shortName>Men's Finn</shortName>
      <document id="http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-finn"/>
      <externalId id="urn:ioc2012:SAM002000"/>
    </MedalCompetition>
  </competesIn>
  <competesIn>
    <MedalCompetition id="http://www.bbc.co.uk/things/2012/sam005#id">
      <name>Sailing - Men's 470</name>
      <shortName>Men's 470</shortName>
      <document id="http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-470"/>
      <externalId id="urn:ioc2012:SAM005000"/>
    </MedalCompetition>
  </competesIn>
  <discipline>
    <SportsDiscipline id="http://www.bbc.co.uk/things/d65c5dce-f5e4-4340-931b-16ca1848d092#id">
      <name>Sailing</name>
      <document id="http://www.bbc.co.uk/sport/olympics/2012/sports/sailing"/>
    </SportsDiscipline>
  </discipline>
  <document>
    <Document id="http://www.facebook.com/pages/Ben-Ainslie/108182689201922">
      <documentType id="http://www.bbc.co.uk/things/document-types/facebook"/>
      <documentType id="http://www.bbc.co.uk/things/document-types/external"/>
    </Document>
  </document>
  <document>
    <Document id="http://www.bbc.co.uk/sport/olympics/2012/athletes/4e40ce40-b632-4a42-98d7-cf97067f7bf9">
      <documentType id="http://www.bbc.co.uk/things/document-types/bbc-document"/>
      <domain id="http://www.bbc.co.uk/things/domains/olympics2012"/>
    </Document>
  </document>
  <role_at>
    <CompetitiveSportingOrganisation id="http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id">
      <name>Team GB</name>
      <shortName>Great Britain &amp; N. Ireland</shortName>
      <document id="http://www.bbc.co.uk/sport/olympics/2012/countries/great-britain"/>
      <territory>
        <Territory id="http://www.bbc.co.uk/things/territories/gb#id">
          <name>the United Kingdom of Great Britain and Northern Ireland</name>
          <isInGroup>
            <Group id="http://www.bbc.co.uk/things/81b14df8-f9d2-4dff-a676-43a1a9a5c0a5#id">
              <name>Europe</name>
              <groupType id="http://www.bbc.co.uk/things/group-types/bbc-news-geo-regions"/>
            </Group>
          </isInGroup>
        </Territory>
      </territory>
    </CompetitiveSportingOrganisation>
  </role_at>
</Person>

Property names

RDF Tree uses the property’s local name as the JSON field name. If a name conflict exists (more than one IRI exists for the same local name), then the IRI prefix is used to distinguish the properties, e.g. “foaf:name” where another “name” exists. A namespace priority list is used to determine which IRI can be expressed as just the local name, and which requires the prefix.

Essentially, no two properties can have the same name. However, property names can vary depending on the presence of other properties with the same local name.

The same approach is used in the XML element names, except the separator character is “-”, resulting in disambiguated element names like <foaf-name/>.
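
For illustration (the values are hypothetical), if a resource carried both domain:name and foaf:name, and the domain namespace had priority, the JSON would contain:

{
  "name": "Ben Ainslie",
  "foaf:name": "Sir Ben Ainslie"
}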

Stable property set

Even though naming inconsistencies will be rare, the potential for them can be reduced further by adding properties to a list of ‘stable’ IRIs, each with a prefix and a unique local name. This list contains the definitive set of unambiguous local names. It is never visible to users of the data, and is simply there to ensure the stability of the data.

RDF Tree: Developer-friendly graph data

I have now written a follow-up to this post here.

I want to make RDF data more developer-friendly. When you show RDF to a typical developer who has previously been used to simple JSON or XML structures, they find the format confusing and hard to code with. This is primarily because the data is a graph, and graphs don’t fit well with the tree structures of JSON and XML.

I have seen this problem tackled through the use of libraries that can parse and interpret the graph data, and present an easier interface to the developer. Whilst these have been useful, I still think there are some fundamental problems. JSON-LD also offers a solution to this problem, but is not sufficiently lightweight for environments where data structures change and develop regularly. I compare my approach with JSON-LD at the end of the post.

The problems described below are based on a particular problem I am trying to solve: I want to build developer-friendly APIs over data modelled as RDF, and obtained using SPARQL CONSTRUCT queries.

The problems fall into two categories of API call:

  1. Give me information about X (one thing)
  2. Give me information about all X1, X2, etc., where each meets a certain condition (list)

Problems with a graph representing one thing

Problem: I don’t know what X is

If I call an API to find out about X, and get back the following RDF:

<uri:x> foaf:name "Bob Smith" .

Then it’s obvious what X is.

If I get this RDF back:

<uri:maybeX> foaf:name "Bob Smith" .
<uri:alsoMaybeX> foaf:name "Jane Smith" .
<uri:maybeX> foaf:knows <uri:alsoMaybeX> .

In this example, I have no idea what X is. This problem is usually solved using a heuristic, such as “the only resource of type T”. But these heuristics can be brittle, and changes to API data could easily break the heuristic.

I propose to use a tree ontology to indicate the intended overall subject of the tree:

tree:tree tree:root <uri:x> .

Problems with a graph representing a list

Problem: ordering cannot be easily expressed

If I call an API to get a list of things, then there is no commonly understood way to indicate the order of those things. The problem with general approaches to ordering in RDF is that they don’t clearly indicate that the list specifically refers to the ordering of the API response. An API response is typically a dynamic ordering based on the current data, rather than an intrinsic ordering such as the ordering of events. An example would be “Page 4 of all Team GB athletes, ordered by surname, then first name”. In this example, the ordering is dynamic, and only known after the query data has been returned.

I propose to use a tree ontology to indicate a list of nodes in the graph that represent the API results. For example:

tree:tree tree:page "4"^^xsd:int .
tree:tree tree:first <uri:a> .
<uri:a> tree:next <uri:b> .
<uri:b> tree:next <uri:c> .
<uri:a> foaf:name "Person 1 in list" .
<uri:b> foaf:name "Person 2 in list" .
<uri:c> foaf:name "Person 3 in list" .

The additional tree ontology data above can be easily inserted into API responses by adding additional triples to the top of a SPARQL CONSTRUCT query.
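
For the single-tree case, a sketch of such a CONSTRUCT query might look like this (for the list case, the tree:first/tree:next triples depend on the computed ordering, so they would typically be added once the ordering is known):

PREFIX tree: <http://purl.org/rdf-tree/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

CONSTRUCT {
  tree:tree tree:root <uri:x> .   # the additional tree triple
  <uri:x> foaf:name ?name .
  <uri:x> foaf:knows ?friend .
}
WHERE {
  <uri:x> foaf:name ?name .
  OPTIONAL { <uri:x> foaf:knows ?friend }
}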

Using the RDF Tree ontology to build developer-friendly data

Using the information from the tree ontology, it is possible to build developer-friendly serialisations of the RDF data that are more directly usable for typical use-cases, such as building tabular HTML or other visual representations of the data.

Building the RDF Tree from an RDF graph

The RDF tree is built by starting at the root (or multiple roots, for lists) and traversing the graph until a rule causes the traversal to halt. The graph is traversed breadth-first, following predicates in both directions. I used the following three halting rules, which seem to make effective trees for the use-cases I have looked at:

  1. If a parent node is present in the tree with the same resource as one traversed to, halt the traversal
  2. Do not follow the rdf:type property in the inverse direction
  3. If a list item resource is defined with the same resource as one traversed to, halt the traversal after one additional level in the tree

The following two examples show JSON RDF Tree serialisations of Olympics data, applying the rules above. The original Turtle data is shown, along with the resultant, proposed RDF Tree serialisation. The ‘^’ symbol is used to represent triples followed in the inverse direction of the predicate.

One athlete as RDF Turtle

@prefix par:     <http://purl.org/vocab/participation/schema#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix geo:     <http://www.bbc.co.uk/ontologies/geopolitical/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix domain:  <http://www.bbc.co.uk/ontologies/domain/> .
@prefix oly:     <http://www.bbc.co.uk/ontologies/2012olympics/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sport: <http://www.bbc.co.uk/ontologies/sport/> .
@prefix tree:  <http://purl.org/rdf-tree/> .

tree:tree tree:start <http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id> .

<http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id> a sport:CompetitiveSportingOrganisation ;
      oly:territory <http://www.bbc.co.uk/things/territories/gb#id> ;
      domain:document <http://www.bbc.co.uk/sport/olympics/2012/countries/great-britain> ;
        domain:shortName "Great Britain & N. Ireland"^^xsd:string ;
      domain:name "Team GB"^^xsd:string .

<http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id> a sport:Person ;
      par:role_at <http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id> ;
      oly:dateOfBirth "1976-10-24"^^xsd:date ;
      oly:gender "M"^^xsd:string ;
      oly:height "172.0"^^xsd:float ;
      oly:weight "72.0"^^xsd:float ;
      domain:name "Ben Ainslie"^^xsd:string ;
      sport:competesIn <http://www.bbc.co.uk/things/2012/sam002#id>, <http://www.bbc.co.uk/things/2012/sam005#id> ;
      sport:discipline <http://www.bbc.co.uk/things/d65c5dce-f5e4-4340-931b-16ca1848d092#id> ;
      domain:document <http://www.bbc.co.uk/sport/olympics/2012/athletes/4e40ce40-b632-4a42-98d7-cf97067f7bf9>, <http://www.facebook.com/pages/Ben-Ainslie/108182689201922> ;
      foaf:familyName "Ainslie"^^xsd:string ;
      foaf:givenName "Ben"^^xsd:string .

<http://www.facebook.com/pages/Ben-Ainslie/108182689201922> a domain:Document ; 
   domain:documentType <http://www.bbc.co.uk/things/document-types/external> , <http://www.bbc.co.uk/things/document-types/facebook> .

<http://www.bbc.co.uk/sport/olympics/2012/athletes/4e40ce40-b632-4a42-98d7-cf97067f7bf9> a domain:Document ;
   domain:domain <http://www.bbc.co.uk/things/domains/olympics2012> ;
   domain:documentType <http://www.bbc.co.uk/things/document-types/bbc-document> .

<http://www.bbc.co.uk/things/2012/sam002#id> a sport:MedalCompetition ;
        domain:name "Sailing - Men's Finn"^^xsd:string ;
        domain:shortName "Men's Finn"^^xsd:string ;
      domain:externalId <urn:ioc2012:SAM002000> ;
      domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-finn> .

<http://www.bbc.co.uk/things/2012/sam005#id> a sport:MedalCompetition ;
      domain:name "Sailing - Men's 470"^^xsd:string ;
        domain:shortName "Men's 470"^^xsd:string ;
      domain:externalId <urn:ioc2012:SAM005000> ;
        domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-470> .

<http://www.bbc.co.uk/things/d65c5dce-f5e4-4340-931b-16ca1848d092#id> a sport:SportsDiscipline ;
      domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/sailing> ;
      domain:name "Sailing"^^xsd:string .

<http://www.bbc.co.uk/things/territories/gb#id> a geo:Territory ;
      domain:name "the United Kingdom of Great Britain and Northern Ireland"^^xsd:string ;
      geo:isInGroup <http://www.bbc.co.uk/things/81b14df8-f9d2-4dff-a676-43a1a9a5c0a5#id> .

<http://www.bbc.co.uk/things/81b14df8-f9d2-4dff-a676-43a1a9a5c0a5#id> a geo:Group ;
      domain:name "Europe"^^xsd:string ;
      geo:groupType <http://www.bbc.co.uk/things/group-types/bbc-news-geo-regions> .

One athlete RDF Tree JSON

{
  "rdf:about": "http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id",
  "domain:name": "Ben Ainslie",
  "foaf:familyName": "Ainslie",
  "foaf:givenName": "Ben",
  "oly:dateOfBirth": "1976-10-24",
  "oly:gender": "M",
  "oly:height": "172.0",
  "oly:weight": "72.0",
  "domain:document": [
    {
      "rdf:about": "http://www.facebook.com/pages/Ben-Ainslie/108182689201922",
      "domain:documentType": [
        {
          "rdf:about": "http://www.bbc.co.uk/things/document-types/facebook"
        },
        {
          "rdf:about": "http://www.bbc.co.uk/things/document-types/external"
        }
      ],
      "rdf:type": {
        "rdf:about": "http://www.bbc.co.uk/ontologies/domain/Document"
      }
    },
    {
      "rdf:about": "http://www.bbc.co.uk/sport/olympics/2012/athletes/4e40ce40-b632-4a42-98d7-cf97067f7bf9",
      "domain:documentType": {
        "rdf:about": "http://www.bbc.co.uk/things/document-types/bbc-document"
      },
      "domain:domain": {
        "rdf:about": "http://www.bbc.co.uk/things/domains/olympics2012"
      },
      "rdf:type": {
        "rdf:about": "http://www.bbc.co.uk/ontologies/domain/Document"
      }
    }
  ],
  "par:role_at": {
    "rdf:about": "http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id",
    "domain:name": "Team GB",
    "domain:shortName": "Great Britain \u0026 N. Ireland",
    "domain:document": {
      "rdf:about": "http://www.bbc.co.uk/sport/olympics/2012/countries/great-britain"
    },
    "oly:territory": {
      "rdf:about": "http://www.bbc.co.uk/things/territories/gb#id",
      "domain:name": "the United Kingdom of Great Britain and Northern Ireland",
      "geo:isInGroup": {
        "rdf:about": "http://www.bbc.co.uk/things/81b14df8-f9d2-4dff-a676-43a1a9a5c0a5#id",
        "domain:name": "Europe",
        "geo:groupType": {
          "rdf:about": "http://www.bbc.co.uk/things/group-types/bbc-news-geo-regions"
        },
        "rdf:type": {
          "rdf:about": "http://www.bbc.co.uk/ontologies/geopolitical/Group"
        }
      },
      "rdf:type": {
        "rdf:about": "http://www.bbc.co.uk/ontologies/geopolitical/Territory"
      }
    },
    "rdf:type": {
      "rdf:about": "http://www.bbc.co.uk/ontologies/sport/CompetitiveSportingOrganisation"
    }
  },
  "sport:competesIn": [
    {
      "rdf:about": "http://www.bbc.co.uk/things/2012/sam002#id",
      "domain:name": "Sailing - Men\u0027s Finn",
      "domain:shortName": "Men\u0027s Finn",
      "domain:document": {
        "rdf:about": "http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-finn"
      },
      "domain:externalId": {
        "rdf:about": "urn:ioc2012:SAM002000"
      },
      "rdf:type": {
        "rdf:about": "http://www.bbc.co.uk/ontologies/sport/MedalCompetition"
      }
    },
    {
      "rdf:about": "http://www.bbc.co.uk/things/2012/sam005#id",
      "domain:name": "Sailing - Men\u0027s 470",
      "domain:shortName": "Men\u0027s 470",
      "domain:document": {
        "rdf:about": "http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-470"
      },
      "domain:externalId": {
        "rdf:about": "urn:ioc2012:SAM005000"
      },
      "rdf:type": {
        "rdf:about": "http://www.bbc.co.uk/ontologies/sport/MedalCompetition"
      }
    }
  ],
  "sport:discipline": {
    "rdf:about": "http://www.bbc.co.uk/things/d65c5dce-f5e4-4340-931b-16ca1848d092#id",
    "domain:name": "Sailing",
    "domain:document": {
      "rdf:about": "http://www.bbc.co.uk/sport/olympics/2012/sports/sailing"
    },
    "rdf:type": {
      "rdf:about": "http://www.bbc.co.uk/ontologies/sport/SportsDiscipline"
    }
  },
  "rdf:type": {
    "rdf:about": "http://www.bbc.co.uk/ontologies/sport/Person"
  }
}

List of three athletes as RDF Turtle

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix par: <http://purl.org/vocab/participation/schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix domain: <http://www.bbc.co.uk/ontologies/domain/> .
@prefix sport: <http://www.bbc.co.uk/ontologies/sport/> .
@prefix oly: <http://www.bbc.co.uk/ontologies/2012olympics/> .
@prefix geo-pol: <http://www.bbc.co.uk/ontologies/geopolitical/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix tree:  <http://purl.org/rdf-tree/> .

tree:tree tree:first <http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id> .
<http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id> tree:next <http://www.bbc.co.uk/things/f2798806-4e54-47ff-a1ec-32beefde2058#id> .
<http://www.bbc.co.uk/things/f2798806-4e54-47ff-a1ec-32beefde2058#id> tree:next <http://www.bbc.co.uk/things/82f5db84-0591-49ee-b6f4-a1d26e9381fb#id> .

<http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id> a sport:Person ;
    oly:isMedallistFor <http://www.bbc.co.uk/things/3143ecc6-ffa4-4446-9d13-ee60801c4881#id> ;
    domain:name "Ben Ainslie"^^xsd:string ;
    domain:canonicalName "Ben Ainslie"^^xsd:string ;
    foaf:givenName "Ben"^^xsd:string ;
    foaf:familyName "Ainslie"^^xsd:string ;
    oly:gender "M"^^xsd:string ;
    sport:discipline <http://www.bbc.co.uk/things/d65c5dce-f5e4-4340-931b-16ca1848d092#id> ;
    domain:document <http://www.bbc.co.uk/sport/olympics/2012/athletes/4e40ce40-b632-4a42-98d7-cf97067f7bf9> .

<http://www.bbc.co.uk/things/d65c5dce-f5e4-4340-931b-16ca1848d092#id> a sport:SportsDiscipline ;
  domain:name "Sailing"^^xsd:string ;
  domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/sailing> .

_:node166n6c6vlx37 a sport:CompetesForRole ;
    par:holder <http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id> ;
    par:role_at <http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id> .

<http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id> a sport:CompetitiveSportingOrganisation ;
  domain:name "Team GB"^^xsd:string ;
  domain:shortName "Great Britain & N. Ireland"^^xsd:string ;
  domain:document <http://www.bbc.co.uk/sport/olympics/2012/countries/great-britain> .

<http://www.bbc.co.uk/things/8d6ae957-d338-442b-99d7-190f20b78dd4#id> oly:oneToWatch <http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id> .

<http://www.bbc.co.uk/things/82f5db84-0591-49ee-b6f4-a1d26e9381fb#id> a sport:Person ;
  domain:name "Usain Bolt"^^xsd:string ;
  domain:canonicalName "Usain Bolt"^^xsd:string ;
    foaf:givenName "Usain"^^xsd:string ;
    foaf:familyName "Bolt"^^xsd:string ;
  oly:gender "M"^^xsd:string ;
  oly:worldOlympicDream "true"^^xsd:boolean ;
  sport:discipline <http://www.bbc.co.uk/things/b3a086df-ab42-2b44-be8b-76b600bfcdce#id> ;
  domain:document <http://www.bbc.co.uk/sport/olympics/2012/athletes/82f5db84-0591-49ee-b6f4-a1d26e9381fb> .

<http://www.bbc.co.uk/things/b3a086df-ab42-2b44-be8b-76b600bfcdce#id> a sport:SportsDiscipline ;
    domain:name "Athletics"^^xsd:string ;
  domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/athletics> .

_:node166n6c6vlx113 a sport:CompetesForRole ;
    par:holder <http://www.bbc.co.uk/things/82f5db84-0591-49ee-b6f4-a1d26e9381fb#id> ;
    par:role_at <http://www.bbc.co.uk/things/76369f3b-65a0-4e69-8c52-859adfdefa49#id> .

<http://www.bbc.co.uk/things/76369f3b-65a0-4e69-8c52-859adfdefa49#id> a sport:CompetitiveSportingOrganisation ;
    domain:name "Jamaica"^^xsd:string ;
  domain:shortName "Jamaica"^^xsd:string ;
  domain:externalId <urn:ioc2012:jam> ;
  domain:document <http://www.bbc.co.uk/sport/olympics/2012/countries/jamaica> .

<http://www.bbc.co.uk/things/f2798806-4e54-47ff-a1ec-32beefde2058#id> a sport:Person ;
  oly:isMedallistFor <http://www.bbc.co.uk/things/3143ecc6-ffa4-4446-9d13-ee60801c4881#id> ;
  domain:name "Majlinda Kelmendi"^^xsd:string ;
  domain:canonicalName "Majlinda Kelmendi"^^xsd:string ;
    foaf:givenName "Majlinda"^^xsd:string ;
    foaf:familyName "Kelmendi"^^xsd:string ;
  oly:gender "W"^^xsd:string ;
  sport:discipline <http://www.bbc.co.uk/things/654f550c-0c2d-2341-a8f5-66e3a9ba28ba#id> ;
  domain:document <http://www.bbc.co.uk/sport/olympics/2012/athletes/f2798806-4e54-47ff-a1ec-32beefde2058> .

<http://www.bbc.co.uk/things/654f550c-0c2d-2341-a8f5-66e3a9ba28ba#id> a sport:SportsDiscipline ;
    domain:name "Judo"^^xsd:string ;
  domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/judo> .

_:node166n6c6vlx445 a sport:CompetesForRole ;
    par:holder <http://www.bbc.co.uk/things/f2798806-4e54-47ff-a1ec-32beefde2058#id> ;
    par:role_at <http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id> .

List of three athletes RDF Tree JSON

Note the repeating country nodes: these are useful for tabular representations.

[
  {
    "rdf:about": "http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id",
    "domain:canonicalName": "Ben Ainslie",
    "domain:name": "Ben Ainslie",
    "foaf:familyName": "Ainslie",
    "foaf:givenName": "Ben",
    "oly:gender": "M",
    "^oly:oneToWatch": {
      "rdf:about": "http://www.bbc.co.uk/things/8d6ae957-d338-442b-99d7-190f20b78dd4#id"
    },
    "^par:holder": {
      "par:role_at": {
        "rdf:about": "http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id",
        "domain:name": "Team GB",
        "domain:shortName": "Great Britain \u0026 N. Ireland",
        "^par:role_at": {
          "par:holder": {
            "rdf:about": "http://www.bbc.co.uk/things/f2798806-4e54-47ff-a1ec-32beefde2058#id"
          },
          "rdf:type": {
            "rdf:about": "http://www.bbc.co.uk/ontologies/sport/CompetesForRole"
          }
        },
        "domain:document": {
          "rdf:about": "http://www.bbc.co.uk/sport/olympics/2012/countries/great-britain"
        },
        "rdf:type": {
          "rdf:about": "http://www.bbc.co.uk/ontologies/sport/CompetitiveSportingOrganisation"
        }
      },
      "rdf:type": {
        "rdf:about": "http://www.bbc.co.uk/ontologies/sport/CompetesForRole"
      }
    },
    "domain:document": {
      "rdf:about": "http://www.bbc.co.uk/sport/olympics/2012/athletes/4e40ce40-b632-4a42-98d7-cf97067f7bf9"
    },
    "oly:isMedallistFor": {
      "rdf:about": "http://www.bbc.co.uk/things/3143ecc6-ffa4-4446-9d13-ee60801c4881#id",
      "^oly:isMedallistFor": {
        "rdf:about": "http://www.bbc.co.uk/things/f2798806-4e54-47ff-a1ec-32beefde2058#id"
      }
    },
    "sport:discipline": {
      "rdf:about": "http://www.bbc.co.uk/things/d65c5dce-f5e4-4340-931b-16ca1848d092#id",
      "domain:name": "Sailing",
      "domain:document": {
        "rdf:about": "http://www.bbc.co.uk/sport/olympics/2012/sports/sailing"
      },
      "rdf:type": {
        "rdf:about": "http://www.bbc.co.uk/ontologies/sport/SportsDiscipline"
      }
    },
    "rdf:type": {
      "rdf:about": "http://www.bbc.co.uk/ontologies/sport/Person"
    }
  },
  {
    "rdf:about": "http://www.bbc.co.uk/things/f2798806-4e54-47ff-a1ec-32beefde2058#id",
    "domain:canonicalName": "Majlinda Kelmendi",
    "domain:name": "Majlinda Kelmendi",
    "foaf:familyName": "Kelmendi",
    "foaf:givenName": "Majlinda",
    "oly:gender": "W",
    "^par:holder": {
      "par:role_at": {
        "rdf:about": "http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id",
        "domain:name": "Team GB",
        "domain:shortName": "Great Britain \u0026 N. Ireland",
        "^par:role_at": {
          "par:holder": {
            "rdf:about": "http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id"
          },
          "rdf:type": {
            "rdf:about": "http://www.bbc.co.uk/ontologies/sport/CompetesForRole"
          }
        },
        "domain:document": {
          "rdf:about": "http://www.bbc.co.uk/sport/olympics/2012/countries/great-britain"
        },
        "rdf:type": {
          "rdf:about": "http://www.bbc.co.uk/ontologies/sport/CompetitiveSportingOrganisation"
        }
      },
      "rdf:type": {
        "rdf:about": "http://www.bbc.co.uk/ontologies/sport/CompetesForRole"
      }
    },
    "domain:document": {
      "rdf:about": "http://www.bbc.co.uk/sport/olympics/2012/athletes/f2798806-4e54-47ff-a1ec-32beefde2058"
    },
    "oly:isMedallistFor": {
      "rdf:about": "http://www.bbc.co.uk/things/3143ecc6-ffa4-4446-9d13-ee60801c4881#id",
      "^oly:isMedallistFor": {
        "rdf:about": "http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id"
      }
    },
    "sport:discipline": {
      "rdf:about": "http://www.bbc.co.uk/things/654f550c-0c2d-2341-a8f5-66e3a9ba28ba#id",
      "domain:name": "Judo",
      "domain:document": {
        "rdf:about": "http://www.bbc.co.uk/sport/olympics/2012/sports/judo"
      },
      "rdf:type": {
        "rdf:about": "http://www.bbc.co.uk/ontologies/sport/SportsDiscipline"
      }
    },
    "rdf:type": {
      "rdf:about": "http://www.bbc.co.uk/ontologies/sport/Person"
    }
  },
  {
    "rdf:about": "http://www.bbc.co.uk/things/82f5db84-0591-49ee-b6f4-a1d26e9381fb#id",
    "domain:canonicalName": "Usain Bolt",
    "domain:name": "Usain Bolt",
    "foaf:familyName": "Bolt",
    "foaf:givenName": "Usain",
    "oly:gender": "M",
    "oly:worldOlympicDream": "true",
    "^par:holder": {
      "par:role_at": {
        "rdf:about": "http://www.bbc.co.uk/things/76369f3b-65a0-4e69-8c52-859adfdefa49#id",
        "domain:name": "Jamaica",
        "domain:shortName": "Jamaica",
        "domain:document": {
          "rdf:about": "http://www.bbc.co.uk/sport/olympics/2012/countries/jamaica"
        },
        "domain:externalId": {
          "rdf:about": "urn:ioc2012:jam"
        },
        "rdf:type": {
          "rdf:about": "http://www.bbc.co.uk/ontologies/sport/CompetitiveSportingOrganisation"
        }
      },
      "rdf:type": {
        "rdf:about": "http://www.bbc.co.uk/ontologies/sport/CompetesForRole"
      }
    },
    "domain:document": {
      "rdf:about": "http://www.bbc.co.uk/sport/olympics/2012/athletes/82f5db84-0591-49ee-b6f4-a1d26e9381fb"
    },
    "sport:discipline": {
      "rdf:about": "http://www.bbc.co.uk/things/b3a086df-ab42-2b44-be8b-76b600bfcdce#id",
      "domain:name": "Athletics",
      "domain:document": {
        "rdf:about": "http://www.bbc.co.uk/sport/olympics/2012/sports/athletics"
      },
      "rdf:type": {
        "rdf:about": "http://www.bbc.co.uk/ontologies/sport/SportsDiscipline"
      }
    },
    "rdf:type": {
      "rdf:about": "http://www.bbc.co.uk/ontologies/sport/Person"
    }
  }
]

Syntax and feedback

I have written a library to perform RDF to RDF Tree conversion, and I am looking for feedback on the syntax and general approach before I finalise the details.

The design of RDF Tree is intended to be lossless from an information perspective, while maximising practical use by developers.

Comparison to JSON-LD

The approach above is a lightweight alternative to JSON-LD. The framing approach used by JSON-LD to assure consistent and predictable access to data is replaced with a more dynamic approach, where the tree root is defined by the RDF API response. I have identified the following differences to JSON-LD:

  • RDF Tree is focused on developer simplicity: this includes the creators and consumers of APIs. An example of this would be the decision to continue to use prefixes for predicates. This makes the data easy to produce, and avoids the complexities of ‘contextualisation’ of shorter predicate names.
  • RDF Tree is not intended to be re-used as RDF data. APIs that produce RDF Tree would need to also produce RDF. Those interested in tree-like or tabular data would use RDF Tree, those interested in the graph data would use the RDF.
  • As the ontology structures and SPARQL queries change, little or no maintenance is required to continue to support predictable response data.
  • Any tree structured data can be produced (XML can be built from the same abstract tree)

Updating RDF data using objects

A scenario that repeatedly arises when working with triple stores to store and query RDF data is how to handle updates to the data. The triple store can effectively be regarded as a single, large graph, where updates come in the following forms:

  • Add this new thing
  • Remove this old thing
  • Update thing X

Essentially, the CUD of CRUD, where the R is covered by SPARQL SELECT, DESCRIBE, CONSTRUCT and ASK.

The problem comes with defining the ‘thing’. When considering the ‘thing’, for example a person or place, the underlying philosophies of Linked Data lie in contrast to the aim of clearly defining the properties and structure of the ‘thing’:

  • Anyone can say anything about anything
  • Data can be sparse
  • Data can be inconsistent, or contradictory

The additional complexity of clearly defining the shape of a ‘thing’, I think, explains why approaches to RDF updates often avoid the CRUD approach, and take more direct, but problematic, approaches.

Problematic approaches to RDF updates

Heavy lifting

The ingest of huge datasets using one-off processes such as data-dumps or R2RML.

This approach will not scale to frequently updating systems. It might be workable for data warehousing, data analysis or systems with very static data needs, but fast-moving systems with frequent changes are disrupted by large updates. This is particularly so with the majority of triple store implementations, where the ingest of large datasets is an effective block on all other updates.

Fine grain

Using RDF change-sets, SPARQL 1.1 update queries or equivalent to alter the state of a triple store in increments.

The problem here is that the process must be treated like a dbdeploy script: each update must be applied, in a precise order, to a known state. A change-set can corrupt the data if applied before or after it is supposed to be. In practice, this makes this approach to updates brittle and error-prone. One mistake and you’re in rollback-hell, and managing this across multiple environments, each with differing states, is overly complex.

Three approaches to RDF updates

A solution – object updates

One solution is to take a step back from the data, and think about how you might want the updates to happen from the perspective of the domain. For example, take MusicBrainz data, which will follow a very particular update pattern:

  • Updates clustered around musical artists or groups will be common
  • Updates spread widely and thinly across musical artists and groups will be rare

Therefore, an update feed that provides the complete information for a musical artist or group, each time any part of it is updated, will be a reasonably efficient method of performing updates. The overhead is that a decision somehow needs to be made about how to chop up the data within the domain of music, raising questions such as:

  • Which ‘things’ will be chosen as the unit-of-update?
  • Which properties belong to which ‘things’?

To be efficient, it is important to update any given property along with the ‘thing’ with which it is most likely to be updated.

What is described here is essentially a form of object-graph-mapping, where the ‘object’ is synonymous with the previously discussed ‘thing’. The domain is divided up into a set of object classes, where the data is deterministically assigned to instances of these object classes.

To continue with the music example, the domain could be split into artists and compositions, with a property like ‘compose’, connecting artists with compositions, considered part of the composition object. This is because the most likely update scenario is that a new composition is added to the database, with the artist assigned when the composition is added. A less likely scenario is that all the compositions an artist produced will be added or updated along with the artist’s basic information, like name and date of birth. Clearly this will sometimes be a subjective decision.
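
A hedged sketch of this division in Turtle, using a hypothetical ex: vocabulary (none of these URIs come from MusicBrainz):

@prefix ex:  <http://example.org/music/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# The artist object: basic information only
<http://example.org/artist/42>
    a ex:Artist ;
    ex:name "Example Artist" ;
    ex:dateOfBirth "1970-01-01"^^xsd:date .

# The composition object: its own properties...
<http://example.org/composition/7>
    a ex:Composition ;
    ex:title "Example Composition" .

# ...plus the ‘compose’ triple connecting the two, assigned to the
# composition object because it is most likely to change when a
# composition is added or updated
<http://example.org/artist/42> ex:compose <http://example.org/composition/7> .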

Advantages of object updates

  • A sub-set of the data can be held by a system consuming the updates
  • If the updates are processed out-of-order, or not from the beginning
    • The resulting data is likely to be eventually consistent
    • The resulting data remains structurally sound
  • Updates scale regardless of the total size of the complete dataset

Disadvantages of object updates

  • The data has to be divided up into object classes, which need to be designed and maintained
  • Additional data is required to perform an update – it must be clear which data in the triple store needs to be removed before applying the new set of data (see below)

The advantages above show that this approach solves the problems with both the heavy-lifting and fine-grained approaches.

What would an object update feed look like?

An update feed, as described above, would be simple:

  • Each update would be an (ideally) small graph representing one ‘object’, such as an individual artist or recording
  • Metadata would be provided to indicate required additional information such as:
    • Type of ‘object’
    • Who made the change
    • When the change was made
  • Allowing activities such as:
    • Auditing
    • Filtering
    • Conversion
    • Validation
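
To illustrate, a single entry in such a feed might pair the object graph with its metadata like this (a hypothetical sketch in TriG; the meta: predicates are invented):

@prefix ex:   <http://example.org/music/> .
@prefix meta: <http://example.org/feed/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# The object: the complete, current state of one artist
GRAPH <http://example.org/graph/artist/42> {
    <http://example.org/artist/42> a ex:Artist ;
        ex:name "Example Artist" .
}

# Metadata about the update itself, supporting auditing and filtering
<http://example.org/graph/artist/42>
    meta:objectType ex:Artist ;
    meta:changedBy  <http://example.org/user/editor-1> ;
    meta:changedAt  "2013-06-01T12:00:00Z"^^xsd:dateTime .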

Performing updates with an object update feed

As indicated in the disadvantages above, in order to update an object it is necessary to know exactly which triples need to be removed – the triples that describe the previous version of the object. I have seen two approaches here. The first is to use a SPARQL query to pull in the triples for removal; this approach is just as brittle as the fine-grained approach, so I would not recommend it. A better alternative is to use context, or named graphs – where each object resides in its own context, which can be removed on update.
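
In SPARQL 1.1 Update terms, replacing an object held in its own named graph might look something like this (a sketch; the URIs are again hypothetical):

PREFIX ex: <http://example.org/music/>

# Remove the previous version of the object in its entirety...
DROP SILENT GRAPH <http://example.org/graph/artist/42> ;

# ...then write the new version into the same graph
INSERT DATA {
  GRAPH <http://example.org/graph/artist/42> {
    <http://example.org/artist/42> a ex:Artist ;
        ex:name "Example Artist (updated)" .
  }
}

No query is needed to work out which triples to delete; the graph boundary does that job.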

I am of the opinion that context should be used exclusively for data management within a triple store. But perhaps I will leave that discussion for a later post.

Introducing Tripliser

I recently had to solve the problem of how to take XML, in a predefined format, and create RDF representing the semantics of the data. I began using XSLT, but gradually the edge cases needed to handle inconsistencies in the input XML caused the XSLT to become verbose and incomprehensible (a mix of syntax handling and business logic). Errors were hard to diagnose, and failures were not effectively recovered from. I decided to write a library to help me with this problem, called Tripliser…

>> Homepage  |  >> GitHub

Tripliser is a Java library and command-line tool for creating triple graphs, and RDF serialisations, from XML source data. It is particularly suitable for data exhibiting any of the following characteristics:

  • Messy – missing data, badly formatted data, changeable structure
  • Bulky – large volumes of data
  • Volatile – ongoing changes to data and structure, e.g. feeds

Other source formats, such as CSV and SQL databases, may be supported in future.

It is designed as an alternative to XSLT conversion, providing the following advantages:

  • Easy-to-read mapping format – concisely describing each mapping
  • Robust – error or partial failure tolerant
  • Detailed reporting – comprehensive feedback on the successes and failures of the conversion process
  • Extensible – custom functions, flexible API
  • Efficient – facilities for processing data in large volumes with minimal memory usage

XML files are read in, and XPath is used to extract values which can be inserted into a triple graph. The graph can be serialised in various RDF formats and is accompanied by meta-data and a property-by-property report to indicate how successful or unsuccessful the mapping process was.

[Diagram: data flow in Tripliser]

Here’s what a typical mapping format looks like…

<?xml version="1.0" encoding="UTF-8"?>
<rdf-mapping xmlns="http://www.daverog.org/rdf-mapping" strict="false">
	<constants>
		<constant name="objectsUri" value="http://objects.theuniverse.org/" />
	</constants>
	<namespaces>
		<namespace prefix="xsd" url="http://www.w3.org/2001/XMLSchema#" />
		<namespace prefix="rdfs" url="http://www.w3.org/2000/01/rdf-schema#" />
		<namespace prefix="dc" url="http://purl.org/dc/elements/1.1/" />
		<namespace prefix="universe" url="http://theuniverse.org/" />
	</namespaces>
	<graph query="//universe-objects" name="universe-objects" comment="A graph for objects in the universe">
		<resource query="stars/star">
			<about prepend="${objectsUri}" append="#star" query="@id" />
			<properties>
				<property name="rdf:type" resource="true" value="universe:Star"/>
				<property name="dc:title" query="name" />
				<property name="universe:id" query="@id" />
				<property name="universe:spectralClass" query="spectralClass" />
			</properties>
		</resource>
		<resource query="planets/planet">
			<about prepend="${objectsUri}" append="#planet" query="@id" />
			<properties>
				<property name="rdf:type" resource="true" value="universe:Planet"/>
				<property name="dc:title" query="name" />
				<property name="universe:id" query="@id" />
				<property name="universe:adjective" query="adjective" />
				<property name="universe:numberOfSatellites" dataType="xsd:int" query="satellites" />
			</properties>
		</resource>
	</graph>
</rdf-mapping>
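
For reference, the shape of source XML this mapping expects can be inferred from its XPath queries; a hypothetical input document might be:

<?xml version="1.0" encoding="UTF-8"?>
<universe-objects>
	<stars>
		<star id="sol">
			<name>Sol</name>
			<spectralClass>G2V</spectralClass>
		</star>
	</stars>
	<planets>
		<planet id="earth">
			<name>Earth</name>
			<adjective>terrestrial</adjective>
			<satellites>1</satellites>
		</planet>
	</planets>
</universe-objects>

Given the mapping, the star element above would presumably become a resource <http://objects.theuniverse.org/sol#star> of type universe:Star, with dc:title "Sol".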

Go to the Homepage or to GitHub to find out more.

Automated, test-driven, server-side Javascript with Maven, jsUnit & Sinon.JS

Here’s some background to a problem I faced recently: I have quite a few server-side Javascript scripts which I need to expand and refactor. Having moved from my usual comfort-zone of test-driven Java, I wanted to work in the same style to ensure the quality of the scripts I was writing.

The following describes how I reached a solution, with the many blind alleys and wrong turnings I took omitted for simplicity.

Step 1: Find a Javascript unit-testing library

I needed a library that would work for server-side Javascript. The execution environment of the production code is Rhino, so I needed a compatible unit-testing framework. Unfortunately, what seem like the better Javascript testing frameworks are focused on the problem of client-side multi-browser testing. These frameworks often adopt a server model, allowing tests to be submitted and run on different browsers. Even a headless browser would not match the specific environment (Rhino), so these frameworks were ruled out.

I found a couple of interesting projects that I began to look into:

https://github.com/stefanofornari/rhinounit-maven-plugin
https://github.com/stefanofornari/rhinounit

and, in the absence of documentation, a project that used these:

https://github.com/stefanofornari/subitosms-thunderbird-extension

Step 2: Get a simple test to run with Maven

By installing first RhinoUnit, and then the Maven plugin, in my local repository, I was able to incorporate Javascript tests into my build. I did this using the following Maven plugin in my pom.xml:

<plugin>
	<groupId>funambol</groupId>
	<artifactId>rhinounit-maven-plugin</artifactId>
	<version>1.0</version>
	<executions>
		<execution>
			<phase>test</phase>
			<goals>
				<goal>test</goal>
			</goals>
		</execution>
	</executions>
	<configuration>
		<testSourceDirectory>src/test/scripts</testSourceDirectory>
		<includes>
			<include>**/tools/datadictionary.lib.js</include>
		</includes>
	</configuration>
</plugin>

Note: implicit in the Maven plugin are two directories:

  • src/main/scripts, the directory relative to which the includes are applied
  • src/main/js, where the tests reside. This can be overridden, as I have above, using the testSourceDirectory configuration option.

Given the pom.xml configuration above, to run a basic test I need the following…

1: A valid project structure, e.g.

[project]
  - pom.xml
  - src/main/scripts/my/tools/datadictionary.lib.js
  - src/test/scripts/DataDictionaryTestSuite.js

2: A valid pom.xml

3: DataDictionaryTestSuite.js contains a test such as:

function DataDictionaryTestSuite() {
}

DataDictionaryTestSuite.prototype.test1 = function test1() {
  assertTrue(true);
}

Then everything should be ready to start automated testing.

Run ‘mvn test’, and you should see output like the following:

[INFO] [rhinounit:test {execution: default}]
> Initializing...
> Done initializing
> Running test "test1"
<testsuite time="0.003" failures="0" errors="0" tests="1" name="DataDictionaryTestSuite">
<testcase time="0" name="test1">
</testcase>
</suite>
> Done (0.005 seconds)
--------------------------------------------------------------------------------
Tests run: 1, Failures: 0, Errors: 0
--------------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------

Step 3: Write some real tests

The next step was to start writing some useful tests. I wanted to test the ‘delete’ action for a file management API. The API I was writing depended on an underlying API providing the raw functionality; this API was not Javascript, but Java objects injected into the Rhino context as Javascript placeholders. As such, the underlying API needed to be stubbed. I decided to use a very comprehensive and test-framework-agnostic library for creating spies, mocks and stubs: Sinon.JS. This library made the following tests possible, the first testing successful deletion, the second testing failed deletion (file not found):

DataDictionaryTestSuite.prototype.testDeletion = function testDeletion() {
	root.childByNamePath = sinon.stub().withArgs("/file1").returns(file);
	dataDictionaryDir.removeNode = sinon.spy().withArgs(file);

	var model = new dataDictionaryModel();
	var message = model.deleteFile("file1");

	assertEquals("File 'file1' deleted", message);
	assertTrue("removeNode(file) not called", 	
		dataDictionaryDir.removeNode.withArgs(file).calledOnce);
}

DataDictionaryTestSuite.prototype.testDeletionOfNonExistantFileReturns404Error = 
		function testDeletionOfNonExistantFileReturns404Error() {
	root.childByNamePath = sinon.stub().withArgs("/file1").returns(null);
	dataDictionaryDir.removeNode = sinon.spy();

	var model = new dataDictionaryModel();

	try {
		model.deleteFile("file1");
		fail("An exception should have been thrown");
	} catch(error) {
		assertEquals(404, error["code"]);
		assertEquals("Unable to delete file 'file1' does not exist", 
			error["message"]);
		assertTrue("removeNode(...) should not have been called", 
			dataDictionaryDir.removeNode.callCount==0);
	}
}

The assert… statements are provided by jsUnit and require no additional configuration. The stubs, spies and verifications are provided by Sinon.JS and are documented on the Sinon.JS website.

I won’t go into the syntax of the tests above in too much detail, but essentially I am stubbing out the underlying API and ensuring the following:

  • A ‘removeNode’ action occurs for a successful deletion
  • A ‘removeNode’ action does not occur if the file could not be found
  • The response messages are relevant to the outcome

To use sinon.js, I found I needed to make the following adjustments:

1: Remove some code from sinon.js that conflicts with Rhino. The following extract is from line 1601 of sinon-1.1.1.js. Remove this code, along with several of the associated functions. I am not sure exactly what should be removed, and even less sure whether I am undermining the sinon library, but removing a few functions worked for me.

sinon.timers = {
    setTimeout: setTimeout,
    clearTimeout: clearTimeout,
    setInterval: setInterval,
    clearInterval: clearInterval,
    Date: Date
};

2: Locate sinon.js in /src/main/scripts and add the following include:

<include>sinon.js</include>

Step 4: Inject global mocks

You’ll notice that all the file API tests above refer to a variable ‘root’. This is a global variable which is injected into the Rhino context on the production system. For the Javascript in the includes directories to run, these global variables need to be present. I found it necessary to create a ‘global-mocks.js’ file in src/main/scripts containing the following:

var root = new Object();
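
The tests above also reference ‘dataDictionaryDir’ and ‘file’, so in practice the same file presumably needs to declare every injected global the scripts touch (an assumption based on the tests shown, not on the original global-mocks.js):

// Globals injected by the production Rhino context, mocked for tests
var root = new Object();

// Assumed: the other globals referenced by the tests above need the
// same treatment
var dataDictionaryDir = new Object();
var file = new Object();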

Step 5: Run the tests

A failed test run, where the following assertion fails:

assertEquals("Unable to delete file 'file1' does not exist", error["message"]);

will give you test output indicating the nature of the failure:

[INFO] [rhinounit:test {execution: default}]
> Initializing...
> Done initializing
> Running test "testDeletionOfNonExistantFileReturns404Error"
testDeletionOfNonExistantFileReturns404Error failed
[object Object]
> Running test "testDeletion"
<testsuite time="0.05" failures="1" errors="0" tests="2" name="DataDictionaryTestSuite">
<testcase time="0.031" name="testDeletionOfNonExistantFileReturns404Error">
<failure type="jsUnitException">
Expected Unable to delete file 'file1' does not exist (string) but was Unable to delete file 'file1' not found (string)
</failure>
</testcase>
<testcase time="0.015" name="testDeletion">
</testcase>
</suite>
> Done (0.057 seconds)
--------------------------------------------------------------------------------
Tests run: 2, Failures: 1, Errors: 0
--------------------------------------------------------------------------------

WARNING: There are test failures.
--------------------------------------------------------------------------------

Finally, a successful build…

[INFO] [rhinounit:test {execution: default}]
> Initializing...
> Done initializing
> Running test "testDeletionOfNonExistantFileReturns404Error"
> Running test "testDeletion"
<testsuite time="0.045" failures="0" errors="0" tests="2" name="DataDictionaryTestSuite">
<testcase time="0.027" name="testDeletionOfNonExistantFileReturns404Error">
</testcase>
<testcase time="0.016" name="testDeletion">
</testcase>
</suite>
> Done (0.055 seconds)
--------------------------------------------------------------------------------
Tests run: 2, Failures: 0, Errors: 0
--------------------------------------------------------------------------------

Conclusion

This was my first attempt at test-driven Javascript, and I’d love to get some feedback if there are cleaner ways to achieve this, or even just some alternatives.