Ontologies in software: A conflict of interest?

Posted on May 30, 2013 by daverog

I build software using semantic technologies. As such, ontologies are at the heart of what I do. For me, ontologies serve two very different roles, and these can often be in conflict. The two roles that I observe are as follows:

Collaborative ontologies: Initially for domain modelling discussion, with the eventual aim of standardisation, either across or within organisations
Embedded ontologies: Form an integral part of a working software system, modelling both the domain, and other application-specific data structures

Below I have described some further characteristics I have observed of these two roles:

Collaborative ontologies

An attempt to model an entire domain
Discussed by a wide community, who can present edge-cases and suggested changes
Change slowly (or never) by necessity, allowing people to publish data without the model becoming out-of-date
A tendency to model more, so that a majority of potential users’ requirements are met
Designed in advance of use
Aspires to be a truthful representation of the domain
An academic approach, whereby ontologies are published by authors
Equate to the ‘vision’ in agile software development terms

Embedded ontologies

An attempt to model parts of a domain used with a software system
Discussed by the community using the system
Change frequently, in response to the software design and the requirements of the system
An MVO (minimal viable ontology), to match how the ontologies are used in the system
Designed in response to requirements of the system
Required to balance the demands of a software system, and the desire to be a truthful representation of the domain. For example, performance considerations
An engineering approach, whereby ontologies are emergent
Equate to the actual software in agile software development terms

In short, I would describe the most fundamental difference as follows:

Collaborative ontologies: designed to change as little as possible
Embedded ontologies: designed to be changed as easily as possible

The idea of an embedded ontology might be less familiar, as they are certainly less common. For me, they are simply the data model of a software system, expressed using linked data. This is most obviously the case where linked data technologies, such as triple stores or RDF, are used to power the software.

I think both roles are important and necessary. However, the rest of this post is focused on demonstrating the need for embedded ontologies, because these are less widely used and understood.

Relationship between collaborative & embedded ontologies

Most commonly, the distinction is not made between collaborative and embedded ontologies, and the conflict takes place over a single ontology. I believe a more productive approach would be to separate the embedded and collaborative ontologies, and to allow each to develop in response to its own pressures. The differences that emerge can inform a dialogue between abstract modelling and software delivery.

The use-case I will now present should demonstrate how this separate embedded ontology works in practice.

Embedded ontology use-case: A Sport app

To explain how embedded ontologies work, I will give an example: a software application in the sport domain.

First, to contrast, a collaborative sport ontology would be an attempt to model the domain of sport, comprising teams, sportspeople, disciplines, and perhaps sponsorship, venues, and fixtures.

Let us now assume I wanted to build a software system based around the domain of sport. My first question would be ‘what’s the MVP (minimal viable product)?’. I would hope to get an answer such as “A mobile app showing a list of teams in each English football division”. How would I go about building this? Well, good domain-driven design would lead me to use a good model and language to describe the parts of my software. I might have a /teams endpoint in an API. I might build a TeamList component in the JavaScript code. What I would not do is attempt to model sportspeople, disciplines, sponsorship, venues and fixtures. I don’t need these yet. What I need is a Minimal Viable Ontology.
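As a rough illustration, a minimal viable ontology for this MVP could be as small as the sketch below: two classes, one property, and nothing about sportspeople, venues or fixtures. Every name here (the `ex:` prefix, `ex:inDivision`, the sample teams and divisions) is hypothetical and not taken from any published sport ontology.

```python
from collections import defaultdict

# Hypothetical embedded ontology: only what the team-list feature needs.
CLASSES = {"ex:Team", "ex:Division"}
PROPERTIES = {"ex:inDivision": ("ex:Team", "ex:Division")}  # property -> (domain, range)

# Sample instance data as (subject, predicate, object) triples.
TRIPLES = [
    ("ex:team-a", "rdf:type", "ex:Team"),
    ("ex:team-b", "rdf:type", "ex:Team"),
    ("ex:team-c", "rdf:type", "ex:Team"),
    ("ex:team-b", "ex:inDivision", "ex:division-1"),
    ("ex:team-a", "ex:inDivision", "ex:division-1"),
    ("ex:team-c", "ex:inDivision", "ex:division-2"),
]


def teams_by_division(triples):
    """Group team identifiers by division, as a /teams endpoint might."""
    grouped = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == "ex:inDivision":
            grouped[obj].append(subject)
    return {division: sorted(teams) for division, teams in grouped.items()}


def unknown_predicates(triples):
    """Predicates used in the data that the embedded ontology does not define."""
    known = set(PROPERTIES) | {"rdf:type"}
    return sorted({predicate for _, predicate, _ in triples} - known)


print(teams_by_division(TRIPLES))
print(unknown_predicates(TRIPLES))
```

When a later requirement calls for, say, sponsorship, the `unknown_predicates` check is the point at which the ontology would grow, rather than in advance.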

This used to be a problem with SQL database schemas, which were designed in advance of writing any software. Then agile came along, together with Lean, TDD, BDD and so on. These approaches showed that upfront design usually resulted in the wrong architectural design, badly modelled data structures, features nobody wants and unnecessary complexity.

So, whilst I believe that collaborative ontologies serve an important role in fostering collaboration in linked data, I also believe that we need strategies to make these approaches compatible with embedded ontologies, and therefore software engineering best practice.

Using ontologies in software

I want to now talk about some approaches to building embedded ontologies, used within software. As you will see, these contrast in a number of significant ways with approaches to building ontologies within academic communities.

These approaches are not intended to supersede collaborative ontologies, but instead to help reduce the friction between collaborative and embedded ontologies and, therefore, between those building software using linked data and those building models in academia.

Separate embedded & collaborative ontologies

A good first step is to get a distinct understanding of what constitutes the embedded ontology in-use when compared with the collaborative ontology. This could be done in a number of ways:

No collaborative ontology until necessary: For me, this is the most important. I believe the preemptive publishing of collaborative ontologies can be counter-productive; if an ontology can remain private and embedded until the software has been proven to work, then it will be a higher-quality representation of the domain. This adheres to the software principle that the best design is the design that emerges in response to iterative requirements. From my software perspective, a published and shared ontology is like an Open API: whilst the benefits of sharing can be huge, you have potentially also lost one of the most important capabilities in building software, the ability to change frequently and easily.
Modularisation: I would suggest this as a key second step. Breaking ontologies down into smaller parts, particularly in response to growth. This will reduce the rate at which these modules change individually when compared with the whole. This technique can avoid the necessity for mappings (see below).
Mapped: Where a clear mapping between the ontologies is expressed (perhaps using semantic equivalence). This comes with the overhead of maintaining two ontologies and a mapping.

Modularised ontologies

One issue with ontologies is the common practice of attempting to model a single ‘domain’ with a single ontology. A selling point for this is the ability to “work within a single namespace”. I think that this practice is counter-productive, particularly with regards to aligning collaborative and embedded ontologies.

If a modularised approach is taken, a number of options become available. If an ontology can be broken into a number of parts, with references between them, the parts can be assigned metadata to indicate the following:

Stability: For example, it should be possible to add a new ‘ontology module’ which is entirely experimental. In sport, this could be a ‘sponsorship’ module which is up for discussion in the community, but forms no part of the working software.
Module version: By separately versioning ontology modules, stability can be achieved within some modules, whilst others regularly change.
Equivalence/alternatives: Alternative or equivalent modules could exist if different perspectives on the same domain exist.
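A minimal sketch of what such module metadata might look like in code is given below. The field names, the "stable"/"experimental" vocabulary and the module names are all assumptions made for illustration; no existing meta-ontology is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyModule:
    name: str
    version: str
    stability: str            # assumed vocabulary: "stable" or "experimental"
    alternatives: tuple = ()  # names of equivalent or alternative modules


# Hypothetical modules for the sport example.
SPORT_MODULES = [
    OntologyModule("sport-core", "1.2.0", "stable"),
    OntologyModule("sport-fixtures", "0.4.0", "stable", alternatives=("sport-schedule",)),
    OntologyModule("sport-sponsorship", "0.1.0", "experimental"),
]


def modules_for_production(modules):
    """Only stable modules form part of the working software."""
    return [module.name for module in modules if module.stability == "stable"]


print(modules_for_production(SPORT_MODULES))
```

The experimental sponsorship module stays visible for community discussion, but is filtered out of anything the software depends on.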

Dependency management between modules

Whilst I am aware that work has been done in this area, it has not matured to the level seen with library dependency management in software (e.g. Maven, Ivy etc). It is straightforward to build dependency graphs between ontologies, but the more subtle version-specific dependency management is not readily available. More efficient tooling, and a consensus on meta-ontologies in this area would lower the bar for fine-grained modularisation of ontologies.
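To make the idea of version-specific dependency management concrete, here is a small sketch that checks minimum-version requirements between modules, in the spirit of what library dependency tools do. The module names, the dotted numeric version scheme and the "minimum version" rule are assumptions for illustration only.

```python
def parse_version(text):
    """Turn '1.2.0' into (1, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# Hypothetical registry: module -> (version, {dependency: minimum version}).
MODULE_VERSIONS = {
    "sport-core": ("1.2.0", {}),
    "sport-fixtures": ("0.4.0", {"sport-core": "1.1.0"}),
    "sport-sponsorship": ("0.1.0", {"sport-core": "2.0.0"}),
}


def unmet_dependencies(modules):
    """List (module, dependency, reason) for every requirement not satisfied."""
    problems = []
    for name, (_, requirements) in sorted(modules.items()):
        for dependency, minimum in sorted(requirements.items()):
            if dependency not in modules:
                problems.append((name, dependency, "missing"))
                continue
            available = modules[dependency][0]
            if parse_version(available) < parse_version(minimum):
                problems.append((name, dependency, f"needs >= {minimum}, found {available}"))
    return problems


print(unmet_dependencies(MODULE_VERSIONS))
```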

Finally

It should now be clear that embedded ontologies are a by-product of software delivery. This, in my view, is exactly how it should work (as long as the embedded ontologies are curated and crafted as the software grows, using the same principles applied to collaborative ontology design). Using this approach, I would suggest that a higher-quality, more robust ontology may emerge. It will be an ontology that has been road-tested by the necessity to deliver software that serves a particular audience. Perhaps this particular audience will skew the perspective of the ontology, but any model is, after all, just a perspective.

Once this ontology has undergone this growth, and subsequent road-testing, the dialogue between a fit-for-purpose embedded ontology and a collaborative ontology can get really interesting. Both will have much to offer, and I feel the benefits will flow in both directions.

I would be keen to hear if any of this reflects your own experiences, and, in particular, to get the perspective of individuals working exclusively on collaborative ontologies.

RDF Tree revisited: Developer-friendly JSON or XML from RDF graphs

Posted on November 19, 2012 by daverog

In my previous post I talked about RDF Tree, an approach to building JSON or XML data from RDF graphs. Having received a number of useful comments, particularly from those involved with JSON-LD, I have revisited the approach and would like to present a revised version.

What is RDF Tree?

RDF Tree is an approach (and a Java library in-development to implement the approach), to producing developer-friendly serialisations of RDF graphs. It is not a serialisation format in itself like JSON-LD, but simply an approach to building predictable, stable JSON and XML representations of graph data.

The aims of this approach are as follows:

RDF Tree serialisations are non-semantic
- Designed to power data-driven visual representation of data such as HTML
- Designed to be lossy: the RDF graph cannot be recovered from the data
  - It is best practice to offer the data as RDF also for clients that require semantic data
RDF Tree is designed to be flexible
- Whilst there are core principles, different rules, syntax and algorithms can be used to tailor the approach to a specific domain or use-case
RDF Trees are either single trees or multiple trees in an ordered list
- Tree root(s) are indicated in the RDF using the tree ontology (see previous post)
- For single trees, a specific root resource is known
- For multiple trees, an ordered list of root resources is known (duplicates allowed)
RDF Trees can be built according to different rules
- The four general rules for constructing the abstract tree from a graph structure are outlined in the previous blogpost
- As the rules can vary, there is no one canonical RDF Tree for a given graph input
- Given a fixed set of rules, RDF Trees are produced as a function of a graph input
- Rules include:
  - When to stop traversing the graph when building the tree
  - How to ‘canonicalise’ the resulting RDF Tree (e.g. deterministic property ordering)
The JSON or XML produced using this approach is largely indistinguishable from ‘vanilla’ JSON or XML
- No superfluous meta or reference data is provided in order to extract the original graph or understand the specific semantics of the data
- Designed for use with generic JSON or XML parsing libraries
Where naming conflicts exist, stable prefixes are used to distinguish between properties
Assumptions are made to optimise the approach
- All data is considered single-language; different languages can be requested using the Accept-Language header
- Datatype handling is minimal – datatypes are expected to be predictable
  - No datatypes in XML
  - JSON value types are respected
Where possible, the JSON syntax is aligned with JSON-LD, with the principal difference being the absence of the “@context” metadata
Inverse properties are included with the “^” prefix in JSON and an inverse=”true” attribute in XML
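The aims above can be sketched as a small tree-building function. This is not the RDF Tree library's implementation, and the rules it applies are assumptions chosen for illustration: traversal stops when a resource is already an ancestor on the current path, properties are ordered alphabetically by local name, and resources with nothing further to show are emitted as plain IRI strings. The `IRI` helper class and the example.org data are hypothetical.

```python
import re


class IRI(str):
    """Marks a value as a resource identifier rather than a literal."""


def local_name(iri):
    """Use the part after the last '#' or '/' as the field name."""
    return re.split(r"[#/]", iri)[-1]


def build_tree(triples, root, path=()):
    """Return a nested dict for `root`, or its IRI string if nothing to expand."""
    if root in path:  # assumed stopping rule: never revisit an ancestor
        return str(root)
    outgoing = {}
    for subject, predicate, obj in triples:
        if subject == root:
            outgoing.setdefault(local_name(predicate), []).append(obj)
    if not outgoing:
        return str(root)
    node = {"@id": str(root)}
    for name in sorted(outgoing):  # deterministic property ordering
        values = [
            build_tree(triples, obj, path + (root,)) if isinstance(obj, IRI) else obj
            for obj in outgoing[name]
        ]
        node[name] = values[0] if len(values) == 1 else values
    return node


EX = "http://example.org/"
TRIPLES = [
    (IRI(EX + "ben"), IRI(EX + "name"), "Ben Ainslie"),
    (IRI(EX + "ben"), IRI(EX + "discipline"), IRI(EX + "sailing")),
    (IRI(EX + "sailing"), IRI(EX + "name"), "Sailing"),
    (IRI(EX + "sailing"), IRI(EX + "competitor"), IRI(EX + "ben")),  # cycle back to root
]

tree = build_tree(TRIPLES, IRI(EX + "ben"))
print(tree)
```

Because the output is a function of the graph and the chosen rules, swapping the stopping rule or the ordering yields a different, equally valid tree for the same input.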

What does RDF Tree look like?

For the given RDF Turtle input:

@prefix par:     <http://purl.org/vocab/participation/schema#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix geo:     <http://www.bbc.co.uk/ontologies/geopolitical/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix domain:  <http://www.bbc.co.uk/ontologies/domain/> .
@prefix oly:     <http://www.bbc.co.uk/ontologies/2012olympics/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sport: <http://www.bbc.co.uk/ontologies/sport/> .
@prefix tree:  <http://purl.org/rdf-tree/> .

tree:tree tree:root <http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id> .

<http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id> a sport:CompetitiveSportingOrganisation ;
      oly:territory <http://www.bbc.co.uk/things/territories/gb#id> ;
      domain:document <http://www.bbc.co.uk/sport/olympics/2012/countries/great-britain> ;
      domain:shortName "Great Britain & N. Ireland"^^xsd:string ;
      domain:name "Team GB"^^xsd:string .

<http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id> a sport:Person ;
      par:role_at <http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id> ;
      oly:dateOfBirth "1976-10-24"^^xsd:date ;
      oly:gender "M"^^xsd:string ;
      oly:height "172.0"^^xsd:float ;
      oly:weight "72.0"^^xsd:float ;
      domain:name "Ben Ainslie"^^xsd:string ;
      sport:competesIn <http://www.bbc.co.uk/things/2012/sam002#id>, <http://www.bbc.co.uk/things/2012/sam005#id> ;
      sport:discipline <http://www.bbc.co.uk/things/d65c5dce-f5e4-4340-931b-16ca1848d092#id> ;
      domain:document <http://www.bbc.co.uk/sport/olympics/2012/athletes/4e40ce40-b632-4a42-98d7-cf97067f7bf9>, <http://www.facebook.com/pages/Ben-Ainslie/108182689201922> ;
      foaf:familyName "Ainslie"^^xsd:string ;
      foaf:givenName "Ben"^^xsd:string .

<http://www.facebook.com/pages/Ben-Ainslie/108182689201922> a domain:Document ; 
   domain:documentType <http://www.bbc.co.uk/things/document-types/external> , <http://www.bbc.co.uk/things/document-types/facebook> .

<http://www.bbc.co.uk/sport/olympics/2012/athletes/4e40ce40-b632-4a42-98d7-cf97067f7bf9> a domain:Document ;
   domain:domain <http://www.bbc.co.uk/things/domains/olympics2012> ;
   domain:documentType <http://www.bbc.co.uk/things/document-types/bbc-document> .

<http://www.bbc.co.uk/things/2012/sam002#id> a sport:MedalCompetition ;
      domain:name "Sailing - Men's Finn"^^xsd:string ;
      domain:shortName "Men's Finn"^^xsd:string ;
      domain:externalId <urn:ioc2012:SAM002000> ;
      domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-finn> .

<http://www.bbc.co.uk/things/2012/sam005#id> a sport:MedalCompetition ;
      domain:name "Sailing - Men's 470"^^xsd:string ;
      domain:shortName "Men's 470"^^xsd:string ;
      domain:externalId <urn:ioc2012:SAM005000> ;
      domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-470> .

<http://www.bbc.co.uk/things/d65c5dce-f5e4-4340-931b-16ca1848d092#id> a sport:SportsDiscipline ;
      domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/sailing> ;
      domain:name "Sailing"^^xsd:string .

<http://www.bbc.co.uk/things/territories/gb#id> a geo:Territory ;
      domain:name "the United Kingdom of Great Britain and Northern Ireland"^^xsd:string ;
      geo:isInGroup <http://www.bbc.co.uk/things/81b14df8-f9d2-4dff-a676-43a1a9a5c0a5#id> .

<http://www.bbc.co.uk/things/81b14df8-f9d2-4dff-a676-43a1a9a5c0a5#id> a geo:Group ;
      domain:name "Europe"^^xsd:string ;
      geo:groupType <http://www.bbc.co.uk/things/group-types/bbc-news-geo-regions> .

The following JSON is produced:

{
  "@id": "http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id",
  "@type": "http://www.bbc.co.uk/ontologies/sport/Person",
  "dateOfBirth": "1976-10-24",
  "familyName": "Ainslie",
  "gender": "M",
  "givenName": "Ben",
  "height": 172.0,
  "name": "Ben Ainslie",
  "weight": 72.0,
  "competesIn": [
    {
      "@id": "http://www.bbc.co.uk/things/2012/sam002#id",
      "@type": "http://www.bbc.co.uk/ontologies/sport/MedalCompetition",
      "name": "Sailing - Men\u0027s Finn",
      "shortName": "Men\u0027s Finn",
      "document": "http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-finn",
      "externalId": "urn:ioc2012:SAM002000"
    },
    {
      "@id": "http://www.bbc.co.uk/things/2012/sam005#id",
      "@type": "http://www.bbc.co.uk/ontologies/sport/MedalCompetition",
      "name": "Sailing - Men\u0027s 470",
      "shortName": "Men\u0027s 470",
      "document": "http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-470",
      "externalId": "urn:ioc2012:SAM005000"
    }
  ],
  "discipline": {
    "@id": "http://www.bbc.co.uk/things/d65c5dce-f5e4-4340-931b-16ca1848d092#id",
    "@type": "http://www.bbc.co.uk/ontologies/sport/SportsDiscipline",
    "name": "Sailing",
    "document": "http://www.bbc.co.uk/sport/olympics/2012/sports/sailing"
  },
  "document": [
    {
      "@id": "http://www.facebook.com/pages/Ben-Ainslie/108182689201922",
      "@type": "http://www.bbc.co.uk/ontologies/domain/Document",
      "documentType": [
        "http://www.bbc.co.uk/things/document-types/facebook",
        "http://www.bbc.co.uk/things/document-types/external"
      ]
    },
    {
      "@id": "http://www.bbc.co.uk/sport/olympics/2012/athletes/4e40ce40-b632-4a42-98d7-cf97067f7bf9",
      "@type": "http://www.bbc.co.uk/ontologies/domain/Document",
      "documentType": "http://www.bbc.co.uk/things/document-types/bbc-document",
      "domain": "http://www.bbc.co.uk/things/domains/olympics2012"
    }
  ],
  "role_at": {
    "@id": "http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id",
    "@type": "http://www.bbc.co.uk/ontologies/sport/CompetitiveSportingOrganisation",
    "name": "Team GB",
    "shortName": "Great Britain \u0026 N. Ireland",
    "document": "http://www.bbc.co.uk/sport/olympics/2012/countries/great-britain",
    "territory": {
      "@id": "http://www.bbc.co.uk/things/territories/gb#id",
      "@type": "http://www.bbc.co.uk/ontologies/geopolitical/Territory",
      "name": "the United Kingdom of Great Britain and Northern Ireland",
      "isInGroup": {
        "@id": "http://www.bbc.co.uk/things/81b14df8-f9d2-4dff-a676-43a1a9a5c0a5#id",
        "@type": "http://www.bbc.co.uk/ontologies/geopolitical/Group",
        "name": "Europe",
        "groupType": "http://www.bbc.co.uk/things/group-types/bbc-news-geo-regions"
      }
    }
  }
}

And the following XML is produced:

<Person id="http://www.bbc.co.uk/things/4e40ce40-b632-4a42-98d7-cf97067f7bf9#id">
  <dateOfBirth>1976-10-24</dateOfBirth>
  <familyName>Ainslie</familyName>
  <gender>M</gender>
  <givenName>Ben</givenName>
  <height>172.0</height>
  <name>Ben Ainslie</name>
  <weight>72.0</weight>
  <competesIn>
    <MedalCompetition id="http://www.bbc.co.uk/things/2012/sam002#id">
      <name>Sailing - Men's Finn</name>
      <shortName>Men's Finn</shortName>
      <document id="http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-finn"/>
      <externalId id="urn:ioc2012:SAM002000"/>
    </MedalCompetition>
  </competesIn>
  <competesIn>
    <MedalCompetition id="http://www.bbc.co.uk/things/2012/sam005#id">
      <name>Sailing - Men's 470</name>
      <shortName>Men's 470</shortName>
      <document id="http://www.bbc.co.uk/sport/olympics/2012/sports/sailing/events/mens-470"/>
      <externalId id="urn:ioc2012:SAM005000"/>
    </MedalCompetition>
  </competesIn>
  <discipline>
    <SportsDiscipline id="http://www.bbc.co.uk/things/d65c5dce-f5e4-4340-931b-16ca1848d092#id">
      <name>Sailing</name>
      <document id="http://www.bbc.co.uk/sport/olympics/2012/sports/sailing"/>
    </SportsDiscipline>
  </discipline>
  <document>
    <Document id="http://www.facebook.com/pages/Ben-Ainslie/108182689201922">
      <documentType id="http://www.bbc.co.uk/things/document-types/facebook"/>
      <documentType id="http://www.bbc.co.uk/things/document-types/external"/>
    </Document>
  </document>
  <document>
    <Document id="http://www.bbc.co.uk/sport/olympics/2012/athletes/4e40ce40-b632-4a42-98d7-cf97067f7bf9">
      <documentType id="http://www.bbc.co.uk/things/document-types/bbc-document"/>
      <domain id="http://www.bbc.co.uk/things/domains/olympics2012"/>
    </Document>
  </document>
  <role_at>
    <CompetitiveSportingOrganisation id="http://www.bbc.co.uk/things/7ef7ffdf-f101-4470-adc0-38a5abac9122#id">
      <name>Team GB</name>
      <shortName>Great Britain &amp; N. Ireland</shortName>
      <document id="http://www.bbc.co.uk/sport/olympics/2012/countries/great-britain"/>
      <territory>
        <Territory id="http://www.bbc.co.uk/things/territories/gb#id">
          <name>the United Kingdom of Great Britain and Northern Ireland</name>
          <isInGroup>
            <Group id="http://www.bbc.co.uk/things/81b14df8-f9d2-4dff-a676-43a1a9a5c0a5#id">
              <name>Europe</name>
              <groupType id="http://www.bbc.co.uk/things/group-types/bbc-news-geo-regions"/>
            </Group>
          </isInGroup>
        </Territory>
      </territory>
    </CompetitiveSportingOrganisation>
  </role_at>
</Person>

Property names

RDF Tree uses the property’s local name as the JSON field name. If a name conflict exists (more than one IRI exists for the same local name), then the IRI prefix is used to distinguish the properties, e.g. “foaf:name” where another “name” exists. A namespace priority list is used to determine which IRI can be expressed as just the local name, and which requires the prefix.

Essentially, no two properties can have the same name. However, property names can vary depending on the presence of other properties with the same local name.

The same approach is used in the XML element names, except the separator char is “-” resulting in disambiguated element names like <foaf-name/>.
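The naming rule described above can be sketched roughly as follows. The prefix table and the namespace priority list are assumptions for illustration (here the BBC domain namespace outranks FOAF, matching the “foaf:name” example), and namespaces missing from the priority list are simply ranked last. The sketch also assumes that any namespace involved in a conflict has an entry in the prefix table.

```python
import re

# Hypothetical prefix table and namespace priority list (highest priority first).
PREFIXES = {
    "http://www.bbc.co.uk/ontologies/domain/": "domain",
    "http://xmlns.com/foaf/0.1/": "foaf",
}
PRIORITY = [
    "http://www.bbc.co.uk/ontologies/domain/",
    "http://xmlns.com/foaf/0.1/",
]


def split_iri(iri):
    """Split an IRI into (namespace, local name) at the last '#' or '/'."""
    match = re.match(r"^(.*[#/])([^#/]+)$", iri)
    return match.group(1), match.group(2)


def rank(namespace):
    """Position in the priority list; unknown namespaces sort last."""
    return PRIORITY.index(namespace) if namespace in PRIORITY else len(PRIORITY)


def field_names(property_iris, separator=":"):
    """Map each property IRI to a unique field name."""
    by_local = {}
    for iri in dict.fromkeys(property_iris):  # drop duplicates, keep order
        namespace, local = split_iri(iri)
        by_local.setdefault(local, []).append((namespace, iri))
    names = {}
    for local, candidates in by_local.items():
        candidates.sort(key=lambda item: rank(item[0]))
        for position, (namespace, iri) in enumerate(candidates):
            # The highest-priority IRI keeps the bare local name.
            names[iri] = local if position == 0 else f"{PREFIXES[namespace]}{separator}{local}"
    return names


PROPERTIES = [
    "http://xmlns.com/foaf/0.1/name",
    "http://www.bbc.co.uk/ontologies/domain/name",
    "http://xmlns.com/foaf/0.1/givenName",
]
print(field_names(PROPERTIES))        # JSON-style names
print(field_names(PROPERTIES, "-"))   # XML-style names
```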

Stable property set

Even though naming inconsistencies will be rare, the potential can be reduced by adding properties to a list of ‘stable’ IRIs with prefix and unique local name. This set will contain the definitive set of unambiguous local names. This set will never be visible to users of the data, and is simply there to ensure the stability of the data.

Introducing Tripliser

Posted on June 21, 2011 by daverog

I recently had to solve the problem of how to take XML, in a predefined format, and create RDF representing the semantics of the data. I began using XSLT, but gradually the edge cases to handle inconsistencies in the input XML caused the XSLT to become verbose and incomprehensible (being a mix of syntax handling and business logic). Errors were hard to diagnose and failures were not effectively recovered from. I decided to write a library to help me with this problem, called Tripliser…

>> Homepage | >> GitHub

Tripliser is a Java library and command-line tool for creating triple graphs, and RDF serialisations, from XML source data. It is particularly suitable for data exhibiting any of the following characteristics:

Messy – missing data, badly formatted data, changeable structure
Bulky – large volumes of data
Volatile – ongoing changes to data and structure, e.g. feeds

Other non-RDF source data may be supported in future such as CSV and SQL databases.

It is designed as an alternative to XSLT conversion, providing the following advantages:

Easy-to-read mapping format – concisely describing each mapping
Robust – error or partial failure tolerant
Detailed reporting – comprehensive feedback on the successes and failures of the conversion process
Extensible – custom functions, flexible API
Efficient – facilities for processing data in large volumes with minimal memory usage

XML files are read in, and XPath is used to extract values which can be inserted into a triple graph. The graph can be serialised in various RDF formats and is accompanied by meta-data and a property-by-property report to indicate how successful or unsuccessful the mapping process was.

Here’s what a typical mapping format looks like…

<?xml version="1.0" encoding="UTF-8"?>
<rdf-mapping xmlns="http://www.daverog.org/rdf-mapping" strict="false">
	<constants>
		<constant name="objectsUri" value="http://objects.theuniverse.org/" />
	</constants>
	<namespaces>
		<namespace prefix="xsd" url="http://www.w3.org/2001/XMLSchema#" />
		<namespace prefix="rdfs" url="http://www.w3.org/2000/01/rdf-schema#" />
		<namespace prefix="dc" url="http://purl.org/dc/elements/1.1/" />
		<namespace prefix="universe" url="http://theuniverse.org/" />
	</namespaces>
	<graph query="//universe-objects" name="universe-objects" comment="A graph for objects in the universe">
		<resource query="stars/star">
			<about prepend="${objectsUri}" append="#star" query="@id" />
			<properties>
				<property name="rdf:type" resource="true" value="universe:Star"/>
				<property name="dc:title" query="name" />
				<property name="universe:id" query="@id" />
				<property name="universe:spectralClass" query="spectralClass" />
			</properties>
		</resource>
		<resource query="planets/planet">
			<about prepend="${objectsUri}" append="#planet" query="@id" />
			<properties>
				<property name="rdf:type" resource="true" value="universe:Planet"/>
				<property name="dc:title" query="name" />
				<property name="universe:id" query="@id" />
				<property name="universe:adjective" query="adjective" />
				<property name="universe:numberOfSatellites" dataType="xsd:int" query="satellites" />
			</properties>
		</resource>
	</graph>
</rdf-mapping>
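To give a feel for what applying such a mapping involves, here is a small Python sketch of the general idea: path queries select resources and values, triples are collected, and missing values are reported instead of aborting the run. It is not Tripliser's code or API. The dictionary-based mapping, the function name and the sample input XML are all hypothetical, and only a cut-down version of the star mapping is covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical source data; the second star lacks a spectralClass element.
SOURCE = """<universe-objects>
  <stars>
    <star id="sun"><name>Sun</name><spectralClass>G2V</spectralClass></star>
    <star id="sirius"><name>Sirius</name></star>
  </stars>
</universe-objects>"""

OBJECTS_URI = "http://objects.theuniverse.org/"

# Cut-down, hypothetical equivalent of the star resource mapping.
STAR_MAPPING = {
    "resource": "stars/star",
    "about": ("id", "#star"),  # (attribute to read, suffix to append)
    "properties": {"dc:title": "name", "universe:spectralClass": "spectralClass"},
}


def apply_mapping(xml_text, mapping):
    """Return (triples, report); missing values are reported, not fatal."""
    root = ET.fromstring(xml_text)
    triples, report = [], []
    for element in root.findall(mapping["resource"]):
        attribute, suffix = mapping["about"]
        subject = f"{OBJECTS_URI}{element.get(attribute)}{suffix}"
        for predicate, query in mapping["properties"].items():
            value = element.findtext(query)
            if value is None or not value.strip():
                report.append(f"{subject}: no value for {predicate}")
                continue
            triples.append((subject, predicate, value.strip()))
    return triples, report


triples, report = apply_mapping(SOURCE, STAR_MAPPING)
for triple in triples:
    print(triple)
print(report)
```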

Go to the Homepage or to GitHub to find out more.

Aggregate! Aggregate! Aggregate! Using linked data to make websites more interesting.

Posted on September 27, 2010 by daverog

The way content is modelled is pivotal to the overall technical design of a website and, in particular, to the choice of content management system.

Many bespoke CMSs are built using a relational database as the content repository. Most generic, off-the-shelf CMSs now use NoSQL technology, specifically document stores, to provide the necessary flexibility of schema design. Aside from these two key content storage technologies, a small number of CMSs, usually hosted services, make direct use of triple stores: content repositories that use linked data technology.

Each type of content repository has advantages and disadvantages:

Relational database: Supported by a wealth of mature technology and offers relational integrity. However, schema change can be slow and complex.
Document store: Provides schema flexibility, horizontal scalability and content ‘atomicity’ (the ability to handle content in discrete chunks, for versioning or workflow). However, referential integrity is hard to maintain and queries across different sets of data are harder to achieve.
Triple store: Provides logical inference and a fundamental ability to link concepts across domains. However, the technology is less mature.

I think there are two diametrically opposed content modelling approaches; most approaches will fit somewhere, on a sliding scale, between the two extremes. Each exhibits properties that make some things simple, and some things complex:

Atomic content: Content is modelled as a disconnected set of content ‘atoms’ which have no inter-relationships. Within each atom is a nugget of rich, hierarchically structured content. The following content management features are easier to implement if content is modelled atomically:
- Versioning
- Publishing
- Workflow
- Access control
- Schema flexibility
Relational content: Content is modelled as a graph of small content fragments with a rich and varied set of relationships between the fragments. The following content management features are easier to implement if content is modelled relationally:
- Logical inference
- Relationship validation/integrity
- Schema standardisation

The following diagram indicates where the different content repositories fit on this scale:

For the scenario outlined below, two ‘advanced’ features have been used to highlight tension between atomic and related content models. The versioning feature causes problems for more related models, and the tag relationships feature causes problems for more atomic models.

The online film shop

A use-case to explain why linked data technology can help to build rich, content-driven websites

An online film shop needs to be built. It will provide compellingly presented information about films, and allow users to purchase films and watch them in the browser.

Here are some (very) high-level user-stories:

“As a user I want to shop for films at http://www.onlinecinemashop.com, and watch them online after purchase”

“As an editor I need to manage the list of available films, providing enough information to support a compelling user experience”

An analysis indicates that the following content is needed:

Films
- The digital movie asset
- Description (rich-text with screen-grabs & clips)
- People involved (actors, directors)
- Tags (horror, comedy, prize-winning, etc…)

Some basic functionality has been developed already…

A page for each film
- A list of participants, ordered by role
A page for each person
- A list of the films they have participated in
An A-Z film list
A shopping basket and check-out form
A login-protected viewing page for each film

But now, the following advanced functionality is needed…

Feature 1, Versioned film information: This supports several features including:
- Rolling-back content if mistakes are made
- Allowing editors to set up new versions for publishing at a later date, whilst keeping the current version live
- Providing a workflow, allowing approval of a film’s information, whilst continuing to edit a draft version
Feature 2, Tagged content: Tags are used to express domain concepts, these concepts have different types of relationships which will support:
- Rich navigation, e.g. “This director also worked with…”
- Automated, intelligent aggregation pages, e.g. An aggregation page for ‘comedy’ films, including all sub-genres like ‘slapstick’

Using a relational database

Using a relational database, a good place to start is by modelling the entities and relationships…

Entities: Film, Person, Role, Term
Relationships: Participant [Person, Role, Film], Tag [Term, Film]

Assuming that the basic feature set has been implemented, the advanced features are now required:

Advanced feature 1: Versioned films

Trying to version data in a relational database is not easy: often a ‘parent’ entity is needed for all foreign keys to reference, with ‘child’ entities for the different versions, each annotated with version information. Handle versioning for more than one entity, and the schema starts getting complicated.
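The parent/child pattern might look something like the sketch below, using an in-memory SQLite database. The table and column names are assumptions for illustration. Note that the tag table can only reference the stable parent row, so tags are not versioned along with the film; versioning those too needs yet more tables, which is where the complexity comes from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE film (id INTEGER PRIMARY KEY);  -- stable 'parent' row
    CREATE TABLE film_version (                  -- one 'child' row per version
        film_id INTEGER NOT NULL REFERENCES film(id),
        version INTEGER NOT NULL,
        status  TEXT NOT NULL,                   -- 'draft' or 'live'
        title   TEXT NOT NULL,
        PRIMARY KEY (film_id, version)
    );
    CREATE TABLE tag (                           -- references the parent only
        film_id INTEGER NOT NULL REFERENCES film(id),
        term    TEXT NOT NULL
    );
    INSERT INTO film (id) VALUES (1);
    INSERT INTO film_version VALUES (1, 1, 'live', 'Pulp Fiction');
    INSERT INTO film_version VALUES (1, 2, 'draft', 'Pulp Fiction (1994)');
    INSERT INTO tag VALUES (1, 'Neo-noir');
""")


def live_title(connection, film_id):
    """Title of the most recent live version, or None if nothing is live."""
    row = connection.execute(
        "SELECT title FROM film_version "
        "WHERE film_id = ? AND status = 'live' "
        "ORDER BY version DESC LIMIT 1",
        (film_id,),
    ).fetchone()
    return row[0] if row else None


print(live_title(conn, 1))
```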

Advanced feature 2: Tag relationships

When tagging content, one starts by trying to come up with a single, simple list of terms that collectively describe the concepts in the domain. Quickly, that starts to break down. Firstly a 2-level hierarchy is identified; for example, putting each tag into a category: ‘Horror: Genre’, ‘Paris: Location’, and so on. Next, a multi-level hierarchy is discovered: ‘Montmartre: Paris: France: Europe’. Finally, this breaks down as concepts start to overlap and repeat. Here are a few examples: Basque Country (France & Spain), Arnold Schwarzenegger (Actor, Politician, Businessman). Things get even trickier with Bruce Dickinson. Basically, real-world concepts do not fall easily into a taxonomy, and instead form a graph.

Getting the relationships right is important for the film shop. If an editor tags a film as ‘slapstick’, they would expect it to appear on the ‘comedy’ aggregation page. Likewise, if an editor tagged a film as ‘Cannes Palme d’Or 2010 Winner’ they would expect it to be shown on the ‘2010 Cannes Nominees’ aggregation page. These two ‘inferences’ use different relationships, but achieve the correct result for the user.

With a relational database, 2-level hierarchies introduce complexity, multi-level hierarchies more so, and once the schema is used to model a graph, the benefits of a relational schema have been lost.
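To give a feel for that complexity, here is a minimal sketch of a multi-level hierarchy in SQLite, queried with a recursive common table expression. The single ‘broader’ column is an assumption of mine (a simple adjacency list), not a recommended design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical adjacency-list table: each term points at ONE broader term.
conn.executescript("""
CREATE TABLE term (label TEXT PRIMARY KEY, broader TEXT REFERENCES term(label));
INSERT INTO term VALUES ('Europe', NULL);
INSERT INTO term VALUES ('France', 'Europe');
INSERT INTO term VALUES ('Paris', 'France');
INSERT INTO term VALUES ('Montmartre', 'Paris');
""")

def ancestors(conn, label):
    """Walk up the hierarchy from `label`, nearest ancestor first."""
    rows = conn.execute(
        """
        WITH RECURSIVE up(label, depth) AS (
            SELECT broader, 1 FROM term WHERE label = ?
            UNION ALL
            SELECT term.broader, up.depth + 1
            FROM term JOIN up ON term.label = up.label
            WHERE term.broader IS NOT NULL
        )
        SELECT label FROM up WHERE label IS NOT NULL ORDER BY depth
        """,
        (label,),
    ).fetchall()
    return [row[0] for row in rows]

print(ancestors(conn, "Montmartre"))
```

This works for a strict tree, but the single ‘broader’ column cannot hold a concept like the Basque Country, which sits under both France and Spain. Supporting that means a separate link table and, at that point, the schema is modelling a graph.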

Relational database: Advanced features summary

Feature 1 is complex
Feature 2 is complex

Using a document store

NoSQL storage solutions are currently very popular, offering massive scalability and a chance to break free from strict schemas and normal forms. From a document store perspective, the original entities look very different, compiled into a single structure:

<film>
  <name>Pulp Fiction</name>
  <description>The lives of two mob hit men...</description>
  <participants>
    <participant role="director">Quentin Tarantino</participant>
    <participant role="actor">Bruce Willis</participant>
    ...
  </participants>
  <tags>
    <tag>Palme d'Or Winner 1994 Cannes Film Festival</tag>
    <tag>Black comedy</tag>
    <tag>Neo-noir</tag>
    ...
  </tags>
</film>

The lists of available tags or participants may be handled using separate documents, or possibly ‘dictionaries’ of allowed terms.

Assuming again that the basic feature set has been implemented, the advanced features are now required:

Advanced feature 1: Versioned films

Keep multiple copies of the film document, each marked to indicate its version. All the references are inside the document, so even the tags and roles are versioned too. Many off-the-shelf CMS products offer built-in document versioning based on this process.
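A minimal sketch of that idea, using an in-memory Python class as a stand-in for a real document store. The class and method names are hypothetical; a real CMS will have its own API for this.

```python
import copy

class VersionedFilmStore:
    """Hypothetical in-memory stand-in for a document store."""

    def __init__(self):
        self._versions = {}  # film_id -> list of full document copies
        self._live = {}      # film_id -> index of the published copy

    def save_draft(self, film_id, document):
        """Store a complete copy as a new version and return its number."""
        versions = self._versions.setdefault(film_id, [])
        versions.append(copy.deepcopy(document))
        return len(versions)  # version numbers start at 1

    def publish(self, film_id, version):
        """Make an existing version live; rolling back is the same call."""
        if not 1 <= version <= len(self._versions.get(film_id, [])):
            raise KeyError(f"no version {version} for {film_id}")
        self._live[film_id] = version - 1

    def live(self, film_id):
        """Return a copy of the currently published document."""
        return copy.deepcopy(self._versions[film_id][self._live[film_id]])

store = VersionedFilmStore()
v1 = store.save_draft("pulp-fiction", {"name": "Pulp Fiction", "tags": ["Neo-noir"]})
store.publish("pulp-fiction", v1)
# A new draft can be edited while version 1 stays live.
v2 = store.save_draft(
    "pulp-fiction", {"name": "Pulp Fiction", "tags": ["Neo-noir", "Black comedy"]}
)
```

Because the tags and participants live inside the document, each copy carries its own snapshot of them, which is what makes roll-back, scheduled publishing and draft workflows cheap here.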

Advanced feature 2: Tag relationships

This is even worse now. Before, with a relational database, we could keep adding more structures, using SQL queries to extract the hierarchical relationships. It was only when moving to graphs that relationships became too complex to manage. Without SQL to query a structure of tags, bespoke plug-in code is required to extract meaning from a bespoke hierarchy of terms. This is now too complicated to build anything simple and reusable.

Document store: Advanced features summary

Feature 1 is simple
Feature 2 is very complex

Using a triple store

In triple stores, all the content is reduced to triples. So all our content would look something like this…

http://www.imdb.com/title/tt0450345/ -> dc:title -> “The Wicker Man”
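To make the shape of the data concrete, here is a toy sketch in plain Python, with no triple store involved: each triple is a (subject, predicate, object) tuple, and a query is a pattern with wildcards. The example.org entry and the match function are my own illustrations.

```python
# Each triple is a (subject, predicate, object) tuple.
triples = {
    ("http://www.imdb.com/title/tt0450345/", "dc:title", "The Wicker Man"),
    # Hypothetical second item, purely for illustration.
    ("http://example.org/film/1", "dc:title", "Example Film"),
}

def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# "Which subject has the title 'The Wicker Man'?"
print(match(triples, p="dc:title", o="The Wicker Man"))
```

A real triple store answers the same kind of pattern through SPARQL, with indexing and inference on top.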

Assuming again that the basic feature set has been implemented, the advanced features are now required:

Advanced feature 1: Versioned films

Versioning triples would not be straightforward. A complex overlaid system of ‘named graphs’ could be used to separate out content items into different versions, but this would be more complex than for relational databases.

Advanced feature 2: Tag relationships

This is simple! SPARQL queries will offer us all the inferred or aggregated content that we require, as long as the ontology has been created, and curated, with care.
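As an illustration of the kind of traversal such a query expresses, here is a sketch in plain Python over a handful of triples. The predicate names (ex:tag, ex:broader, ex:impliesNominee) and identifiers are hypothetical; a real ontology would define its own.

```python
from collections import deque

# Hypothetical identifiers and predicates, for illustration only.
triples = {
    ("film:1", "ex:tag", "tag:slapstick"),
    ("film:2", "ex:tag", "tag:palme-dor-2010-winner"),
    ("tag:slapstick", "ex:broader", "tag:comedy"),
    ("tag:palme-dor-2010-winner", "ex:impliesNominee", "tag:cannes-2010-nominees"),
}

# Two different relationships, both of which should feed aggregation pages.
INFERENCE_PREDICATES = {"ex:broader", "ex:impliesNominee"}

def implied_tags(triples, tag):
    """All tags reachable from `tag` (itself included) via inference predicates."""
    seen = {tag}
    queue = deque([tag])
    while queue:
        current = queue.popleft()
        for s, p, o in triples:
            if s == current and p in INFERENCE_PREDICATES and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen

def films_for_aggregation(triples, page_tag):
    """Films whose tags are, or imply, the tag behind an aggregation page."""
    return sorted(
        s for s, p, o in triples
        if p == "ex:tag" and page_tag in implied_tags(triples, o)
    )

print(films_for_aggregation(triples, "tag:comedy"))
```

The point is that adding a new kind of relationship is a matter of adding triples and a predicate, not redesigning a schema.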

Triple store: Advanced features summary

Feature 1 is very complex
Feature 2 is simple

Why a single content model can’t solve all your problems

In the introduction I explained how each of the different content repositories has advantages and disadvantages. It is possible to model most domains, entirely, in either relational databases, document stores or triple stores. However, as the examples above show, some features are too complicated to implement for some content repositories. The solution, in my opinion, is to use different content repositories, depending on what you need to do, even with the same domain.

As an analogy, serving a search engine from a relational database is not a good idea, as the content model is not well tuned to free-text searching. Instead a separate search engine would be created to serve the specific functionality, where relational content is flattened to produce a number of search indices, allowing fast, free-text search capabilities. As this example shows, for a particular problem, content must be modelled in an appropriate way.

Using a document store + triple store

For this solution, document content is converted into triples by means of a set of rules. These rules will be processed by means of a ‘bridge’ which will convert the change stream provided by the document store, such as Atom, into RDF or SPARQL. In this way, a triple store will contain a copy of the document data in the form of triples (with a small degree of latency due to the conversion process). The system now provides a document store for fast, scalable access to the full content, and a SPARQL endpoint providing all the inference and aggregation that is needed.
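To sketch what one of those conversion rules might look like, the following Python turns a film document into a list of triples. It is only the mapping step: reading the change stream and writing to the triple store are left out, and the ex: predicates and the example.org URI are assumptions of mine.

```python
import xml.etree.ElementTree as ET

FILM_XML = """
<film>
  <name>Pulp Fiction</name>
  <participants>
    <participant role="director">Quentin Tarantino</participant>
    <participant role="actor">Bruce Willis</participant>
  </participants>
  <tags>
    <tag>Black comedy</tag>
    <tag>Neo-noir</tag>
  </tags>
</film>
"""

def film_to_triples(film_uri, xml_text):
    """Apply fixed mapping rules to one film document and return triples."""
    root = ET.fromstring(xml_text.strip())
    triples = [(film_uri, "dc:title", root.findtext("name"))]
    for participant in root.iter("participant"):
        # Rule: the role attribute becomes a (hypothetical) ex: predicate.
        triples.append((film_uri, "ex:" + participant.get("role"), participant.text))
    for tag in root.iter("tag"):
        # Rule: every tag element becomes one ex:tag triple.
        triples.append((film_uri, "ex:tag", tag.text))
    return triples

for triple in film_to_triples("http://example.org/film/pulp-fiction", FILM_XML):
    print(triple)
```

In a real bridge the tag and participant values would be mapped to URIs from the ontology rather than left as plain strings, so that the relationships between them can be followed.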

The following diagram represents the architecture used in this solution. It combines a document store, and a triple store, to provide both atomic content management and relational structures and inferences:

Continuing with the advanced feature implementation…

Advanced feature 1: Versioned films

See ‘document stores’.

Advanced feature 2: Tag relationships

See ‘triple stores’.

Document store + triple store: Advanced features summary

Feature 1 is simple
Feature 2 is simple

By leveraging the innate advantages of two different content models, all the advanced features are simple to achieve!

Cross-domain aggregation…

What seems interesting about this proposed solution is that it involves the use of semantic web technologies without publishing any data interchange formats on the web. It would, however, seem a shame to stop short of publishing linked data. I would advocate that this approach is additionally used as a platform to support the publication of linked data, where the overhead of doing so, through the use of a triple store, is significantly reduced.

Combining this approach with ‘open linked data’ and SPARQL endpoints would demonstrate the power of modelling content as triples: expressing relationships in a reusable, standardised way, allows content to be aggregated within, or between, systems, ecosystems, or organisations.

Q&A

If a relational database is not used for content, how can data integrity be maintained?

In short, it is not. Integrity is applied at the point of entry through form validation and referential look-ups. Integrity then becomes the responsibility of systems higher up the chain (for example, the code that renders the page: if an actor doesn’t exist for a film, do not show them). If data integrity really is needed (for example, financial or user data), maybe an SQL database is a better choice; it is still possible to bridge content into the triple store if it is needed.
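A small sketch of what ‘integrity higher up the chain’ could mean in rendering code. The document shape and field names are hypothetical.

```python
def renderable_cast(film, known_people):
    """Return only the participants whose referenced person actually exists."""
    return [
        {"role": p["role"], "name": known_people[p["person_id"]]}
        for p in film.get("participants", [])
        if p.get("person_id") in known_people
    ]

# Hypothetical data: person 99 is referenced by the film but does not exist.
known_people = {1: "Quentin Tarantino"}
film = {
    "name": "Pulp Fiction",
    "participants": [
        {"role": "director", "person_id": 1},
        {"role": "actor", "person_id": 99},
    ],
}

print(renderable_cast(film, known_people))
```

The dangling reference is simply skipped at render time rather than prevented at write time.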

How is the ontology data managed?

This is a tricky one, and it really depends… I think there are three choices:

Curate the domain ontology manually: Editing and importing RDF/Turtle. Needs people with the relevant skills. Could be ideal for small, stable ontologies.
Use a linked data wiki: Some tools and frameworks exist for building semantic wikis (e.g. OntoWiki). This could be too generic an approach for some, more complex, ontologies.
Use a relational database CMS + data bridge: Despite the drawbacks of relational databases for content, it may still be a practical solution for ontological data. By building a CMS on a relational database we get all the mature database technology and data integrity, only leaving the tricky process of bridging the ontology data into a triple store.

And finally…

Some of the ideas in this post are based on projects happening at the BBC. Many thanks in particular to the BBC News & Knowledge team for sharing their recent endeavours into semantic web technologies.

MPs in the Media Mash-up

Posted on August 13, 2009 by daverog

At the end of July the Guardian held an internal hackday at their offices in King’s Cross. They invited me back, and another engineer from BBC Radio’s A&Mi department, Chris Lowis, came along with me. We teamed up with Leigh Dodds & Ian Davis from Semantic Web specialists Talis to produce an ‘Interactive-MP-Media-Appearance-Timeline’ by mashing up data from BBC Programmes and the Guardian’s website.

Before the event Talis extracted data about MPs from the Guardian’s Open Platform API and converted it into a Linked Datastore. This store contains data about every British MP, the Guardian articles in which they have appeared, a photo, related links and other data. Talis also provide a SPARQL endpoint to allow searching and extraction of the data from the store.

Coincidentally, the BBC programmes data is also available as a linked datastore. By crawling this data using the MP name as the search key we were able to extract information about the TV and radio programmes in which a given MP had appeared. A second datastore was created from the combination of these two datasets, and by pulling in some related data from dbpedia. Using this new datastore we created a web application containing an embedded visualisation of the data.

We created the web-app using the lightweight ruby web-framework Sinatra. A simple RESTful URL schema provided access to a web page showing basic information about an MP.

Nick Clegg: a busy man in 2009

In addition we queried the datastore to give a list of all of the MP’s appearances across Guardian and BBC content. This was returned as a JSON document, and passed to an embedded Processing Java applet. A Java applet may seem like an unusual choice in 2009, but Processing is an excellent choice for the rapid development of responsive graphics applications, due to its integration with existing Java libraries, and its powerful graphics framework.

Leigh at Talis put together a screencast showing the app in action:

The Processing applet shows a month-by-month scrollable timeline. The user can move back and forward in time, at variable speeds, by pressing the mouse either side of the applet frame. In each month slot, a stack of media appearances is displayed, represented by the logo of the BBC brand, or in the case of Guardian articles, the Guardian brand. Moving the mouse over a media appearance reveals the headline or programme description, and clicking a media appearance will navigate the browser to the episode page on /programmes or the article page on guardian.co.uk.
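The month-slot grouping can be sketched as follows. This is not the code we wrote on the day (the applet was written in Processing), and the JSON field names and sample values are invented for illustration.

```python
import json
from collections import defaultdict

def group_by_month(appearances_json):
    """Bucket appearances into 'YYYY-MM' slots, oldest month first."""
    months = defaultdict(list)
    for item in json.loads(appearances_json):
        months[item["date"][:7]].append(item)  # ISO date -> "YYYY-MM"
    return dict(sorted(months.items()))

# Hypothetical JSON document; field names and values are placeholders.
SAMPLE = json.dumps([
    {"date": "2009-07-02", "source": "guardian", "title": "Example article"},
    {"date": "2009-06-15", "source": "bbc", "title": "Example programme"},
    {"date": "2009-07-20", "source": "bbc", "title": "Another programme"},
])

for month, items in group_by_month(SAMPLE).items():
    print(month, [item["source"] for item in items])
```

Each bucket corresponds to one month slot on the timeline, with the ‘source’ field deciding which brand logo to draw.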

We demonstrated the application to the hackday audience, and in the prize giving ceremony were awarded the ‘Best use of third-party data’ award. We think that the application demonstrates some of the ways the structured RDF data provided by BBC’s /programmes website can be used. This project shows how powerful the linked-data concept is when used in conjunction with other data that has been exposed in a similar way. As more media organisations expose their domains in this manner, more interesting and wide-reaching visualisations and web-applications can be built.

Thanks to Chris Lowis for contributions to this post. Photos courtesy of Mat Wall.

Now also published on the BBC’s Internet blog