A scenario that repeatedly arises when using a triple store to store and query RDF data is the handling of updates to the data. The triple store can effectively be regarded as a single, large graph, where updates come in the following forms:
- Add this new thing
- Remove this old thing
- Update thing X
Essentially the CUD of CRUD, where the R is covered by SPARQL SELECT, DESCRIBE, CONSTRUCT and ASK.
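As a minimal sketch of those three operations – using plain Python sets of subject/predicate/object tuples rather than any particular triple store's API, with invented data – the CUD amounts to:

```python
# A triple store modelled as a set of (subject, predicate, object) tuples.
# The names and data here are illustrative, not a real store's interface.
store = set()

def add(triples):
    """Create: add new triples to the graph."""
    store.update(triples)

def remove(triples):
    """Delete: remove old triples from the graph."""
    store.difference_update(triples)

def update(old_triples, new_triples):
    """Update: in RDF terms, a remove followed by an add."""
    remove(old_triples)
    add(new_triples)

add({(":beatles", ":name", "The Beatles")})
update({(":beatles", ":name", "The Beatles")},
       {(":beatles", ":name", "Beatles, The")})
```

The point of the sketch is that RDF has no in-place update: an update is always expressed as triples to remove plus triples to add, which is where the difficulty of defining the 'thing' begins.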
The problem comes with defining the 'thing'. When considering the 'thing', for example a person or place, the underlying philosophies of Linked Data stand in contrast to the aim of clearly defining its properties and structure:
- Anyone can say anything about anything
- Data can be sparse
- Data can be inconsistent, or contradictory
The additional complexity of clearly defining the shape of a 'thing' explains, I think, why approaches to RDF updates often avoid the CRUD approach and take more direct, but problematic, routes.
Problematic approaches to RDF updates
The first problematic approach is the ingest of huge datasets using one-off processes such as data-dumps or R2RML.
This approach will not scale to frequently updating systems. It might be workable for data warehousing, data analysis or systems with very static data needs, but fast-moving systems with frequent changes are disrupted by large updates. This is particularly so with the majority of triple store implementations, where the ingest of a large dataset effectively blocks all other updates.
The second problematic approach is fine-grained change-sets: small sets of triples to remove and add, applied in sequence. The problem here is that the process must be treated like a dbdeploy script: each update must be applied in a precise order to a known state. A change-set can corrupt the data if applied before or after it is supposed to be. In practice, this makes the use of this approach for updates brittle and error-prone. One mistake and you're in rollback-hell, and managing this across multiple environments, each with a differing state, is overly complex.
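To make the ordering problem concrete, here is a toy sketch (plain Python, invented data) of two change-sets that only produce the intended result when applied in sequence:

```python
# Each change-set lists triples to remove and triples to add, and
# implicitly assumes the store is in a precise prior state.
changeset_1 = {  # fixes a typo in the artist's name
    "remove": {(":artist1", ":name", "The Beetles")},
    "add":    {(":artist1", ":name", "The Beatles")},
}
changeset_2 = {  # a later rename, assuming changeset_1 already ran
    "remove": {(":artist1", ":name", "The Beatles")},
    "add":    {(":artist1", ":name", "Beatles, The")},
}

def apply_changeset(store, cs):
    store -= cs["remove"]   # removing an absent triple silently no-ops
    store |= cs["add"]

in_order = {(":artist1", ":name", "The Beetles")}
apply_changeset(in_order, changeset_1)
apply_changeset(in_order, changeset_2)
# in_order now holds the single, intended name

out_of_order = {(":artist1", ":name", "The Beetles")}
apply_changeset(out_of_order, changeset_2)  # its remove matches nothing
apply_changeset(out_of_order, changeset_1)
# out_of_order now holds two conflicting names at once
```

Because a remove of a non-existent triple fails silently, the corruption goes unnoticed at apply time: the store simply ends up describing the artist twice.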
A solution – object updates
One solution is to take a step back from the data, and think about how you might want the updates to happen from the perspective of the domain. For example, take MusicBrainz data, which will follow a very particular update pattern:
- Updates clustered around musical artists or groups will be common
- Updates spread widely and thinly across musical artists and groups will be rare
Therefore, an update feed that provides the complete information for a musical artist or group each time any part of it is updated will be a reasonably efficient method of performing updates. The overhead is that a decision somehow needs to be made about how to chop up the data within the domain of music, raising questions such as:
- Which ‘things’ will be chosen as the unit-of-update?
- Which properties belong to which ‘things’?
To be efficient, it is important to update any given property along with the 'thing' with which it is most likely to be updated.
What is described here is essentially a form of object-graph-mapping, where the ‘object’ is synonymous with the previously discussed ‘thing’. The domain is divided up into a set of object classes, where the data is deterministically assigned to instances of these object classes.
To continue with the music example, the domain could be split into artists and compositions, with a property like 'compose' connecting artists with compositions considered part of the composition object. This is because the most likely update scenario is that a new composition is added to the database, with the artist assigned when the composition is added. A less likely scenario is that all the compositions an artist produced will be added or updated along with the artist's basic information, like name and date of birth. Clearly this will sometimes be a subjective decision.
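A sketch of that deterministic assignment, in plain Python with invented predicate names: each predicate is mapped to exactly one object class, so any triple can be routed to the object it will travel with in the update feed.

```python
# Route each predicate to the object class whose updates it travels with.
# These predicate names and the mapping itself are illustrative only.
PREDICATE_TO_CLASS = {
    ":name":          "artist",
    ":date_of_birth": "artist",
    ":title":         "composition",
    ":compose":       "composition",  # links artist to composition, but is
                                      # most likely to change when the
                                      # composition does, so it lives there
}

def object_class(triple):
    """Deterministically assign a triple to an object class by predicate."""
    subject, predicate, obj = triple
    return PREDICATE_TO_CLASS[predicate]
```

The subjective design decision discussed above is captured entirely in the mapping table; the routing itself is then mechanical.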
Advantages of object updates
- A sub-set of the data can be held by a system consuming the updates
- If the updates are processed out-of-order, or not from the beginning:
  - The resulting data is likely to be eventually consistent
  - The resulting data remains structurally sound
- Updates scale regardless of the total size of the complete dataset
Disadvantages of object updates
- The data has to be divided up into object classes, which need to be designed and maintained
- Additional data is required to perform an update – it must be clear which data in the triple store needs to be removed before applying the new set of data (see below)
The advantages above show that this approach solves the problems with both the heavy-weight and fine-grained approaches.
What would an object update feed look like?
An update feed, as described above, would be simple:
- Each update would be an (ideally) small graph representing one ‘object’, such as an individual artist or recording
- Metadata would be provided to indicate required additional information, such as:
  - Type of 'object'
  - Who made the change
  - When the change was made
- Allowing activities such as filtering the feed by object type, or auditing who changed what, and when
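One way such a feed entry might be structured – an illustrative sketch in plain Python, not any standard serialisation, with all names and values invented – is a small record carrying the metadata alongside the object's complete set of triples:

```python
# A hypothetical update-feed entry: metadata plus the full set of triples
# for one object, replacing any previous version of that object wholesale.
entry = {
    "object":      ":elgar",
    "object_type": "artist",                # type of 'object'
    "changed_by":  "editor42",              # who made the change
    "changed_at":  "2011-06-01T09:30:00Z",  # when the change was made
    "triples": [
        (":elgar", ":name", "Edward Elgar"),
        (":elgar", ":date_of_birth", "1857-06-02"),
    ],
}

def is_valid(entry):
    """A consumer can sanity-check an entry before applying it."""
    required = {"object", "object_type", "changed_by", "changed_at", "triples"}
    return required <= entry.keys() and all(
        len(t) == 3 for t in entry["triples"])
```

Because each entry carries the whole object, a consumer that only cares about artists can filter on `object_type` and ignore everything else, which is what makes the sub-setting advantage above possible.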
Performing updates with an object update feed
As indicated in the disadvantages above, in order to update an object it is necessary to know exactly which triples need to be removed – the triples that describe the previous version of the object. I have seen two approaches here. The first is to use a SPARQL query to pull in the triples for removal. This approach is just as brittle as the fine-grained approach, so I would not recommend it. A better alternative is to use contexts, or named graphs – where each object resides in its own context, which can be removed on update.
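A sketch of the named-graph approach, modelling the store as a plain Python dict from graph name to triple set (in SPARQL 1.1 Update terms this would be roughly a DROP GRAPH followed by an INSERT DATA into that graph); the graph names and data are invented:

```python
# The store: one named graph (context) per object.
store = {
    ":graph/elgar": {
        (":elgar", ":name", "Edward Elgar"),
    },
}

def apply_update(store, graph_name, new_triples):
    """Replace an object's graph wholesale. The old version is dropped
    in one step, so there is no need to compute which triples to remove."""
    store[graph_name] = set(new_triples)

apply_update(store, ":graph/elgar", [
    (":elgar", ":name", "Edward Elgar"),
    (":elgar", ":date_of_birth", "1857-06-02"),
])
```

Note that the replacement is idempotent: re-applying the same update, or applying a newer one without having seen the intermediate versions, still leaves the graph structurally sound, which is what makes the feed tolerant of out-of-order processing.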
I am of the opinion that context should be used exclusively for data management within a triple store. But perhaps I will leave that discussion for a later post.