On digital libraries

May 3, 2008

Understanding the OAI-ORE data model

Filed under: Uncategorized, by hochstenbach @ 6:55 am

OAI-ORE is the new data exchange model proposed by the Open Archives Initiative. It is currently still in alpha, with a 1.0 release hoped for by the end of 2008. I have to admit that the data model and the newsgroups make for tough reading when you are not totally into Semantic Web technologies like OWL, Named Graphs, SPARQL and the various RDF vocabularies. I’ll try to make some general observations on OAI-ORE from a programmer’s viewpoint without going into much detail on syntax and semantics. However daunting the details may be, this is quite an exciting subject area that is attracting a large audience.

Three concepts are important when reading the OAI-ORE specs: resources, aggregations and named graphs. These are the building blocks on which OAI-ORE is based. Understanding their meaning and usage will give you much insight into the data model.

I

First, resources. We all know, more or less, what resources are. You download them from the Internet or point to them in your HTML href attributes. But from a formal RDF standpoint resources are something quite abstract. The RDF Semantics document states that a resource is a very generic term that can stand for ‘anything in the universe of discourse’. An RDF resource could be an image of Bill Gates or a poem by William Shakespeare, but also the coffee mug on my table or a concept like General Relativity. Anything that can be named can be an RDF resource. How do you name resources? With URIs. These URI names are like instances of the java.net.URI class in Java: they have all the attributes and behavior of URIs, but you don’t use them to fetch data. In RDF Semantics we read: “[RDF] semantics doesn’t assume any particular relationship between […] a URI […] and a document or Web resource which can be retrieved by using that URI reference.” And in RDF Concepts: “[…] nothing requires that an RDF application be able to retrieve any representation of resources identified by the URIs”. In practice this means that RDF applications use URIs as names (comparing names and defining relations between names), not as actionable entities (having MIME types, sizes, last modification dates). This makes the RDF world a network overlay that relates resources by their names without much participation in the Web Architecture.

Things change when RDF graphs are serialized in documents that are part of the Web Architecture. Then we get Tim Berners-Lee’s vision of the Semantic Web. Here URIs not only name things but also point to RDF/XML documents, which contain new URIs that point to other RDF/XML documents, and so on. URIs in this form are resource locators, as in URLs, with MIME types, protocols, gateways, proxies, etc. When we read the OAI-ORE documents and see protocol-based URIs, read URLs. The resources in OAI-ORE not only name things, but should also be used to retrieve things. There is some discussion about how the OAI-ORE ecology should work (how much to retrieve, when to retrieve, when links between two resources are conceptual or physical, etc). But this is the basic idea: OAI-ORE behaves like HTML in treating resources. No abstract naming, real linking.
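To make the "URI as a name" idea a bit more tangible, here is a minimal Python sketch. The two example.org URIs and the same_name helper are made up for illustration; the point is only that you can take a URI apart and compare it with another one without ever touching the network.

```python
from urllib.parse import urlsplit

# Hypothetical URIs, used purely as names. Nothing is ever fetched.
MUG = "http://example.org/things/coffee-mug"
RELATIVITY = "http://example.org/concepts/general-relativity"


def same_name(a: str, b: str) -> bool:
    """Compare two URIs as names: plain string equality, no retrieval."""
    return a == b


# A URI still has all the usual parts, even when it only names a coffee mug.
parts = urlsplit(MUG)
print(parts.scheme, parts.netloc, parts.path)

print(same_name(MUG, MUG))
print(same_name(MUG, RELATIVITY))
```

Whether anything lives at those addresses is irrelevant to this code, which is exactly the situation the RDF documents describe.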

II

Second, aggregations. Why or when do we use OAI-ORE? We need OAI-ORE to identify a group of web resources (read URLs) and make assertions about this group. OAI-ORE originates from the library world, so most examples will use publications. Take Marvin Minsky’s “K-Lines: A Theory of Memory” article in MIT’s institutional repository: http://dspace.mit.edu/handle/1721.1/5739. This is an HTML page created for humans to understand.

We see a metadata record and two versions of Marvin’s article, one in PS format and one in PDF format. Computers, however, don’t have such an easy time. They see this:

<a target="_blank" href="/bitstream/1721.1/5739/1/AIM-516.ps">View/Open</a>
[..]

<a target="_blank" href="/bitstream/1721.1/5739/2/AIM-516.pdf">View/Open</a>

A programmer would need to screen scrape this page, find the hrefs which contain the ‘/bitstream/’ paths, and infer from the filename extensions that these could be versions of the same document. This is not an easy thing to do on a worldwide scale. It would be better to have a machine-readable RDF/XML document which provides a structured view of this webpage, with all the semantics needed to understand the relations between the resources. In a Semantic Web world we would like to refer to Marvin’s article. Wouldn’t it be better still to point to an address where a machine-readable RDF/XML document could be found? This is OAI-ORE again.
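To show what that screen scraping amounts to, here is a minimal Python sketch using only the standard library. The two bitstream links are the ones from the page; the third link, the class name and the helper name are assumptions added for illustration. Note that the grouping rule (same filename without extension means same document) is a guess a programmer has to make, not something the page tells us.

```python
from collections import defaultdict
from html.parser import HTMLParser
from pathlib import PurePosixPath

# The two bitstream links from the page, plus one made-up link that
# should be ignored by the scraper.
PAGE = """
<a target="_blank" href="/bitstream/1721.1/5739/1/AIM-516.ps">View/Open</a>
<a target="_blank" href="/bitstream/1721.1/5739/2/AIM-516.pdf">View/Open</a>
<a href="/some/other/page">Hypothetical unrelated link</a>
"""


class BitstreamLinks(HTMLParser):
    """Collect the href of every <a> whose path contains '/bitstream/'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and "/bitstream/" in href:
            self.hrefs.append(href)


def group_by_stem(hrefs):
    """Guess that files sharing a name (minus extension) are one document."""
    groups = defaultdict(list)
    for href in hrefs:
        groups[PurePosixPath(href).stem].append(href)
    return dict(groups)


parser = BitstreamLinks()
parser.feed(PAGE)
versions = group_by_stem(parser.hrefs)
print(versions)
```

Every repository lays out its pages differently, so a scraper like this has to be rewritten for each one. That is the effort a structured description of the page would save.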

There is some discussion about how best to serialize OAI-ORE documents, in RDF/XML or in Atom. But this is the idea: OAI-ORE is especially suited to talking about groups or aggregations of resources. Because OAI-ORE is based on RDF, it can do a lot more than that and will interact with other RDF resources on the Semantic Web. But talking about aggregations of web resources (giving them identifiers, creators, provenance) is ORE’s niche.

Why is an OAI-ORE Aggregation so special, you might ask? Couldn’t existing concepts such as RDF containers or RDF collections be used? Indeed, OAI-ORE Aggregations behave a bit like an rdf:Bag in some situations and like an rdf:Seq or rdf:Alt in others. Aggregations act as a generic RDF container, but add a lot more semantics to it. As you might know, RDF containers have a very limited vocabulary. As RDF Semantics states: “any ‘natural’ assumptions concerning RDF containers are not formally sanctioned by the RDF model theory”. OAI-ORE tries to get some of those ‘natural’ assumptions back into aggregations.

III

Third, Named Graphs. RDF statements are triples composed of a subject, a predicate and an object. Create many triples and you have an RDF graph. When exchanging and merging RDF graphs between computers, the problem is that you lose track of who created each graph. There is no good way in RDF to add provenance metadata to an RDF graph, and when two graphs are merged in a triple store this information is usually lost. This is bad for aggregations. You want to know who made the aggregation, at what time and for what reason. You could do this by adding new triples to the graph stating that each of the original triples was created by Mr. X. This process is called reification, but it is a bit problematic for two reasons: 1) you need very many triples to make a statement about a whole graph, and 2) even if you did all the work, the triples do not mean what you might hope.
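Here is a small Python sketch of both problems, with graphs modelled as plain sets of (subject, predicate, object) tuples rather than a real triple store. The example.org namespace and the hasVersion and createdBy properties are made up; the rdf:Statement, rdf:subject, rdf:predicate and rdf:object terms are the standard RDF reification vocabulary.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace for this sketch

# Two graphs made by two different people, each a set of triples.
graph_x = {(EX + "article", EX + "hasVersion", EX + "article.pdf")}
graph_y = {(EX + "article", EX + "hasVersion", EX + "article.ps")}

# Merging is just set union. Nothing in the result says which triple
# came from which graph, so the provenance is gone.
merged = graph_x | graph_y
print(len(merged))


def reify(triple, statement_id, creator):
    """Describe one triple with the RDF reification vocabulary and
    attach a (made-up) createdBy property to that description."""
    s, p, o = triple
    return {
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", s),
        (statement_id, RDF + "predicate", p),
        (statement_id, RDF + "object", o),
        (statement_id, EX + "createdBy", creator),
    }


# Five extra triples to say who made a single triple.
extra = reify(next(iter(graph_x)), EX + "statement1", "Mr. X")
print(len(extra))
```

For a graph of n triples this approach needs 5n extra triples, and they still only describe statements one by one, never the graph as a whole.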

To solve this problem, several extensions to the RDF model and syntax have been proposed. None of them has made it into a W3C recommendation, alas. OAI-ORE uses the Named Graph approach created by Carroll, Bizer, Hayes and Stickler. The idea is to give every graph a name, a URI. When you use this URI in a triple you are, by definition, making a statement about the whole graph. There is some discussion about how this graph name is best serialized, because RDF doesn’t have any notion of Named Graphs. OAI-ORE uses a combination of two proposals:

  • If you retrieve an RDF/XML document which contains an xml:base, then this is the name of the graph.
  • If you retrieve an RDF/XML document which doesn’t have an xml:base, then the URL location of the document is the name of the graph.
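The two rules above can be sketched in a few lines of Python. This is only my reading of the rules as described here, not a normative algorithm from the specs: the graph_name function and the example.org URIs are made up, and for simplicity the sketch only looks for xml:base on the root element.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def graph_name(rdf_xml: str, retrieved_from: str) -> str:
    """Return xml:base of the root element if present, otherwise the
    URL the document was retrieved from."""
    root = ET.fromstring(rdf_xml)
    return root.attrib.get(XML_BASE, retrieved_from)


# Two tiny, empty RDF/XML documents with hypothetical URIs.
WITH_BASE = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xml:base="http://example.org/rem/1"/>'
)
WITHOUT_BASE = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
)

print(graph_name(WITH_BASE, "http://example.org/mirror/rem-1.xml"))
print(graph_name(WITHOUT_BASE, "http://example.org/mirror/rem-1.xml"))
```

Note how the same document fetched from a mirror gets a different graph name as soon as the xml:base is missing.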

Both have their advantages and disadvantages. The biggest problem is that one URI serves as the identifier of a resource, the name of a graph and the web location of a graph. So, when writing RDF applications for Named Graphs you need to take great care to figure out what a URI really means. Another problematic point is that every Named Graph should be in a separate document. For instance, it would not be possible to give names to two sub-graphs using only one OAI-ORE document. And third, it is not clear what happens to the model when you mix graphs with names and graphs without names.

Named Graphs in OAI-ORE are used to give a name to the aggregation graph. This graph is called a ‘Resource Map’ and its URI is called URI-R. Because of the identifier problems mentioned above, URI-R is at once the name of the aggregation graph, the identifier of the ‘Resource Map’ document used to add provenance metadata to the aggregation graph, and the location of the ‘Resource Map’ document itself. When reading the data model you need to keep these different usages apart. It is also possible to refer to an aggregation outside the context of a URI-R, by using the name of the aggregation container itself, called URI-A.

To conclude, an OAI-ORE aggregation is a resource with the name URI-A. The aggregation behaves a bit like an RDF container and contains resources which can all be interrelated. All the triples of the aggregation and its (inter)relations make up a graph. This graph has been given the name URI-R. This URI-R is used as a resource to add provenance metadata to the aggregation graph.
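As a rough sketch of how these pieces fit together, here is the Minsky example written as a list of (graph, subject, predicate, object) tuples in Python. Several things are assumptions: the URI-A and URI-R values are made up, the property names ore:describes and ore:aggregates come from the ORE vocabulary as I understand it and are not quoted from the alpha specs, and the bitstream URLs are the relative hrefs from the page resolved against dspace.mit.edu.

```python
ORE = "http://www.openarchives.org/ore/terms/"  # assumed ORE namespace
DCTERMS = "http://purl.org/dc/terms/"

URI_A = "http://example.org/aggregation/5739"  # hypothetical aggregation name
URI_R = "http://example.org/rem/5739"          # hypothetical Resource Map name

AGGREGATED = [
    "http://dspace.mit.edu/bitstream/1721.1/5739/1/AIM-516.ps",
    "http://dspace.mit.edu/bitstream/1721.1/5739/2/AIM-516.pdf",
]


def resource_map(uri_r, uri_a, resources, creator):
    """Build the named graph uri_r: provenance hangs on uri_r,
    membership hangs on uri_a."""
    quads = [
        (uri_r, uri_r, ORE + "describes", uri_a),
        (uri_r, uri_r, DCTERMS + "creator", creator),
    ]
    quads += [(uri_r, uri_a, ORE + "aggregates", r) for r in resources]
    return quads


for quad in resource_map(URI_R, URI_A, AGGREGATED, "Mr. X"):
    print(quad)
```

The first element of every tuple is URI-R, the graph name. Statements about who made the map have URI-R as subject, while statements about what belongs together have URI-A as subject.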

I hope these concepts give you some clues on how to interpret the OAI-ORE data model. Personally I think OAI-ORE is a very good step in the direction of a truly Semantic Web. But I am still catching up with all the acronyms used in the data model, and with their consequences.

1 Comment »

  1. Patrick,

    I’ll just toss in that I created an addon in DSpace@MIT for the XMLUI which generates an RDF ORE Map for the Item…it is linked from the html page so that semantic enabled web browsers (Tabulator) are able to access it in reference to the html page.

    http://dspace.mit.edu/metadata/handle/1721.1/5739/rdf.xml

    Mark

    Comment by Mark Diggory — June 10, 2009 @ 7:19 pm

