On digital libraries

May 11, 2008

<about> { Named Graphs }

Filed under: Uncategorized. Posted by hochstenbach @ 5:15 pm

In my previous post I explained that RDF graphs are collections of triples, each containing a subject, a predicate and an object. The standard serialization for RDF graphs is RDF/XML, and the resulting documents are made available on the World Wide Web. RDF agents harvest these RDF/XML documents and store the resulting merged graph in triple stores. These triple stores can be queried using the SPARQL language, which is the SQL of the Semantic Web. SPARQL is used for further processing of the information carried in the RDF graphs.

Fig 1. Simplified view of a Semantic Web application

One problem with storing RDF/XML graphs in a triple store is that information about the origin of the RDF graph could be lost, depending on the triple store used [Ref]. To show how this can happen, take a look at Fig 1. An RDF graph is created which contains statements about characters appearing in one of Shakespeare’s plays. The triples of the RDF graph live, for the sake of argument, as information spread across the tables and columns of a database application. We use a pseudo-N3 notation in this blog to display the in-memory view of the docA RDF graph:

<#Romeo> r:loves <#Juliet>
<#Juliet> r:daughterOf <#LadyCapulet>
<#Mercutio> r:friendOf <#Romeo>

To serialize this RDF graph, an RDF/XML document ‘docA’ is created and published on a web server. There are other RDF/XML documents on the Internet called ‘docB’, ‘docC’, etc. Here is the in-memory view of docB:

<#Romeo> r:sonOf <#LordMontague>

Note that the RDF/XML serializations of these graphs are not shown in these examples.

An RDF application harvests all these documents, processes them and stores them in a triple store. The graphs are not stored as two separate documents. Instead, a merged graph is created which contains the combined triples of all documents. If the two graphs docA and docB both contain statements about Romeo, then all these statements are thrown on one heap in the triple store:

 <#Romeo> r:loves <#Juliet>
 <#Romeo> r:sonOf <#LordMontague>
 <#Juliet> r:daughterOf <#LadyCapulet>
 <#Mercutio> r:friendOf <#Romeo>

Without special precautions, it is not possible to say which graph made which statement on Romeo.
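To make the loss of origin concrete, here is a minimal sketch in Python. It assumes triples are plain tuples and that the triple store is nothing more than a set; real triple stores are of course more sophisticated, and the function name is my own invention.

```python
# Triples are modelled as plain (subject, predicate, object) tuples.
doc_a = {
    ("#Romeo", "r:loves", "#Juliet"),
    ("#Juliet", "r:daughterOf", "#LadyCapulet"),
    ("#Mercutio", "r:friendOf", "#Romeo"),
}
doc_b = {
    ("#Romeo", "r:sonOf", "#LordMontague"),
}

# A naive triple store: merging two graphs is just a set union.
merged = doc_a | doc_b


def statements_about(store, subject):
    """Return all triples in the store with the given subject, sorted."""
    return sorted(t for t in store if t[0] == subject)


romeo = statements_about(merged, "#Romeo")
for triple in romeo:
    # Nothing in a triple says whether it came from docA or docB.
    print(triple)
```

Both statements about Romeo come back from the merged set, but the tuples carry no trace of the document that contributed them.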

To solve this problem, a process called RDF reification can be used. Reification means making statements about statements. We could say that “Romeo loves Juliet” was created by docA:

 <docA> a:type a:Statement
 <docA> a:subject <#Romeo>
 <docA> a:predicate r:loves
 <docA> a:object <#Juliet>

This means something like “docA says: ‘Romeo loves Juliet’”. Do this for all the statements in all the graphs in docA, docB, etc., store them again in the triple store, and you will have the context in which all statements were made. This is correct and works most of the time. However, formally you’ve created something that might mean something different than you hope [Ref]. RDF is highly expressive, with layered semantics on top of which ontologies, rules, logic and proof of statements can be added. Reification adds to RDF the ability to create statements about statements, but the resulting reified triple doesn’t have the same expressive power. A reified triple isn’t the triple itself. If we created exactly the same reified triple for a docC:

 <docC> a:type a:Statement
 <docC> a:subject <#Romeo>
 <docC> a:predicate r:loves
 <docC> a:object <#Juliet>

then we can’t conclude that the same statement “Romeo loves Juliet” appears in both documents [Ref]. In RDF, reification is not a quoting mechanism [Ref].
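A small Python sketch can illustrate the bookkeeping side of reification. It reuses the pseudo a: vocabulary from the listings above and models the store as a set of tuples. The helper name reify is hypothetical, and the sketch says nothing about the formal RDF semantics; it only shows that the reified description and the triple itself are different things in the store.

```python
def reify(statement_id, triple):
    """Describe a triple with four new triples about statement_id."""
    s, p, o = triple
    return {
        (statement_id, "a:type", "a:Statement"),
        (statement_id, "a:subject", s),
        (statement_id, "a:predicate", p),
        (statement_id, "a:object", o),
    }


romeo_loves_juliet = ("#Romeo", "r:loves", "#Juliet")

# Reify the same statement once for docA and once for docC.
store = reify("docA", romeo_loves_juliet) | reify("docC", romeo_loves_juliet)

# Eight triples now describe the statement ...
print(len(store))
# ... but the statement itself was never added to the store.
print(romeo_loves_juliet in store)
```

Four triples per reified statement also gives a feeling for how quickly the store grows when every statement of every document is treated this way.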

Over the years, extensions to RDF have been proposed that add contextual information to RDF graphs in other ways. One proposal is to move from triples (subject, predicate, object) to quads (context, subject, predicate, object) [Ref]. But this solution depends on client-side adaptation. Another proposal, called Named Graphs, lets RDF graph creators give names (URIs) to their (sub)graphs [Ref]. This last proposal works like this: if RDF graph docA has triples like:

<#Romeo> r:loves <#Juliet>
<#Juliet> r:daughterOf <#LadyCapulet>
<#Mercutio> r:friendOf <#Romeo>

we can name this graph with a URI ‘graphA’:

<graphA> {
  <#Romeo> r:loves <#Juliet>
  <#Juliet> r:daughterOf <#LadyCapulet>
  <#Mercutio> r:friendOf <#Romeo>
}

The same can be done for RDF graph docC:

  <#Romeo> r:loves <#Juliet>

with a name ‘graphC’ we get:

<graphC> {
   <#Romeo> r:loves <#Juliet>
}

The Named Graph approach defines that any statement about a graph name (like graphA, graphC) is a statement about the graph-as-a-whole. It is now possible to compare both statements “Romeo loves Juliet” and find out that one was produced by graphA and the other by graphC. Named Graph-enabled triple stores (e.g. Jena) add this name as extra information which can be used in SPARQL queries. We can also add statements about the graph-as-a-whole. E.g.

<graphA> {
  <#Romeo> r:loves <#Juliet>
  <#Juliet> r:daughterOf <#LadyCapulet>
  <#Mercutio> r:friendOf <#Romeo>
  <graphA> d:creator <#Peter>
}

Here we made ‘Peter’ the creator of the RDF graph named ‘graphA’.
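The idea can be sketched in Python as a dictionary from graph name to a set of triples. This is only an illustration of the data structure, not of how Jena or any other triple store implements it; the two helper functions are hypothetical names of my own.

```python
# Each named graph maps its name to its own set of triples.
named_graphs = {
    "graphA": {
        ("#Romeo", "r:loves", "#Juliet"),
        ("#Juliet", "r:daughterOf", "#LadyCapulet"),
        ("#Mercutio", "r:friendOf", "#Romeo"),
        ("graphA", "d:creator", "#Peter"),
    },
    "graphC": {
        ("#Romeo", "r:loves", "#Juliet"),
    },
}


def graphs_asserting(graphs, triple):
    """Return the names of all graphs that contain the given triple."""
    return sorted(name for name, triples in graphs.items() if triple in triples)


def creator_of(graphs, graph_name):
    """Find a d:creator statement whose subject is the graph name."""
    for triples in graphs.values():
        for s, p, o in triples:
            if s == graph_name and p == "d:creator":
                return o
    return None


sources = graphs_asserting(named_graphs, ("#Romeo", "r:loves", "#Juliet"))
print(sources)
print(creator_of(named_graphs, "graphA"))
```

In SPARQL the comparable lookup is written with the GRAPH keyword, which binds the graph name next to the triple pattern.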

Named Graphs and quads are rapidly gaining popularity in the Semantic Web community, with projects such as ORE, POWDER and OWL seeking ways to add metadata to (sub)graphs [Ref]. Unfortunately, serialization of Named Graphs in RDF/XML documents is problematic. There is no support for adding names in the current XML format. One suggestion is to use the URI of the RDF/XML document itself as the name of the graph. E.g. if I created an RDF/XML document like [namespace declarations omitted]:

<RDF>
 <Description ID="Romeo">
   <r:loves resource="#Juliet"/>
 </Description>
…
</RDF>

and published this as ‘doc1’, then the Named Graph triples would become:

 <doc1> {
   <#Romeo> r:loves <#Juliet>
 }

This method has the disadvantage that the URI used to name the graph is terribly overloaded. Because ‘doc1’ is used both as the location of the RDF/XML document and as the name of the graph, conflicts can occur. E.g. when I create a triple:

  <doc1> r:owner "root"

Is the graph owned by ‘root’ (as in UNIX ownership) or the RDF/XML document? Probably the latter.

Another possible solution is to use a construct called ‘xml:base’ which provides a base URI for XML documents, and define this xml:base as graph name:

 <RDF xml:base="ABCD">
   <Description ID="Romeo">
     <r:loves resource="#Juliet"/>
   </Description>
   …
 </RDF>

Which would result in these triples:

 <ABCD> {
   <#Romeo> r:loves <#Juliet>
 }

This method (like the previous one) has the disadvantage that each (sub)graph you want to name should appear in a separate RDF/XML document, which can be problematic in many use cases.

A third proposal is being considered. By extending RDF/XML with a new attribute rdf:graph, any description which carries this attribute will be ‘stored’ in a graph named by the value of the attribute. E.g. if the RDF/XML in doc1 contained:

<RDF>
  <Description ID="Romeo" graph="#gA">
    <r:loves resource="#Juliet"/>
  </Description>
  <Description ID="Romeo" graph="#gB">
    <r:sonOf resource="#LordMontague"/>
  </Description>
  <Description ID="gA" graph="#gA">
    <d:creator>Peter</d:creator>
  </Description>
  <Description ID="gB" graph="#gB">
    <d:creator>Mary</d:creator>
  </Description>
</RDF>

Then this would be equivalent to these Named Graph triples:

<doc1> {
  <#Romeo> r:loves <#Juliet>
  <#Romeo> r:sonOf <#LordMontague>
  <#gA> d:creator "Peter"
  <#gB> d:creator "Mary"
}

<doc1#gA> {
 <#Romeo> r:loves <#Juliet>
 <#gA> d:creator "Peter"
}

<doc1#gB> {
 <#Romeo> r:sonOf <#LordMontague>
 <#gB> d:creator "Mary"
}

The semantics would be that the graph “Romeo loves Juliet” was created by “Peter” and the graph “Romeo is son of Lord Montague” was created by “Mary”. These graphs are quite trivial: apart from the creator statement, each contains only one triple. But the same technique could be used for graphs containing many triples, as shown in Fig 2.

Fig 2. Graphical view of Named Graphs
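How an agent might process the proposed graph attribute can be sketched in Python with the standard xml.etree.ElementTree module. Keep in mind that the attribute is only a proposal, that the XML below is the simplified notation from this post (unprefixed RDF, Description, ID, graph and resource), and that the two namespace URIs are placeholders I made up to keep the sample well-formed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Simplified RDF/XML as used in this post; namespace URIs are placeholders.
DOC = """
<RDF xmlns:r="http://example.org/r#" xmlns:d="http://example.org/d#">
  <Description ID="Romeo" graph="#gA">
    <r:loves resource="#Juliet"/>
  </Description>
  <Description ID="Romeo" graph="#gB">
    <r:sonOf resource="#LordMontague"/>
  </Description>
  <Description ID="gA" graph="#gA">
    <d:creator>Peter</d:creator>
  </Description>
  <Description ID="gB" graph="#gB">
    <d:creator>Mary</d:creator>
  </Description>
</RDF>
"""

PREFIXES = {"http://example.org/r#": "r", "http://example.org/d#": "d"}


def qname(tag):
    """Turn ElementTree's '{namespace}local' form back into 'prefix:local'."""
    namespace, local = tag[1:].split("}")
    return f"{PREFIXES[namespace]}:{local}"


def group_by_graph(xml_text, doc_name):
    """Put each triple in the document graph and in its own named graph."""
    graphs = defaultdict(set)
    for description in ET.fromstring(xml_text).findall("Description"):
        subject = "#" + description.get("ID")
        graph_name = doc_name + description.get("graph")
        for prop in description:
            obj = prop.get("resource")
            if obj is None:
                # No resource attribute: treat the element text as a literal.
                obj = '"%s"' % (prop.text or "").strip()
            triple = (subject, qname(prop.tag), obj)
            graphs[doc_name].add(triple)
            graphs[graph_name].add(triple)
    return dict(graphs)


graphs = group_by_graph(DOC, "doc1")
for name in sorted(graphs):
    print(name, sorted(graphs[name]))
```

The sketch yields the three graphs doc1, doc1#gA and doc1#gB, with the document graph holding the union of the other two.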

May 3, 2008

Understanding the OAI-ORE data model

Filed under: Uncategorized. Posted by hochstenbach @ 6:55 am

OAI-ORE is the new data exchange model proposed by the Open Archives group. It is currently still in an alpha version, with a 1.0 release hoped for at the end of 2008. I have to admit that the data model and news groups make for tough reading when you are not totally into Semantic Web technologies like OWL, Named Graphs, SPARQL and the various RDF vocabularies. I’ll try to make some general observations on OAI-ORE from a programmer’s viewpoint without going into much detail on syntax and semantics. However daunting the details may be, this is quite an exciting subject area attracting huge audiences.

Three concepts are important when reading the OAI-ORE specs: resources, aggregations and named graphs. These are the building blocks on which OAI-ORE is based. Understanding their meaning and usage will give you much insight into the data model.

I

First, resources. We may all know what resources are. You download them from the Internet or point to them in your HTML href attributes. But from a formal RDF standpoint resources are something quite abstract. If you read the RDF Semantics document, it states that a resource is a very generic term that can stand for ‘anything in the universe of discourse’. An RDF resource could be an image of Bill Gates or a poem by William Shakespeare, but also the coffee mug on my table or a concept like General Relativity. Anything that can be named can be an RDF resource. How do you name resources? With URIs. These URI names are like instances of the java.net.URI class in Java. They have all the attributes and behavior of URIs, but you don’t use them to fetch data. In RDF Semantics we read: “[RDF] semantics doesn’t assume any particular relationship between […] a URI […] and a document or Web resource which can be retrieved by using that URI reference.” And in RDF Concepts: “[…] nothing requires that an RDF application be able to retrieve any representation of resources identified by the URIs”. In practice this means that RDF applications use URIs as names (comparing names and defining relations between names), not as actionable entities (having MIME types, sizes, last modification dates). This makes the RDF world a network overlay, relating resources by their names without much participation in the Web Architecture.

Things change when RDF graphs are serialized in documents that are a part of the Web Architecture. Then we get Tim Berners-Lee’s vision of the Semantic Web. Here URIs not only name things but also point to RDF/XML documents, which contain new URIs that point to other RDF/XML documents, etc. URIs in this form are resource locators, as in URLs that have MIME types, protocols, gateways, proxies, etc. When we read the OAI-ORE documents and see protocol-based URIs, read: URLs. The resources in OAI-ORE not only name things, but should also be used to retrieve things. There is some discussion on how the OAI-ORE ecology should work (how much to retrieve, when to retrieve, when links between two resources are conceptual or physical, etc). But this is the basic idea: OAI-ORE behaves like HTML in treating resources. No abstract naming, real linking.

II

Second, aggregations. Why or when do we use OAI-ORE? We need OAI-ORE to identify a group of web resources (read: URLs) and make assertions about this group. OAI-ORE originates from the library world, so most examples will use publications. Take Marvin Minsky’s “K-Lines: A Theory of Memory” article in MIT’s institutional repository: http://dspace.mit.edu/handle/1721.1/5739. This is an HTML page created for humans to understand.

We see a metadata record and two versions of Marvin’s article, one in PS format and one in PDF format. Computers, however, don’t have such an easy time. They see this:

<a target="_blank" href="/bitstream/1721.1/5739/1/AIM-516.ps">View/Open</a>
[..]

<a target="_blank" href="/bitstream/1721.1/5739/2/AIM-516.pdf">View/Open</a>

A programmer would need to screen scrape this page, find the hrefs which contain the ‘/bitstream/’ paths, and infer from the file names and extensions that these could be two versions of the same document. This is not an easy thing to do on a worldwide scale. It would be better to have a machine-readable RDF/XML document which provides a structured view of this web page with all the semantics needed to understand the relations between all the resources. In a Semantic Web world we would like to refer to Marvin’s article. Wouldn’t it be better still to point to an address where a machine-readable RDF/XML document could be found? This is OAI-ORE again.
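To show how fragile the screen-scraping route is, here is a rough Python sketch using only the standard library. It assumes the two anchors quoted above are the whole page, and it guesses that links sharing a file name stem are versions of the same document; both the class name and that heuristic are my own assumptions, not something the repository promises.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin

PAGE_URL = "http://dspace.mit.edu/handle/1721.1/5739"
PAGE = """
<a target="_blank" href="/bitstream/1721.1/5739/1/AIM-516.ps">View/Open</a>
<a target="_blank" href="/bitstream/1721.1/5739/2/AIM-516.pdf">View/Open</a>
"""


class BitstreamLinks(HTMLParser):
    """Collect the href of every anchor that points into /bitstream/."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href and "/bitstream/" in href:
            self.links.append(href)


def group_by_stem(page_url, html):
    """Guess versions: same file name stem, different extension."""
    parser = BitstreamLinks()
    parser.feed(html)
    groups = {}
    for href in parser.links:
        stem, ext = posixpath.splitext(posixpath.basename(href))
        groups.setdefault(stem, {})[ext.lstrip(".")] = urljoin(page_url, href)
    return groups


versions = group_by_stem(PAGE_URL, PAGE)
print(versions)
```

Everything the code "knows" about the article is a guess based on path names, which is exactly the kind of inference a machine-readable description would make unnecessary.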

There is some discussion on how best to serialize OAI-ORE documents, in RDF/XML or Atom. But this is the idea: OAI-ORE is especially suited to talk about groups/aggregations of resources. Because OAI-ORE is based on RDF, it can do a lot more than that and will interact with other RDF resources on the Semantic Web. But talking about groups/aggregations of web resources (giving them identifiers, creators, provenance) is ORE’s niche.

Why is an OAI-ORE Aggregation so special, you might ask? Couldn’t existing concepts such as RDF containers or RDF collections be used? Indeed, OAI-ORE Aggregations behave a bit like an rdf:Bag in some situations, and like an rdf:Seq or rdf:Alt in others. Aggregations act as a generic RDF container, but add a lot more semantics to it. As you might know, RDF containers have a very limited vocabulary. As is stated in the RDF Semantics: “any ‘natural’ assumptions concerning RDF containers are not formally sanctioned by the RDF model theory”. OAI-ORE tries to get some of those ‘natural’ assumptions back into aggregations.

III

Third, Named Graphs. RDF statements are triples composed of a subject, a predicate and an object. Create many triples and you have an RDF graph. When exchanging and merging RDF graphs between computers, you lose track of who created each graph. There is no good way in RDF to add provenance metadata to an RDF graph. When merging two graphs in a triple store this information is usually lost. This is bad for aggregations. You want to know who made the aggregation, at what time, for what reason. You could do this by adding new triples to the graph stating that each of the original triples was created by Mr. X. This process is called reification, but it is a bit problematic for two reasons: 1) you need very many triples to create a statement about a whole graph, 2) even if you did all the work, the triples do not mean what you might hope.

To solve this problem, extensions to the RDF model and syntax have been proposed; alas, none of them has made it into a W3C recommendation. OAI-ORE is using the Named Graph approach created by Carroll, Bizer, Hayes and Stickler. The idea is to add to every graph a name, a URI. When using this URI in a triple you are by definition creating statements about the whole graph. There is some discussion on how this graph name is best serialized, because RDF doesn’t have any notion of Named Graphs. OAI-ORE is using a combination of two proposals:

  • If you retrieve an RDF/XML document which contains an xml:base, then this is the name of the graph.
  • If you retrieve an RDF/XML document which doesn’t have an xml:base, then the URL location of the document is the name of the graph.

Both have their advantages and disadvantages. The most problematic point is that one URI is used as the identifier of a resource, the name of a graph and the web location of a graph. So, when writing RDF applications for Named Graphs you need to take great care to figure out what a URI really means. Another problematic point is that every Named Graph should be in a separate document. For instance, it would not be possible to give names to two sub-graphs using only one OAI-ORE document. And third, it is not clear what happens to the model when you mix graphs with names and graphs without names.
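The two naming rules can be sketched in a few lines of Python. This assumes the xml:base sits on the root element and is used verbatim, without resolving it against the document location; the function name and the example.org URL are placeholders of mine.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:base under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def graph_name(document_url, rdf_xml):
    """Use xml:base on the root element if present, else the document URL."""
    root = ET.fromstring(rdf_xml)
    return root.get(XML_BASE) or document_url


WITH_BASE = '<RDF xml:base="ABCD"><Description ID="Romeo"/></RDF>'
WITHOUT_BASE = '<RDF><Description ID="Romeo"/></RDF>'

print(graph_name("http://example.org/doc1", WITH_BASE))
print(graph_name("http://example.org/doc1", WITHOUT_BASE))
```

Either way the function returns a single name per document, which is the limitation discussed above: one document, one named graph.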

Named Graphs in OAI-ORE are used to give a name to the aggregation graph. This graph is called a ‘Resource Map’ and its URI is called URI-R. Because of the identifier problems mentioned above, this URI-R is at once the name of the aggregation graph, the identifier of the ‘Resource Map’ document used to add provenance metadata to the aggregation graph, and the location of the ‘Resource Map’ document itself. When reading the data model one needs to separate these different usages. It is also possible to refer to an aggregation outside the context of a URI-R, by using the name of the aggregation container itself, called the URI-A.

Concluding, an OAI-ORE aggregation is a resource with name URI-A. The aggregation behaves a bit like an RDF container and contains resources which can all be interrelated. All the triples of the aggregation and its (inter)relations make up a graph. This graph has been given a name URI-R. This URI-R is used as a resource to add provenance metadata to the aggregation graph.
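As a rough illustration of this summary, the Python sketch below models one Resource Map as a named graph. The two example.org URIs and the creator are invented for the example, the aggregated URLs are the two bitstream links of the Minsky page resolved against its address, and the predicate names (ore:describes, ore:aggregates, dcterms:creator) are how I read the vocabulary; check them against the specification before relying on them.

```python
# Hypothetical identifiers, invented for this illustration.
URI_A = "http://example.org/aggregation/k-lines"
URI_R = "http://example.org/rem/k-lines"

PS = "http://dspace.mit.edu/bitstream/1721.1/5739/1/AIM-516.ps"
PDF = "http://dspace.mit.edu/bitstream/1721.1/5739/2/AIM-516.pdf"

# The Resource Map is a graph named URI-R.
resource_maps = {
    URI_R: {
        (URI_R, "ore:describes", URI_A),      # the map describes the aggregation
        (URI_R, "dcterms:creator", "#Peter"),  # provenance: about the graph
        (URI_A, "ore:aggregates", PS),
        (URI_A, "ore:aggregates", PDF),
    }
}


def aggregated_resources(graph, uri_a):
    """Resources that the aggregation URI-A groups together."""
    return sorted(o for s, p, o in graph if s == uri_a and p == "ore:aggregates")


def provenance(graph, uri_r):
    """Statements whose subject is the graph name URI-R itself."""
    return sorted((p, o) for s, p, o in graph if s == uri_r)


rem = resource_maps[URI_R]
print(aggregated_resources(rem, URI_A))
print(provenance(rem, URI_R))
```

The split between the two helper functions mirrors the split between URI-A (what is aggregated) and URI-R (who said so).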

I hope these concepts give you some clues on how to interpret the OAI-ORE data model. Personally I think OAI-ORE is a very good step in the direction of a truly Semantic Web. But I am still lagging behind on all the acronyms used in the data model and their consequences.

April 20, 2008

Compound Objects, Maps, Embedding and Web Services

Filed under: Uncategorized. Posted by hochstenbach @ 9:22 am

In the Driver-I project, institutional repositories expose metadata on their freely available publications as Dublin Core XML records on the network. These nicely structured, machine-readable records can be harvested via the OAI-PMH protocol, indexed, made searchable, disassembled for use in the layout of search results (displaying only the title and authors, for instance) and grouped in citation lists. Because they have an identifier, they can also be referenced and reasoned about using technologies such as RDF. All this is possible because these Dublin Core records contain all the semantics needed for reuse of information (in this case metadata about the publication).
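A short Python sketch shows why such records are easy to reuse. The record below is hand-written for illustration (it borrows the Minsky article mentioned elsewhere on this page) and is not an actual harvested response; the function names are mine, and a real harvester would first unwrap the OAI-PMH envelope around each record.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hand-written sample record, not real harvested data.
RECORD = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>K-Lines: A Theory of Memory</dc:title>
  <dc:creator>Minsky, Marvin</dc:creator>
  <dc:identifier>http://dspace.mit.edu/handle/1721.1/5739</dc:identifier>
</oai_dc:dc>
"""


def dc_fields(record_xml):
    """Collect Dublin Core elements into a dict of lists (fields may repeat)."""
    fields = {}
    for element in ET.fromstring(record_xml):
        namespace, _, local = element.tag[1:].partition("}")
        if namespace == DC_NS:
            fields.setdefault(local, []).append((element.text or "").strip())
    return fields


def short_display(fields):
    """Disassemble the record: show only the authors and the title."""
    authors = "; ".join(fields.get("creator", []))
    title = "; ".join(fields.get("title", []))
    return f"{authors}: {title}"


fields = dc_fields(RECORD)
print(short_display(fields))
print(fields["identifier"][0])
```

No guessing is involved here: every field announces what it is, which is the contrast with the splash pages discussed next.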

For the publications themselves, it is not that easy. Institutional repositories expose publications using so-called splash pages. These pages contain a description of the publication, its abstract, and links to (mostly binary) files such as PDFs. Although the splash pages have structure (HTML), this structure is mostly used for presentational purposes: presenting titles, italics, lists, anchors (not only to the publication but also to library homepages, next/prev buttons), etc. Even worse, the publications themselves are binary files most of the time. Reuse of information, as in identification, disassembly, indexing and searchability, is not that easily achieved. One needs the Googles and Microsofts and all the smart software engineers of this world to extract the necessary data from the pages and binary files and to interpret them.

What are the options? Can Driver-II go beyond the splash page and get easier access not only to metadata but also to the publications? I’m reviewing some state of the art technologies trying to find an answer. From a technology standpoint I see five routes available. My classification is not very strict and gives only a direction into which technologies tend to move:

  1. Envelopes, compound objects or packaging formats as I call them. It is very hard to come up with good names because, depending on the context, they have multiple usages. These formats provide access to the metadata, structural data, identifiers, and binary streams of publications all in one package. They tend to give a complete description and ideally have no external dependencies. Examples are: METS, MPEG-21/DIDL, LOM/IMS, OpenOffice packages.
  2. Overlays, maps, feeds. These formats provide an overlay on top of an existing network of internet resources. They tend to group references to resources, identify them and describe the content, structure and relations of all parts. Examples are: RDF, ORE, POWDER, TopicMaps, Atom, RSS, Sitemaps.org, ROR.
  3. Embedding, or extending existing resources. Here no new resources are introduced on the network, but existing resources are ‘beautified’ by adding semantic annotations. Examples are: RDFa, Microformats, XMP
  4. New/old publishing formats. HTML and XML are not dead: with new HTML versions and XML publishing formats, a whole new range of open, semantically rich documents becomes available. Examples: HTML5, XHTML, ODF, OOXML, OPF
  5. Web services. I confess, this is a bit of a catch-all. The other four routes are very static: no interaction with a dynamic service is needed to extract the information. For web services you need to add APIs (in addition to OAI-PMH) on top of digital repositories to answer questions from agents about the content of your collections. Examples are: GData, O.K.I, unAPI

I’ll try to present some thoughts on these in the next posts…

April 12, 2008

Arriving late…

Filed under: Uncategorized. Posted by hochstenbach @ 7:05 am

I feel like I’m arriving late to a party. When did Web 2.0 arrive on the scene? Around 2004, I now read on Wikipedia. Anyway, I was reading John Allsopp’s Microformats: Empowering Your Markup for Web 2.0 and I thought it would be best to go from theory into practice immediately. I signed up for all the gadgets everyone is talking about (the Facebooks, Twitters, del.icio.uses and Flocks of this world). Let me see what this brings. I’ll do the facebooking for my personal interests (books that I read, friends that I know), twittering to keep myself busy with what (?), del.icio.using for all the work-related websites I read, and now wordpressing, probably to give some comments on the publications I read.
