On digital libraries

May 11, 2008

<about> { Named Graphs }

Filed under: Uncategorized. Posted by hochstenbach @ 5:15 pm

In my previous post I explained that RDF graphs are collections of triples, each containing a subject, a predicate and an object. The standard serialization for RDF graphs is RDF/XML, and RDF/XML documents are made available on the World Wide Web. RDF agents harvest these documents and store the resulting merged graph in triple stores. These triple stores can be queried using SPARQL, the SQL of the Semantic Web, for further processing of the information carried in the RDF graphs.

Fig 1. Simplified view of a Semantic Web application

One problem with storing RDF graphs in a triple store is that information about the origin of each graph can be lost, depending on the triple store used [Ref]. To see how this can happen, take a look at Fig 1. An RDF graph is created which contains statements about characters appearing in one of Shakespeare’s plays. For the sake of argument, the triples of this graph live as information spread over the tables and columns of a database application. We use a pseudo-N3 notation in this blog to display the in-memory view of the docA RDF graph:

<#Romeo> r:loves <#Juliet>
<#Juliet> r:daughterOf <#LadyCapulet>
<#Mercutio> r:friendOf <#Romeo>

To serialize this RDF graph, an RDF/XML document ‘docA’ is created and published on a web server. There are other RDF/XML documents on the Internet called ‘docB’, ‘docC’, etc. Here is the in-memory view of docB:

<#Romeo> r:sonOf <#LordMontague>

Note that the RDF/XML serializations of these graphs are not shown in these examples.

An RDF application harvests all these documents, processes them and stores them in a triple store. The graphs are not stored as two separate documents. Instead, a merged graph is created which contains the combined triples of all documents. If the two graphs docA and docB both contain statements about Romeo, then all these statements are thrown on one heap in the triple store:

 <#Romeo> r:loves <#Juliet>
 <#Romeo> r:sonOf <#LordMontague>
 <#Juliet> r:daughterOf <#LadyCapulet>
 <#Mercutio> r:friendOf <#Romeo>

Without special precautions, it is not possible to say which graph made which statement about Romeo.
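The loss of origin can be sketched in a few lines of Python. This is only an illustration, assuming triples are modelled as plain tuples and the triple store as a set; the variable names are made up for this example and no real triple store works exactly like this.

```python
# Triples are plain (subject, predicate, object) tuples, named as in the post.
doc_a = {
    ("#Romeo", "r:loves", "#Juliet"),
    ("#Juliet", "r:daughterOf", "#LadyCapulet"),
    ("#Mercutio", "r:friendOf", "#Romeo"),
}
doc_b = {
    ("#Romeo", "r:sonOf", "#LordMontague"),
}

# A naive triple store (an assumption for this sketch): merging is a set union.
merged = doc_a | doc_b

# We can still ask what is said about Romeo...
about_romeo = sorted(t for t in merged if t[0] == "#Romeo")

# ...but nothing in a stored triple points back to docA or docB.
for triple in about_romeo:
    print(triple)
```

Both statements about Romeo come back from the merged set, yet the tuples carry no trace of the document they were harvested from.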

To solve this problem, a process called RDF reification can be used. Reification means making statements about statements. We could say that “Romeo loves Juliet” was created by docA:

 <docA> a:type a:Statement
 <docA> a:subject <#Romeo>
 <docA> a:predicate r:loves
 <docA> a:object <#Juliet>

This means something like “docA says: ‘Romeo loves Juliet’”. Do this for every statement in every graph (docA, docB, ...), store the results in the triple store as well, and you will have the context in which all statements were made. This works most of the time. However, formally you have created something that might mean something different from what you hoped [Ref]. RDF has considerable expressive power, with layered semantics on top of which ontologies, rules, logic and proof of statements can be added. Reification adds to RDF the ability to make statements about statements, but the resulting reified triple doesn’t have the same expressive power: a reified triple isn’t the triple itself. If we created exactly the same reified triple for a docC:

 <docC> a:type a:Statement
 <docC> a:subject <#Romeo>
 <docC> a:predicate r:loves
 <docC> a:object <#Juliet>

then we can’t conclude that the same statement “Romeo loves Juliet” appears in both documents [Ref]. In RDF, reification is not a quoting mechanism [Ref].
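A small Python sketch may make the point about reification more concrete. It assumes the same tuple model of triples as the pseudo-N3 above; the helper `reify` is hypothetical and simply mirrors the four `a:` triples shown in the post.

```python
def reify(statement_id, triple):
    """Return four triples that describe `triple` without asserting it."""
    s, p, o = triple
    return {
        (statement_id, "a:type", "a:Statement"),
        (statement_id, "a:subject", s),
        (statement_id, "a:predicate", p),
        (statement_id, "a:object", o),
    }

loves = ("#Romeo", "r:loves", "#Juliet")

# Store the reified descriptions made for docA and docC.
store = reify("docA", loves) | reify("docC", loves)

print(len(store))      # 8: four describing triples per document
print(loves in store)  # False: the described triple itself is never stored
```

The store ends up with two unrelated descriptions hanging off `docA` and `docC`. Nothing in it says they describe one and the same statement, and the statement itself is not asserted anywhere.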

Over the years, extensions to RDF have been proposed that add contextual information to RDF graphs in other ways. One proposal is to move from triples (subject, predicate, object) to quads (context, subject, predicate, object) [Ref]. But this solution depends on client-side adaptation. Another proposal, called Named Graphs, lets RDF graph creators give names (URIs) to their (sub)graphs [Ref]. This last proposal works like this: if RDF graph docA has triples like:

<#Romeo> r:loves <#Juliet>
<#Juliet> r:daughterOf <#LadyCapulet>
<#Mercutio> r:friendOf <#Romeo>

we can name this graph with a URI ‘graphA’:

<graphA> {
  <#Romeo> r:loves <#Juliet>
  <#Juliet> r:daughterOf <#LadyCapulet>
  <#Mercutio> r:friendOf <#Romeo>
}

The same can be done for RDF graph docC:

  <#Romeo> r:loves <#Juliet>

with a name ‘graphC’ we get:

<graphC> {
   <#Romeo> r:loves <#Juliet>
}

The Named Graph approach defines that any statement about a graph name (like graphA or graphC) is a statement about the graph as a whole. It is now possible to compare both statements “Romeo loves Juliet” and find out that one was produced by graphA and the other by graphC. Named Graph-enabled triple stores (e.g. Jena) add this name as extra information which can be used in SPARQL queries. We can also add statements about the graph as a whole. E.g.

<graphA> {
  <#Romeo> r:loves <#Juliet>
  <#Juliet> r:daughterOf <#LadyCapulet>
  <#Mercutio> r:friendOf <#Romeo>
  <graphA> d:creator <#Peter>
}

Here we made ‘Peter’ the creator of the RDF graph named ‘graphA’.
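To illustrate what a Named Graph-enabled store keeps around, here is a rough Python sketch. It assumes a store that is nothing more than a dictionary from graph name to a set of triples; the functions `add`, `graphs_stating` and `creators_of` are invented for this example and are not the API of Jena or any other product.

```python
from collections import defaultdict

# graph name -> set of (subject, predicate, object) tuples
named_graphs = defaultdict(set)

def add(graph, s, p, o):
    """Store a triple together with the name of the graph that states it."""
    named_graphs[graph].add((s, p, o))

add("graphA", "#Romeo", "r:loves", "#Juliet")
add("graphA", "#Juliet", "r:daughterOf", "#LadyCapulet")
add("graphA", "#Mercutio", "r:friendOf", "#Romeo")
add("graphA", "graphA", "d:creator", "#Peter")  # a statement about the graph itself
add("graphC", "#Romeo", "r:loves", "#Juliet")

def graphs_stating(triple):
    """Names of all graphs that contain the given triple."""
    return sorted(g for g, triples in named_graphs.items() if triple in triples)

def creators_of(graph):
    """Creators recorded for a graph, i.e. triples whose subject is the graph name."""
    return sorted(
        o for s, p, o in named_graphs.get(graph, ())
        if s == graph and p == "d:creator"
    )

print(graphs_stating(("#Romeo", "r:loves", "#Juliet")))  # ['graphA', 'graphC']
print(creators_of("graphA"))                             # ['#Peter']
```

Because the graph name travels with every triple, the question “who said that Romeo loves Juliet?” has an answer again, and the creator of graphA can be looked up like any other statement.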

Named Graphs and quads are rapidly gaining popularity in the Semantic Web community, with projects such as ORE, POWDER and OWL seeking ways to add metadata to (sub)graphs [Ref]. Unfortunately, serialization of Named Graphs in RDF/XML documents is problematic: there is no support for adding names in the current XML format. One suggestion is to use the URI of the RDF/XML document itself as the name of the graph. E.g. if I were to create an RDF/XML document like [namespace declarations omitted]:

<RDF>
 <Description ID="Romeo">
   <r:loves resource="#Juliet"/>
 </Description>
…
</RDF>

and publish it as ‘doc1’, then the Named Graph triples would become:

 <doc1> {
   <#Romeo> r:loves <#Juliet>
 }

This method has the disadvantage that the URI used to name the graph is terribly overloaded. Because ‘doc1’ is used both as the location of the RDF/XML document and as the name of the graph, conflicts can occur. E.g. when I create a triple:

  <doc1> r:owner "root"

Is the graph owned by ‘root’ (as in UNIX ownership) or the RDF/XML document? Probably the latter.

Another possible solution is to use a construct called ‘xml:base’, which provides a base URI for XML documents, and to define this xml:base as the graph name:

 <RDF xml:base="ABCD">
   <Description ID="Romeo">
     <r:loves resource="#Juliet"/>
   </Description>
   …
 </RDF>

Which would result in these triples:

 <ABCD> {
   <#Romeo> r:loves <#Juliet>
 }

This method (like the previous one) has the disadvantage that each (sub)graph you want to name has to appear in a separate RDF/XML document, which can be problematic in many use cases.
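Both naming schemes so far boil down to picking exactly one name per document, which the following Python sketch tries to show. The function `graph_name` is hypothetical, and falling back from xml:base to the document URI is my own assumption for the sake of the example; neither proposal prescribes this. The un-namespaced `RDF` and `Description` elements follow the simplified notation of this post, not real RDF/XML.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predeclared xml: prefix under this namespace URI.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

def graph_name(rdf_xml, document_uri):
    """Use xml:base on the root element as graph name, else the document URI."""
    root = ET.fromstring(rdf_xml)
    return root.get(XML_BASE, document_uri)

with_base = '<RDF xml:base="ABCD"><Description ID="Romeo"/></RDF>'
without_base = '<RDF><Description ID="Romeo"/></RDF>'

print(graph_name(with_base, "doc1"))     # ABCD
print(graph_name(without_base, "doc1"))  # doc1
```

Either way the function can only ever return a single name, which is why every (sub)graph needs a document of its own under these two schemes.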

A third proposal is being considered. By extending RDF/XML with a new attribute rdf:graph, any description which carries this attribute will be ‘stored’ in a graph named by the value of the attribute. E.g. if the RDF/XML in doc1 contained:

<RDF>
  <Description ID="Romeo" graph="#gA">
    <r:loves resource="#Juliet"/>
  </Description>
  <Description ID="Romeo" graph="#gB">
    <r:sonOf resource="#LordMontague"/>
  </Description>
  <Description ID="gA" graph="#gA">
    <d:creator>Peter</d:creator>
  </Description>
  <Description ID="gB" graph="#gB">
    <d:creator>Mary</d:creator>
  </Description>
</RDF>

Then this would be equivalent to these Named Graph triples:

<doc1> {
  <#Romeo> r:loves <#Juliet>
  <#Romeo> r:sonOf <#LordMontague>
  <#gA> d:creator "Peter"
  <#gB> d:creator "Mary"
}

<doc1#gA> {
 <#Romeo> r:loves <#Juliet>
 <#gA> d:creator "Peter"
}

<doc1#gB> {
 <#Romeo> r:sonOf <#LordMontague>
 <#gB> d:creator "Mary"
}

The semantics would be that the graph “Romeo loves Juliet” was created by “Peter” and the graph “Romeo is son of Lord Montague” was created by “Mary”. These graphs are quite trivial: apart from the creator statement, each contains only one triple. But the same technique could be used for graphs containing many triples, as shown in Fig 2.

Fig 2. Graphical view of Named Graphs
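How an agent might process the proposed graph attribute can be sketched in Python as follows. This is only my reading of the proposal, not a reference implementation: the namespace URIs under example.org are placeholders (the post omits its namespace declarations), the function `load_named_graphs` is hypothetical, and the un-namespaced `RDF`, `Description`, `ID`, `resource` and `graph` names follow the simplified notation used above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Placeholder namespace URIs, assumed for this sketch only.
PREFIXES = {"http://example.org/r#": "r", "http://example.org/d#": "d"}

DOC1 = """
<RDF xmlns:r="http://example.org/r#" xmlns:d="http://example.org/d#">
  <Description ID="Romeo" graph="#gA">
    <r:loves resource="#Juliet"/>
  </Description>
  <Description ID="Romeo" graph="#gB">
    <r:sonOf resource="#LordMontague"/>
  </Description>
  <Description ID="gA" graph="#gA">
    <d:creator>Peter</d:creator>
  </Description>
  <Description ID="gB" graph="#gB">
    <d:creator>Mary</d:creator>
  </Description>
</RDF>
"""

def qname(tag):
    """Turn ElementTree's '{uri}local' form back into 'prefix:local'."""
    uri, local = tag[1:].split("}")
    return f"{PREFIXES[uri]}:{local}"

def load_named_graphs(rdf_xml, document_uri):
    """Map graph names to sets of (subject, predicate, object) tuples."""
    graphs = defaultdict(set)
    for description in ET.fromstring(rdf_xml).iter("Description"):
        subject = "#" + description.get("ID")
        fragment = description.get("graph")
        for prop in description:
            obj = prop.get("resource")
            if obj is None:
                # No resource attribute: treat the element text as a literal.
                obj = '"%s"' % (prop.text or "").strip()
            triple = (subject, qname(prop.tag), obj)
            # Every triple belongs to the graph named after the document...
            graphs[document_uri].add(triple)
            # ...and also to the sub-graph named by the graph attribute.
            if fragment is not None:
                graphs[document_uri + fragment].add(triple)
    return graphs

graphs = load_named_graphs(DOC1, "doc1")
for name in sorted(graphs):
    print(name, sorted(graphs[name]))
```

Run on the doc1 example, this yields the three graphs listed above: `doc1` with all four triples, and `doc1#gA` and `doc1#gB` with two triples each.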

April 20, 2008

Compound Objects, Maps, Embedding and Web Services

Filed under: Uncategorized. Posted by hochstenbach @ 9:22 am

In the Driver-I project, institutional repositories expose metadata about their freely available publications as Dublin Core XML records on the network. These nicely structured, machine-readable records can be harvested via the OAI-PMH protocol, indexed, made searchable, disassembled for laying out search results (displaying only the title and authors, for instance) and grouped in citation lists. Because they have an identifier, they can also be referenced and reasoned about using technologies such as RDF. All this is possible because these Dublin Core records contain all the semantics needed for reuse of information (in this case metadata about the publication).

For the publications themselves, it is not that easy. Institutional repositories expose publications using so-called splash pages. These pages contain a description of the publication, its abstract, and links to (mostly binary) PDF files. Although the splash pages have structure (HTML), this structure is mostly used for presentational purposes: presenting titles, italics, lists, anchors (not only to the publication but also to library homepages and next/prev buttons), etc. Even worse, the publications themselves are binary files most of the time. Reuse of information, as in identification, disassembly, indexing and searchability, is not that easily achieved. One needs the Googles and Microsofts and all the smart software engineers of this world to extract the necessary data from the pages and binary files and to interpret them.

What are the options? Can Driver-II go beyond the splash page and get easier access not only to metadata but also to the publications? I’m reviewing some state-of-the-art technologies trying to find an answer. From a technology standpoint I see five routes available. My classification is not very strict and only indicates the directions in which technologies tend to move:

  1. Envelopes, compound objects or packaging formats, as I call them. It is very hard to come up with a good name because, depending on the context, they have multiple usages. These formats provide access to the metadata, structural data, identifiers, and binary streams of publications all in one package. They tend to give a complete description and ideally have no external dependencies. Examples are: METS, MPEG-21/DIDL, LOM/IMS, OpenOffice packages.
  2. Overlays, maps, feeds. These formats provide an overlay on top of an existing network of internet resources. They tend to group references to resources, identify them and describe the content, structure and relations of all parts. Examples are: RDF, ORE, POWDER, TopicMaps, Atom, RSS, Sitemaps.org, ROR.
  3. Embedding, or extending existing resources. Here no new resources are introduced on the network, but existing resources are ‘beautified’ by adding semantic annotations. Examples are: RDFa, Microformats, XMP
  4. New/old publishing formats: HTML and XML are not dead. With new HTML versions and XML publishing formats, a whole new range of open, semantically rich documents becomes available. Examples: HTML5, XHTML, ODF, OOXML, OPF
  5. Web services. I confess, this is a bit of a catch-all. The other four approaches are very static: no interaction with a dynamic service is needed to extract all the information. For web services you need to add APIs (in addition to OAI-PMH) on top of digital repositories to answer questions from agents about the content of your collections. Examples are: GData, O.K.I, unAPI

I’ll try to present some thoughts on these in the next posts…
