In the Driver-I project, institutional repositories expose metadata about their freely available publications as Dublin Core XML records on the network. These nicely structured, machine-readable records can be harvested via the OAI-PMH protocol, indexed, made searchable, disassembled for use in the layout of search results (displaying only the title and authors, for instance), and grouped in citation lists. Because each record has an identifier, it can also be referenced and reasoned about using technologies such as RDF. All this is possible because these Dublin Core records contain all the semantics needed for reuse of the information (in this case, metadata about the publication).
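To make the "disassembly" concrete, here is a minimal sketch of pulling the title, authors and identifiers out of a harvested record. The XML namespaces are the standard OAI-PMH and Dublin Core ones; the sample response, the `repository.example.org` addresses and the `parse_records` helper are made up for illustration and do not come from any Driver repository.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH responses carrying oai_dc records.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, heavily trimmed ListRecords response (illustration only).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1234</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Publication</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:creator>Roe, Richard</dc:creator>
          <dc:identifier>http://repository.example.org/handle/1234</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return one dict per record with the fields a search result would show."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        dc = record.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:
            # Records without a metadata section (e.g. deleted ones) are skipped.
            continue
        records.append({
            "oai_identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": dc.findtext("dc:title", namespaces=NS),
            "creators": [c.text for c in dc.findall("dc:creator", NS)],
            "identifiers": [i.text for i in dc.findall("dc:identifier", NS)],
        })
    return records


if __name__ == "__main__":
    for rec in parse_records(SAMPLE_RESPONSE):
        print(rec["title"], "/", "; ".join(rec["creators"]))
```

Every field is reached by name, with no guessing about what a piece of text means. That is exactly what is missing once we move from the metadata to the publications themselves.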
For the publications themselves, it is not that easy. Institutional repositories expose publications using so-called splash pages. These pages contain a description of the publication, its abstract, and links to (mostly binary) PDF files. Although the splash pages have structure (HTML), this structure is mostly used for presentational purposes: titles, italics, lists, anchors (not only to the publication but also to library homepages, next/prev buttons), etc. Even worse, the publications themselves are binary files most of the time. Reuse of information, as in identification, disassembly, indexing and searchability, is not that easily achieved. One needs the Googles, the Microsofts and all the smart software engineers of this world to extract the necessary data from the pages and binary files and to interpret it.
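A small sketch shows why this is harder. The splash page below is invented, and so is the heuristic in `guess_publication_links` ("anything ending in .pdf is probably the publication"). It is not how any particular harvester works, just the kind of guess a harvester is forced to make.

```python
from html.parser import HTMLParser

# Hypothetical splash page (illustration only).
SAMPLE_SPLASH_PAGE = """
<html><body>
  <h1>An Example Publication</h1>
  <p><i>Doe, Jane</i></p>
  <a href="/home">Library home</a>
  <a href="/handle/1233">prev</a>
  <a href="/handle/1235">next</a>
  <a href="/bitstream/1234/thesis.pdf">Full text</a>
  <a href="/bitstream/1234/appendix.pdf">Appendix</a>
</body></html>
"""


class AnchorCollector(HTMLParser):
    """Collect the href of every anchor on the page, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def guess_publication_links(html_text):
    """Heuristic guess: treat every link ending in .pdf as a publication file."""
    parser = AnchorCollector()
    parser.feed(html_text)
    return [h for h in parser.hrefs if h.lower().endswith(".pdf")]


if __name__ == "__main__":
    print(guess_publication_links(SAMPLE_SPLASH_PAGE))
```

The HTML says nothing about which of the two PDFs is the publication and which is supplementary material, and a repository that serves its files without a `.pdf` extension defeats the heuristic completely.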
What are the options? Can Driver-II go beyond the splash page and get easier access not only to metadata but also to the publications? I’m reviewing some state-of-the-art technologies to try to find an answer. From a technology standpoint I see five routes available. My classification is not very strict and only indicates the direction in which the technologies tend to move:
- Envelopes, compound objects or packaging formats, as I call them. It is very hard to come up with a good name, because depending on the context they have multiple usages. These formats provide access to the metadata, structural data, identifiers, and binary streams of publications all in one package. They tend to give a complete description and ideally have no external dependencies. Examples are: METS, MPEG-21/DIDL, LOM/IMS, OpenOffice packages.
- Overlays, maps, feeds. These formats provide an overlay on top of an existing network of internet resources. They tend to group references to resources, identify them and describe the content, structure and relations of all parts. Examples are: RDF, ORE, POWDER, TopicMaps, Atom, RSS, Sitemaps.org, ROR.
- Embedding, or extending existing resources. Here no new resources are introduced on the network, but existing resources are ‘beautified’ by adding semantic annotations. Examples are: RDFa, Microformats, XMP.
- New/old publishing formats. HTML and XML are not dead: with new HTML versions and XML publishing formats, a whole new range of open, semantically rich documents becomes available. Examples are: HTML5, XHTML, ODF, OOXML, OPF.
- Web services. I confess, this is a bit of a catch-all. The other four routes are very static: no interaction with a dynamic service is needed to extract all the information. For web services you need to add APIs (in addition to OAI-PMH) on top of digital repositories to answer questions from agents about the content of your collections. Examples are: GData, O.K.I, unAPI.
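To give a feeling for what the overlay route buys you, here is a minimal sketch using Atom, where a link can be typed with `rel="enclosure"` and a MIME type. The entry, its URLs and the `find_enclosures` helper are made up for illustration. I am not claiming any repository publishes feeds like this today.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical Atom entry describing one publication (illustration only).
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://repository.example.org/handle/1234</id>
  <title>An Example Publication</title>
  <updated>2008-01-01T00:00:00Z</updated>
  <link rel="alternate" type="text/html"
        href="http://repository.example.org/handle/1234"/>
  <link rel="enclosure" type="application/pdf"
        href="http://repository.example.org/bitstream/1234/thesis.pdf"/>
</entry>"""


def find_enclosures(entry_xml):
    """Return (mime type, URL) for every link explicitly marked as an enclosure."""
    entry = ET.fromstring(entry_xml)
    return [
        (link.get("type"), link.get("href"))
        for link in entry.findall(ATOM + "link")
        if link.get("rel") == "enclosure"
    ]


if __name__ == "__main__":
    for mime_type, url in find_enclosures(SAMPLE_ENTRY):
        print(mime_type, url)
```

Compared with scraping a splash page, nothing has to be guessed here: the overlay states which link is the splash page (`alternate`) and which one is the file itself (`enclosure`).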
I’ll try to present some thoughts on these in the next posts…