Week 11: The Semantic Web

We have now covered most of the important aspects of Digital Asset Management and you should be familiar with the language and terms used.

In broad terms, effective digital asset management is about lending digital assets enough intelligence to re-present themselves to us in an accessible, re-purposable format - at the right time - in the right place.

However, you will also have seen that many of the components of an effective management environment are either still in development, yet to be resolved, or not interoperable with other systems or processes.

It is fitting that we conclude the content of this course with a reference to the powerful vision or blueprint for how it might all work - eventually.

Tim Berners-Lee

This vision comes from Tim Berners-Lee, the architect of the World Wide Web. He calls it the Semantic Web. A dedicated team at the World Wide Web Consortium (W3C) is working to improve, extend and standardize the Semantic Web, and many languages, publications and tools have already been developed.

Industry debate

See: Google exec challenges Berners-Lee

Resource Description Framework (RDF)

The Semantic Web is built on syntaxes (see the definitions used in week 2) that use URIs to represent data in structures called triples. Triples of URI data can be held in a database, or interchanged on the World Wide Web using a set of syntaxes developed especially for the task. These syntaxes are called the ‘Resource Description Framework (RDF)’. RDF is very simple. It is no more than a way to express and process a series of simple assertions.

You may recall that a URI (Uniform Resource Identifier) is a simple and extensible way of identifying a resource, just like the strings starting with “http:” or “ftp:” that you find on the World Wide Web. Anyone can create a URI, and ownership of URIs is clearly delegated, so they form an ideal base technology on which to build global interconnections between resources.

See RFC 2396 (since superseded by RFC 3986) for the general URI specification

In its simplest form, a triple can be described as three URIs. The language that uses triples in this way is called RDF. RDF/XML is considered to be the standard interchange format for RDF on the Semantic Web, although it is not the only format.

Triples, like other concepts in RDF, are taken from logic and linguistics, where subject-predicate and subject-predicate-object structures have meanings similar to, yet distinct from, the uses of those terms in RDF. For example:

In the English language statement ‘Digital Asset Management Week 11 was authored by Simon Pockley’, ‘Digital Asset Management Week 11’ would be the subject, ‘authored-by’ the predicate and ‘Simon Pockley’ the object. The diagram below shows the common graphical representation of RDF statements, introduced in the RDF Model and Syntax specification. Note that the object is a string: “Simon Pockley”. This is called a “literal” in RDF, but an object could also be a resource.

An RDF Statement (graph)
rdfmodel01.gif

The statements are called triples because they have three parts (subject, predicate, and object). When encoded as an RDF triple, the subject and predicate are resources named by URIs. The object can be either a resource or a literal. Databases of such triples have been shown to scale to many millions of triples, largely because the information they hold is so simple.
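To make the three parts concrete, here is a minimal sketch of a triple as a data structure. The class names and the document URI are assumptions made for illustration; only the predicate URI and the literal come from this week's example.

```python
from typing import NamedTuple, Union


class URIRef(str):
    """A resource named by a URI."""


class Literal(str):
    """A plain string value, such as a person's name."""


class Triple(NamedTuple):
    subject: URIRef
    predicate: URIRef
    object: Union[URIRef, Literal]


# The subject URI is hypothetical; the predicate uses the example namespace
# discussed in this section.
statement = Triple(
    subject=URIRef("http://www.example.org/dam/week11"),
    predicate=URIRef("http://schemas.duckdigital.net/rdfexample/authored-by"),
    object=Literal("Simon Pockley"),
)

# The object here is a literal, not a resource.
print(isinstance(statement.object, Literal))
```

Keeping resources and literals as separate types mirrors the rule that only the object may be a literal.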

A slightly more complex statement might look like:

Small RDF Model

This diagram shows several RDF statements combined into a single diagram. Almost all RDF is an expansion of this structure. RDF defines a directed graph of statements that describe resources. As you can see, I have replaced the literal “Simon Pockley” in the original statement with a URI representing this person, which in turn is the subject of several more statements. Such a collection of RDF statements is called a model in RDF.

While this scalable structure might seem too simple to be such an important technology, it is this simplicity that makes it so powerful. Computer science already has plenty to say about the effectiveness of graphs for representing information. RDF allows many simple statements to be aggregated so that machine agents can apply well-tested graph traversal techniques to glean data.
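As a rough illustration of graph traversal over a small model, the sketch below stores statements as plain tuples and walks outward from one resource. The document and person URIs are hypothetical, and the depth-first walk is just one of many traversal strategies an agent might use.

```python
from collections import defaultdict

NS = "http://schemas.duckdigital.net/rdfexample/"
DOC = "http://www.example.org/dam/week11"        # hypothetical document URI
PERSON = "http://www.example.org/people/simon"   # hypothetical person URI

# Each statement is a (subject, predicate, object) tuple.
model = [
    (DOC, NS + "authored-by", PERSON),
    (PERSON, NS + "name", "Simon Pockley"),
    (PERSON, NS + "nationality", "Australian"),
]

# Index outgoing edges by subject so the graph can be walked.
edges = defaultdict(list)
for s, p, o in model:
    edges[s].append((p, o))


def reachable(start):
    """Depth-first walk: yield every statement reachable from start."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for predicate, obj in edges.get(node, []):
            yield node, predicate, obj
            stack.append(obj)  # literals simply have no outgoing edges


facts = list(reachable(DOC))
for fact in facts:
    print(fact)
```

Starting from the document, the walk reaches the author and then the author's name and nationality, even though those facts were asserted about a different subject.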

What does it look like in XML?

The abstract representation above is the basis of RDF, but it is quite impractical for exchanging RDF descriptions and placing such descriptions in HTML and XML content. A plain list of triples is only one of several standard formats for RDF. The triple above can equivalently be represented in the standard RDF/XML format, and the model in Graph 1 might be rendered as:

XML serialization of the RDF model

rdfxml

The purpose of RDF is to provide an encoding and interpretation mechanism so that resources can be described in a way that software can understand or, better put, so that software can more easily access data organized within structured parameters.

The flexibility of RDF syntax makes it easy to apply RDF processing to existing XML. One constant in the RDF/XML serialization is the use of the element rdf:RDF to wrap the RDF statements.

Note the use of XML namespaces. Namespaces were introduced in week 4. RDF relies heavily on XML namespaces for disambiguating names. There are several element and attribute names that must be in the namespace defined by RDF. All RDF predicates must use a namespace to clarify their meaning.

Inside the RDF wrapper element, a Description element indicates the subject of the enclosed statements. This example uses the about attribute, which points to an external resource as the subject. There is one statement with this resource as subject, marked by the element ‘authored-by’, which forms the predicate.

Note that this element has the namespace http://schemas.duckdigital.net/rdfexample/.

According to RDF, this is translated to an abstract model in which the actual predicate is formed by joining the namespace URI and the local name of the predicate element. So the full predicate of this statement is http://schemas.duckdigital.net/rdfexample/authored-by.

The remaining part of the statement is the object. But the object of the first statement is not very clear. RDF handles the case in which the object of a statement is a resource that doesn’t have an external URI of its own. In the example, the resource representing the person named Simon Pockley is such a case, and is represented by the embedded Description element with an ID attribute. The URI of this resource is formed by joining the URI of the RDF file as a whole and the value of the ID attribute, separated by a “#”. Note that RDF takes this arcane concept (one of many) even further by allowing fully anonymous resources without even an ID.

The resource with ID “duckdigital.net” is itself the subject of two statements, with predicates represented by the child elements name and nationality. Note that these predicates are also in the http://schemas.duckdigital.net/rdfexample/ namespace. The objects of these statements are literals: “Simon Pockley” and “Australian”, respectively.
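The walk-through above can be sketched in code. The RDF/XML below is a reconstruction from the description in this section, not the original listing: the value of the about attribute and the URI of the RDF file (BASE) are assumptions, and the small parser handles only the constructs used here rather than the whole of RDF/XML.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BASE = "http://www.example.org/week11.rdf"  # hypothetical URI of the RDF file

# ElementTree writes qualified names as "{namespace}local".
DESCRIPTION = "{" + RDF + "}Description"
ABOUT = "{" + RDF + "}about"
ID = "{" + RDF + "}ID"

# Reconstructed document: the "about" URI is an assumption.
DOCUMENT = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://schemas.duckdigital.net/rdfexample/">
  <rdf:Description rdf:about="http://www.example.org/dam/week11">
    <ex:authored-by>
      <rdf:Description rdf:ID="duckdigital.net">
        <ex:name>Simon Pockley</ex:name>
        <ex:nationality>Australian</ex:nationality>
      </rdf:Description>
    </ex:authored-by>
  </rdf:Description>
</rdf:RDF>
"""


def subject_uri(description):
    """Resolve a Description element to the URI of its resource."""
    about = description.get(ABOUT)
    if about is not None:
        return about
    # An ID is relative to the file: join the file URI and the ID with "#".
    return BASE + "#" + description.get(ID)


def triples(description):
    """Yield (subject, predicate, object) for one Description element."""
    subject = subject_uri(description)
    for prop in description:
        # Joining the namespace URI and local name gives the full predicate.
        namespace, local = prop.tag[1:].split("}")
        predicate = namespace + local
        nested = prop.find(DESCRIPTION)
        if nested is None:
            yield subject, predicate, prop.text.strip()  # literal object
        else:
            yield subject, predicate, subject_uri(nested)  # resource object
            yield from triples(nested)


root = ET.fromstring(DOCUMENT)
result = [t for d in root.findall(DESCRIPTION) for t in triples(d)]
for triple in result:
    print(triple)
```

A real application would normally rely on an RDF library instead of hand-written parsing; the point here is only to show how the XML nesting unfolds into three flat statements.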

Web Ontology Language (OWL)

An ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. The term is borrowed from philosophy, where an Ontology is a systematic account of Existence. OWL is a vocabulary extension of the Resource Description Framework (RDF). Together with RDF and other components, these tools make up the Semantic Web project.

The Semantic Web needs ontologies with a significant degree of structure. These need to specify descriptions for the following kinds of concepts:

  • Classes (general things) in the many domains of interest
  • The relationships that can exist among things
  • The properties (or attributes) those things may have

OWL represents the meanings of terms in vocabularies and the relationships between those terms in a way that is suitable for processing by software. The OWL specification is maintained by the World Wide Web Consortium (W3C).
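Because OWL extends RDF, classes, relationships and properties can themselves be written as triples. The sketch below uses the standard rdf:type, rdfs:subClassOf and owl: terms, but the furniture vocabulary (borrowed from the antique furniture domain discussed later in this section) is hypothetical, and the subclass lookup is a simplification of what an OWL reasoner does.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"
OWL_OBJECT_PROPERTY = "http://www.w3.org/2002/07/owl#ObjectProperty"
OWL_DATATYPE_PROPERTY = "http://www.w3.org/2002/07/owl#DatatypeProperty"

EX = "http://www.example.org/furniture#"  # hypothetical vocabulary

ontology = [
    # Classes (general things)
    (EX + "Furniture", RDF_TYPE, OWL_CLASS),
    (EX + "StorageFurniture", RDF_TYPE, OWL_CLASS),
    (EX + "ChestOfDrawers", RDF_TYPE, OWL_CLASS),
    (EX + "StorageFurniture", RDFS_SUBCLASS_OF, EX + "Furniture"),
    (EX + "ChestOfDrawers", RDFS_SUBCLASS_OF, EX + "StorageFurniture"),
    # A relationship that can exist between things
    (EX + "madeBy", RDF_TYPE, OWL_OBJECT_PROPERTY),
    # A property whose value is a literal
    (EX + "dateCreated", RDF_TYPE, OWL_DATATYPE_PROPERTY),
]


def superclasses(cls):
    """Follow rdfs:subClassOf links upward and collect every ancestor."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in ontology:
            if s == current and p == RDFS_SUBCLASS_OF and o not in found:
                found.add(o)
                frontier.append(o)
    return found


print(sorted(superclasses(EX + "ChestOfDrawers")))
```

Software that knows a chest of drawers is a kind of storage furniture can answer a query about furniture in general without the annotation ever saying so.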

OWL is seen as a major technology for the future implementation of the Semantic Web. It was designed specifically to provide a common way to process the content of web information, and is intended to be read by computer applications rather than by humans. Because OWL can be written in XML, OWL information can easily be exchanged between different types of computers using different operating systems and application languages. OWL’s main purpose is to provide a standard framework for asset management, enterprise integration and the sharing and reuse of data on the Web. OWL was developed mainly because it has more facilities for expressing meaning and semantics than XML, RDF, and RDF-S, and so it goes beyond these languages in its ability to represent machine-interpretable content on the web.

Multimedia collections

Ontologies can be used to provide semantic annotations for collections of images, audio, or other non-textual objects. It is even more difficult for machines to extract meaningful semantics from multimedia than it is to extract semantics from natural language text. Thus, these types of resources are typically indexed by captions or metatags. However, since different people can describe these non-textual objects in different ways, it is important that the search facilities go beyond simple keyword matching. Ideally, the ontologies would capture additional knowledge about the domain that can be used to improve retrieval of images.

Multimedia ontologies can be of two types: media-specific and content-specific. Media-specific ontologies could have taxonomies of different media types and describe properties of different media. For example, video may include properties to identify the length of the clip and scene breaks. Content-specific ontologies could describe the subject of the resource, such as the setting or participants. Since such ontologies are not specific to the media, they could be reused by other documents that deal with the same domain. Such reuse would enhance a search that was simply looking for information on a particular subject, regardless of the format of the resource. Searches where media type was important could combine the media-specific and content-specific ontologies.

As an example of a multimedia collection, consider an archive of images of antique furniture. An ontology of antique furniture would be of great use in searching such an archive. A taxonomy can be used to classify the different types of furniture. It would also be useful if the ontology could express definitional knowledge. For example, if an indexer selects the value “Late Georgian” for the style/period of (say) an antique chest of drawers, it should be possible to infer that the data element “date.created” should have a value between 1760 and 1811 A.D. and that the “culture” is British. Availability of this type of background knowledge significantly increases the support that can be given for indexing as well as for search. Another feature that could be useful is support for the representation of default knowledge. An example of such knowledge would be that a “Late Georgian chest of drawers,” in the absence of other information, would be assumed to be made of mahogany. This knowledge is crucial for real semantic queries, e.g. a user query for “antique mahogany storage furniture” could match with images of Late Georgian chests of drawers, even if nothing is said about wood type in the image annotation.
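The furniture example can be sketched as a pair of lookup tables, one for definitional knowledge and one for default knowledge. This is an illustration only: the dictionary layout, field names and the small taxonomy fragment are assumptions, and a real system would express these rules in an ontology language rather than in application code.

```python
# Definitional knowledge: facts that follow from a style/period value.
STYLE_DEFINITIONS = {
    "Late Georgian": {"date_range": (1760, 1811), "culture": "British"},
}

# Default knowledge: assumed only when the annotation says nothing.
STYLE_DEFAULTS = {
    ("Late Georgian", "chest of drawers"): {"material": "mahogany"},
}

# Hypothetical taxonomy fragment linking object types to broader terms.
BROADER = {"chest of drawers": "storage furniture"}


def enrich(annotation):
    """Return a copy of the annotation with inferred and default values."""
    enriched = dict(annotation)
    style = annotation.get("style")
    enriched.update(STYLE_DEFINITIONS.get(style, {}))
    defaults = STYLE_DEFAULTS.get((style, annotation.get("type")), {})
    for key, value in defaults.items():
        enriched.setdefault(key, value)  # explicit annotations win
    return enriched


def matches(annotation, material, category):
    """Check whether an enriched annotation satisfies a simple query."""
    enriched = enrich(annotation)
    kind = enriched.get("type")
    same_material = enriched.get("material") == material
    return same_material and category in (kind, BROADER.get(kind))


# The indexer recorded only the type and the style/period.
image = {"type": "chest of drawers", "style": "Late Georgian"}
print(matches(image, "mahogany", "storage furniture"))
```

The query succeeds although the annotation never mentions wood type, and an indexer who records a different material still overrides the default.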

Why is RDF important?

Increasingly, to be competitive, Web applications must assemble data from diverse sources and services; furthermore, requirements for such applications tend to be far more fluid in “Internet time.” This is the sort of environment in which the extensibility of both XML and RDF really pays dividends. XML allows great flexibility for adaptation of data formats, and RDF provides great flexibility for adaptation of data-processing rules.

RDF can provide Web-based applications an “escape hatch” from the strictures of traditional database design and application evolution. Some folks have been complaining for years that traditional database management tools are too highly structured, and therefore add hefty maintenance costs when the real world inevitably changes around the application.

Tools and practical examples of RDF at work

Let’s construct an example.

Navy data

Melbourne Standard time zone: UTC/GMT +10 hours

Latitude: 37° 52′ South

Longitude: 145° 08′ East

1. Sunset Sunrise Calculator

2. Submit the URI of a Dutch seascape to the Flickr RDF generator to generate an RDF description
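Before feeding the Melbourne data into a calculator, it can help to see it as statements. The sketch below converts the latitude and longitude given above to signed decimal degrees and records them as triples; the place URI and the predicate vocabulary are hypothetical, not part of any published schema.

```python
def to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and minutes to signed decimal degrees."""
    value = degrees + minutes / 60
    # South and West are negative by convention.
    return -value if hemisphere in ("S", "W") else value


latitude = to_decimal(37, 52, "S")
longitude = to_decimal(145, 8, "E")

PLACE = "http://www.example.org/places/melbourne"  # hypothetical URI
GEO = "http://www.example.org/geo#"                # hypothetical vocabulary

statements = [
    (PLACE, GEO + "latitude", round(latitude, 4)),
    (PLACE, GEO + "longitude", round(longitude, 4)),
    (PLACE, GEO + "utcOffsetHours", 10),
]

for statement in statements:
    print(statement)
```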
