This blog now seems to be officially shuttered for the summer. ‘Cause it’s sunny out.
Elsewhere, though, there’s this: Metadata, Mark II, an overview of some nifty metadata technologies.
Update, 2008: Webmonkey shuttered its doors not too long after this article was published. While outdated already, I’ve pasted the original text of the article below – ’twas one of the last freelance writing bits I did while living in Rome.
Metadata, Mark II: FOAF, RDF, GeoURL, and SMBmeta
Remember META tags? Once upon a time, a finely crafted META keyword tag would get you the bourgeois treatment from search engines. You could specify exactly which search words should be associated with your site and, best of all, META tags were invisible to users, allowing webmasters a touch of the ol’ “editorial liberty.”
Yeah. *That* didn’t last. Almost instantly, META tags were abused and misused by pageview-hungry Web developers, who crammed all sorts of irrelevant and naughty keywords into their pages, trying to shunt the flow of Web traffic their way. Today, Google and other search engines essentially ignore META keyword tags.
(Of course, if you’re absolutely adamant that your page be promoted in response to specific search terms, Google, Yahoo, HotBot and the gang are happy to help, but with an improved targeted-placement technique far less attractive to spammers: It’s called Advertising, and it costs cash-money.)
End of story? That’d be sad, indeed, because META keyword tags were a rather sweet idea, at least on paper: short, sensible descriptions of your site, tailored so that machines could quickly read and index it, and subsequently help people find it.
Well, META’s not dead.
In the pages that follow, I’ll be giving you a bird’s eye view of a few independent technologies, each aspiring to get useful *metadata* back into the Web. Some are homegrown, some corporate, and some academic, but all of them let you enhance your site with useful information and improve the ways your site is associated with other sites. Sound interesting? Good, then here’s the game plan:
1. We’ll start with an explanation of that *metadata* word (so we can finally quit italicizing it).
2. Next comes a tour of the platitudes and latitudes of GeoURL, a fun, on-your-site-in-just-ten-minutes META tag that pinpoints your webpage’s real-world location with GPS-style accuracy.
3. Then we’ll check out SMBmeta, a newly launched metadata framework designed to give small businesses their fair share of the Web limelight.
4. We’ll finish up with a macro look at some of the “Semantic Web” standards favored by the W3C: Dublin Core and RDF — and we’ll show them off a bit with FOAF (Friend of a Friend), an application which leverages both those high-minded efforts.
OK then, let’s get started!
A lot of smart people (like Tim Berners-Lee, who merely *invented* the Web) are still laboring to make the big dream behind the old “META keyword” come true. That concept is Metadata, which, strictly speaking, means “data about data”, but in our context means “stuff describing your Web page as a whole”: who wrote it, what it’s about, related concepts or categories, the date it was written or updated, the language it’s written in, who controls the copyright, physical locations it describes, if there’s a Table of Contents, etc., etc.
The point is, nobody necessarily wants to *see* all those details cluttering every single Web page. But if that data were invisible, machine-readable, and used to describe both the contents and *context* of Web pages, that would open a lot of possibilities, allowing Things of Great Niftiness to ensue. The W3C calls this ambitious idea the “Semantic Web”:
“The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users.”
Scientific American, May 2001
So a metadata-rich Web wouldn’t just improve our user experience as we search and surf the Web; it would also improve the ability of “robots” or software agents to collect and process information on our behalf.
When people talk about the “Semantic” Web adding *meaning* to the Web, it’s not really for you and me — you and I generally understand whatever we’re reading, and know which links we need to click to get certain tasks done — it’s about adding *meaning* that machines can process and navigate.
Whoooo! Robots! Software Agents!
Indeed. Which brings us to this rather key caveat: In this tutorial, we’ll be looking at a mishmash of different technologies that are not yet widely adopted, and may well never be. (That includes the aforementioned Robot Agents.) As of press time, *not a one* of these technologies will deliver the slightest boost to your Google PageRank or your listing on HotBot. And there’s absolutely no guarantee they will in the future.
Still, ‘tis better to lead than follow, and more fun to fiddle with emergent tech than to wait for the “critical masses” to show up and master it before you, right?
Let’s start, then, with a metadata application already adopted by thousands of webloggers, because it’s fun and entertaining but keeps its feet firmly anchored in the real world: GeoURL.
Getting on the Map with GeoURL
With just two lines of code, GeoURL maps documents in cyberspace to real-world locations. Once you’ve added your site to GeoURL’s database, you can immediately see who else has registered Web pages in (or about) your neighborhood.
Here’s what the code looks like: The first line contains your Latitude and Longitude, and the second line contains your site’s name.
<meta name="geo.position" content="41.8833; 12.500" />
<meta name="DC.title" content="Professor Falken’s weblog" />
(Here’s a fill-in-the-blanks example:)
<meta name="geo.position" content="xx.xxxx; yy.yyyy" />
<meta name="DC.title" content="your site’s name here" />
In old hacker lingo, these coordinates are called an “ICBM Address.” (Like, Inter-Continental Ballistic Missile.) The Cold War humor might now be passé, but you still might want to use discretion here: If you aren’t comfortable publishing your street address to the Web, you can use broader coordinates, like those corresponding to your postal ZIP, City Hall, etc.
So, where do you get these coordinates, anyways? Your own latitude and longitude can be grabbed from a GPS device, but otherwise, free Web services make it an easy lookup. For sites in the U.S., GeoURL’s Geocoder page is the easiest method; it quickly converts street addresses into regional coordinates. (Nifty, no? A crafty search spider which sniffed out postal addresses in web pages and indexed them by location with this technique won Google’s 2002 Programming Contest.)
For those outside the U.S., handy lookup services include the Getty Thesaurus of Geographic Names, GeoNET, and MultiMap, though you’ll need to jump through a few additional hoops. Additional lookup resources are available here, along with GeoURL’s own documentation.
When you’ve figured out your coordinates, edit the Latitude, Longitude, and Your-Site-Name portions of the GeoURL tags, then drop them into the <head> of your document. You’ll now need to ask the GeoURL crawler to visit your site, and in a few minutes, you’re listed: a twinkling blue pinpoint on their global map of sites!
Now, I chose GeoURL as our first example because it’s not only fun (been poking around your neighbors’ sites already?) and a fast setup, but because it nicely exemplifies how a “controlled vocabulary” makes these locations machine-understandable and easier to search for in a database. In this case, the “controlled vocabulary” is the numerical latitude and longitude coordinates of the ICBM address.
After all, you might advertise your store’s address like this for newspaper readers:
“We’re two doors down from Starbucks, over by the mall.”
But a machine isn’t going to understand this:
<meta name="geo.position" content="we’re two doors down from Starbucks, over by the mall." />
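To make the contrast concrete, here’s a minimal Python sketch of how a crawler might read a geo.position value. The function name `parse_geo_position` and its range checks are our own assumptions for illustration; this is not GeoURL’s actual code.

```python
from typing import Optional, Tuple


def parse_geo_position(content: str) -> Optional[Tuple[float, float]]:
    """Parse a 'latitude; longitude' string; return None if it isn't usable.

    Hypothetical helper: the name and validation rules are assumptions.
    """
    parts = content.split(";")
    if len(parts) != 2:
        return None
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return None
    # Reject numbers that can't be real-world coordinates.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    return lat, lon


print(parse_geo_position("41.8833; 12.500"))
print(parse_geo_position("we're two doors down from Starbucks, over by the mall."))
```

The numeric pair parses cleanly, while the Starbucks directions come back as `None`: there’s simply nothing in them for a machine to index.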
I know what you’re thinking: Most folks don’t search for, say, local restaurants by keying in latitudes and longitudes. And that’s why controlled vocabs often include a thesaurus that maps equivalent terms to one another: for instance, a list of city names and their GPS coordinates. Ideally, producers like you and me take care of machine-readable data, while search engines would build the thesauri.
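Here’s a rough sketch of that division of labor in Python: a toy thesaurus translates a place name into coordinates, and a distance check finds registered pages nearby. Everything here is assumed for illustration, including the `THESAURUS` entries (approximate coordinates, not from any real gazetteer), the second registered page, and the function names.

```python
import math
from typing import Dict, List, Optional, Tuple

Coordinates = Tuple[float, float]

# Toy thesaurus: place names mapped to approximate (latitude, longitude).
# The entries are illustrative, not taken from any real gazetteer.
THESAURUS: Dict[str, Coordinates] = {
    "rome": (41.9, 12.5),
    "milan": (45.5, 9.2),
}


def lookup(place: str) -> Optional[Coordinates]:
    """Translate a human-friendly place name into coordinates."""
    return THESAURUS.get(place.strip().lower())


def distance_km(a: Coordinates, b: Coordinates) -> float:
    """Great-circle distance via the haversine formula (Earth radius ~6371 km)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pages_near(place: str, pages: Dict[str, Coordinates], radius_km: float) -> List[str]:
    """Return the names of pages registered within radius_km of a named place."""
    center = lookup(place)
    if center is None:
        return []
    return sorted(name for name, pos in pages.items()
                  if distance_km(center, pos) <= radius_km)


registered = {
    "Professor Falken's weblog": (41.8833, 12.5),
    "A weblog in Milan": (45.46, 9.19),  # made-up entry for contrast
}
print(pages_near("Rome", registered, radius_km=25))
```

The searcher types “Rome”; the machine only ever compares numbers.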
There’s one more educational nugget in the GeoURL example: that “DC.title” name used to identify your site’s name. “DC.title” isn’t an arbitrary term; it’s a smart use of the Dublin Core vocabulary, which we’ll cover in detail later.
While Dublin Core simmers on the backburner, let’s take a gander at SMBmeta.
Getting Busy with SMBmeta
SMBmeta knows its clientele: (S)mall and (M)edium-sized (B)usinesses. These entrepreneurial enterprises — like florists and lumberyards, candy stores and dentists — are the powerhouse of the U.S. economy. On their own, however, these local shops can’t afford an in-house Web team, or compete with Fortune-500-style advertising budgets. For many of them, simply being found on the Web is a challenge.
SMBmeta helps them by providing a “virtual Rolodex card” of sorts: a limited set of data fields which, when used like a fill-in-the-blanks template, can describe any business. The data fields cover all the small-biz essentials: name, description, address, parking, store hours, etc. To make sure it’s all easily machine-readable (and easier to search against), SMBmeta information is stored as XML, in which the tag attributes come from a controlled vocabulary but you can freely add your own descriptive content. Check out this code sample (lifted and condensed from SMBmeta’s docs), especially the <publicTransportation> line:
<?xml version="1.0" encoding="UTF-8"?>
<smbmeta version="0.9" xmlns="http://www.smbmeta.org/namespace/v0.9">
<name>Concord Eggplant Restaurant</name>
<description>Innovative vegetarian food for young and old.</description>
<type naics="722110">Vegetarian restaurant</type>
<location country="us" postalCode="01742">
<address href="http://www.concordeggplant.com/directions.html">300 Baker Avenue</address>
<hours day="all" open="1130" close="2130" timezone="local" />
<parking type="on-street">Lots of metered on-street parking</parking>
<publicTransportation type="train" blocksAway="3">Five minute walk from the commuter rail</publicTransportation>
</location>
</smbmeta>
The <publicTransportation> element showcases a nice balance between natural-language descriptions and a controlled vocabulary of tag attributes. With transportation types strictly limited to a fixed selection of *bus, train, subway, trolley, cable-car,* or *other*, and the blocksAway attribute limited to integer numbers, this data can easily be parsed, indexed, and searched against. (It’s also immune from poetic hyperbole.) The machine-friendly info is supplemented by the element’s contents, “Five minute walk from the commuter rail,” a bit of information tailored for the human reader.
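To show how little work that takes, here’s a short Python sketch that reads the machine-friendly attributes and keeps the human-friendly note alongside them. The sample is trimmed from the listing above (minus the XML declaration), and the function name and the skip-if-invalid behavior are our own assumptions rather than anything the SMBmeta spec prescribes.

```python
import xml.etree.ElementTree as ET

NS = {"smb": "http://www.smbmeta.org/namespace/v0.9"}
# Controlled vocabulary, as listed in the article text.
TRANSPORT_TYPES = {"bus", "train", "subway", "trolley", "cable-car", "other"}

SAMPLE = """<smbmeta version="0.9" xmlns="http://www.smbmeta.org/namespace/v0.9">
<name>Concord Eggplant Restaurant</name>
<location country="us" postalCode="01742">
<publicTransportation type="train" blocksAway="3">Five minute walk from the commuter rail</publicTransportation>
</location>
</smbmeta>"""


def read_transportation(xml_text):
    """Return (type, blocks_away, human_note) for each valid element."""
    root = ET.fromstring(xml_text)
    results = []
    for el in root.iterfind(".//smb:publicTransportation", NS):
        kind = el.get("type")
        blocks = el.get("blocksAway", "")
        # Enforce the machine-readable parts; skip anything off-vocabulary.
        if kind not in TRANSPORT_TYPES or not blocks.isdecimal():
            continue
        results.append((kind, int(blocks), (el.text or "").strip()))
    return results


print(read_transportation(SAMPLE))
```

Swap `train` for, say, `hovercraft`, and the entry is quietly dropped, which is exactly the kind of predictability a search index wants.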
Having seen some code, you might wonder what’s particularly novel, here. After all, SMBmeta’s XML structure is rudimentary, simple enough to be read (and hand-edited) by humans. The rather meager collection of tags, with names like <Location> and <LanguageSpoken> can be understood at face value. Furthermore, you don’t need to register your SMBmeta file with any Web service or registry — just upload it to your domain. For brand-new technology (SMBmeta was launched in 2003) this looks like code five years outta fashion. There’s no complicated RDF syntax, no small-biz ontology, no Dublin Core vocab — almost no fancy footwork at all.
And that’s exactly the point.
SMBmeta’s creator, Dan Bricklin, hasn’t been shy about citing the RSS (Really Simple Syndication) format as an inspiration; he’s hoping a similar grassroots, bottom-up adoption of SMBmeta will lend critical mass to the format. We Webmonkeys may be in the tutorial business, but we’ll readily admit that “View Source” can be the most edifying tutorial of all; SMBmeta knows this, too, and is engineered so that middling HTML hackers can cut, paste, and tweak our way into this technology without any tools other than a text editor and the (mercifully short) spec sheet.
But even though SMBmeta emulates Fisher-Price’s focus on user-friendly design, it’s not entirely painless to implement, and purposely so. Recalling the ignominious fate suffered by the “META keyword” tag at the hands of spammers, SMBmeta developers tossed a tiny hurdle into the setup process: the SMBmeta XML file must live at the very top (root) level of the domain, and you’re allowed just one SMBmeta file per domain. This restriction discourages spammers from flooding or shotgunning the system, since registering a multitude of domain names costs a fair chunk of change. There are other spam-proofing features built into the foundations of SMBmeta, like pointers to third-party “affirmation” authorities, who certify that your descriptions match your website and real-world offerings, and who blacklist offending entries (and their entire domains, for that matter). If the cat-and-mouse arms race between spammers and searchers intrigues you, this essay outlines the anti-spam groundwork performed by the SMBmeta folks.
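The one-file-per-domain rule means a crawler never has to hunt: whatever page it lands on, there’s exactly one place to look. Here’s a minimal Python sketch of that lookup. The filename `smbmeta.xml` is an assumption on our part (the text above only says the file lives at the root), and treating the hostname as “the domain” is a simplification.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the file is named "smbmeta.xml"; check the spec before relying on it.
SMBMETA_FILENAME = "smbmeta.xml"


def smbmeta_url(page_url: str) -> str:
    """Map any page URL to the single root-level SMBmeta location for its host."""
    parts = urlsplit(page_url)
    if not parts.netloc:
        raise ValueError("expected an absolute URL, got %r" % page_url)
    # Drop the path, query, and fragment; keep only scheme and host.
    return urlunsplit((parts.scheme, parts.netloc.lower(), "/" + SMBMETA_FILENAME, "", ""))


pages = [
    "http://www.concordeggplant.com/directions.html",
    "http://www.concordeggplant.com/menus/dinner.html?course=dessert",
]
print({smbmeta_url(p) for p in pages})
```

However many pages a site has, they all collapse to one candidate URL, so a spammer wanting a thousand listings needs a thousand domains.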
Obviously, SMBmeta is a well-thought-out format, and one that addresses a real need in the Small Biz community. It’s new, though, so it’s still too early to know if it will succeed. SMBmeta faces the same chicken-and-egg conundrum as most other metadata efforts: It’s unlikely to gain search-engine support before it’s widely adopted, but unlikely to be widely adopted before it gains search-engine support.
But c’mon, why not take the leap and build a file for your business? (When it does take off, you’ll be leading the pack.) Instead of a tag-by-tag tutorial to get you started, we’ll point you at SMBmeta.org’s Web-based form, which will spit out a skeletal SMBmeta file for you in just a few minutes. Using the spec, you can quickly add any accoutrements and double-check your work. Then, after uploading your file to your own domain, we’d suggest you register your site with Overall.com, a fledgling SMBmeta directory and search engine.
Now, on to Dublin Core, the super-flexible, way-modular, librarian-friendly metadata vocabulary that only *sounds* like an Irish heavy metal band.
Dublin Core Curriculum
Back in 1995, a motley crew of 100-odd software engineers, librarians, and Web architects held a workshop in Dublin, Ohio, to tackle a familiar problem: the difficulty of finding stuff on the Web. (And this was back when the Web contained a cute half-a-million documents; today it’s in the billions.)
The result of the brainstorming session was a core set of 15 metadata elements, designed to describe any resource (like Web pages, images, music) available via the Web or other networks. The collection was dubbed “The Dublin Core.” These are the elements:
- Title — A name given to the resource.
- Creator — Who/what is primarily responsible for making the content.
- Subject — The topic of the content.
- Description — Gives an account of the resource’s content.
- Publisher — entity responsible for making the resource available.
- Date — Typically, the creation or availability date.
- Language — The language.
- Contributor — An entity that’s made contributions to the content.
- Source — A reference to a resource from which the present resource is derived.
- Format — The physical or digital manifestation of the resource.
- Type — The nature or genre of the content.
- Resource identifier — An unambiguous reference to the resource, like a URL, URI, or ISSN#.
- Relation — A reference to a related resource.
- Coverage — The extent or scope of the content (e.g., a place or time).
- Rights management — Information on Intellectual Property Rights, Copyright, etc.
You can see bibliographic influences in this vocabulary, but it’s still simpler than the heavyweight record-keeping languages used by professional catalogers for libraries and museums. With an eye towards flexibility and ease-of-use, Dublin Core allows everything to be optional — you can use as few or as many of the elements as you wish. Additionally, the Dublin Core team avoided wasting time talking about syntax and other implementation details, at least at first. (Encoding Dublin Core information in HTML META tags is now detailed here, though this more modern working draft may now be a better guide.) In short, tags look like this:
<meta name="DC.element" content="Value" />
<meta name="DC.title" content="Webmonkey" />
<meta name="DC.date" content="2003-06-20" />
Fast-forward to 2003, and Dublin Core hasn’t impacted the life of the average websurfer much. Mainstream search engines don’t give much weight to DC META tags, nor did commercial sites or homepages ever really bother to include Dublin Core markup in their pages. Despite their ignorance of “resource description languages” (or, arguably, because of it) brute-force, free-text search applications like Google reign unthreatened as the kings of Web research. It’s tempting, therefore, to write off the Dublin Core effort.
Of course, it’s also tempting to write off Carrot Top as a comedian. Yet between 1-800-CALL-ATT commercials and Hollywood Squares spots, it’s evident that the foul humorist’s career doth liveth still, finding new life on radio and TV airwaves. And so it is with Dublin Core.
See, Dublin Core elements have a surprising habit of making cameo appearances in other metadata frameworks. In fact, it’s in this other context that you’re most likely to find tell-tale terms like “DC.title” and “DC.subject” today. Dublin Core elements are regularly used as building blocks within richer and more specific metadata frameworks. This way, even if a spider doesn’t understand the entirety of a metadata language, it can still recognize the lowest-common-denominator DC objects, making the Dublin Core a sort of lingua franca among different metadata languages. Big organizations (like university library systems) especially rely on Dublin Core to enable searching across heterogeneous databases.
And you? Remember that code needed to get listed in GeoURL?
<meta name="geo.position" content="41.8833; 12.500" />
<meta name="DC.title" content="Professor Falken’s weblog" />
That first line is specific to GeoURL’s crawler, but the second line is a generic expression of the Dublin Core “Title” element. By adding GeoURL code to your page, you’ve also made it possible for any Dublin-Core-savvy spider or agent to identify the title of your work. (It also makes it easier for human programmers to recognize the meaning of that metadata, even if they don’t fully understand the description model being used.)
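As a sketch of what “Dublin-Core-savvy” might mean in practice, here’s a tiny Python spider that knows nothing about GeoURL yet still picks out the title. The class name and the convention of matching on a `DC.` prefix are assumptions made for this example; real crawlers will have their own rules.

```python
from html.parser import HTMLParser


class DublinCoreSpider(HTMLParser):
    """Collect only the META tags whose name starts with 'DC.'; ignore the rest."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.lower().startswith("dc."):
            # Store under the element name, e.g. "title" for "DC.title".
            self.found[name[3:].lower()] = attributes.get("content") or ""


PAGE = """<html><head>
<meta name="geo.position" content="41.8833; 12.500" />
<meta name="DC.title" content="Professor Falken's weblog" />
</head><body></body></html>"""

spider = DublinCoreSpider()
spider.feed(PAGE)
print(spider.found)
```

The geo.position line sails right past this spider, but the title comes through, which is the lingua franca idea in miniature.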
So what other sorts of metadata frameworks are there? Let’s take a look at the big one, a general-use framework intended to describe anything and everything on the Web: RDF.
Resource Description Framework
RDF stands for “Resource Description Framework,” and like other metadata we’ve examined so far, it’s just another way of describing resources on the Web. RDF, however, is an official initiative of the W3C, the same folks who wrote the specifications for HTML, XML, and CSS. (*Nobody* breeds prize-winning acronyms like the W3C.)
Apart from its esteemed pedigree, RDF is remarkable for its large scope: It was designed to be a super-encompassing framework providing interoperability between different types of metadata. RDF can describe a single Web page, but also inter-relationships between a page and other resources on the Web. Likewise, RDF-crawling applications can do more than just parse one page’s worth of metadata — they can independently follow links to other metadata resources, placing things within a larger context. Even if an RDF agent wasn’t originally designed to handle the kind of metadata on your page, it may be able to automatically “learn” enough to process it meaningfully anyhow.
RDF has a rep for being academic and hard to understand. In truth, advanced RDF gets almost philosophically abstract, not to mention technically tricky. But the basics aren’t bad at all.
At heart, RDF is just a list of sentence-like assertions, or “Statements.” Like this here:
(This article) (is authored by) (Jason Cook)
(subject) (predicate) (object)
Happily, that’s as complicated as any single RDF statement gets. Every statement *must* follow that simple, three-part structure of Subject, Predicate, and Object; because of this, RDF statements are often referred to as “triples.” Witty, neh?
You’ll notice that statements always describe the relationship (the predicate) between the subject and object, like in this triplet o’ triples:
(This article) (is authored by) (Jason Cook)
(Jason Cook) (has email) (firstname.lastname@example.org)
(Jason Cook) (has homepage) (www.jasoncook.com)
On a cocktail napkin, we’d graph those relationships out like so:
[image lost to time]
Well, turns out my sketch above isn’t only cute-as-a-button, but it actually represents a directed graph, a type of mathematical model that’s easily traversed by computer algorithms, and which scales well to millions of nodes. That’s good news for agents like search-and-retrieval spiders.
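For the code-minded, the same napkin sketch can be modeled in a few lines of Python: triples as plain tuples, a dictionary of outgoing arrows, and a breadth-first walk from one node. This is only an illustration of the directed-graph idea, with made-up function names; real RDF tools identify nodes and arrows by URI rather than by friendly labels.

```python
from collections import defaultdict, deque

# The three statements from the text, as (subject, predicate, object) tuples.
TRIPLES = [
    ("This article", "is authored by", "Jason Cook"),
    ("Jason Cook", "has email", "firstname.lastname@example.org"),
    ("Jason Cook", "has homepage", "www.jasoncook.com"),
]


def build_graph(triples):
    """Index triples by subject: each node maps to its outgoing (predicate, object) arrows."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def reachable(graph, start):
    """Breadth-first walk along the arrows, returning nodes in the order visited."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        for _predicate, obj in graph.get(node, []):
            if obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return order


graph = build_graph(TRIPLES)
print(reachable(graph, "This article"))
```

Starting from the article, an agent reaches the author, then the author’s email and homepage; starting from the homepage, it reaches nothing else, because the arrows only point one way.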
Apart from illustrating the flow of direction from Subject to Object, you’ll notice that every arrow in my sketch is also attached to a URL. That’s important. In fact, it’s downright key. Because while you and I already have a shared notion of what the “Is Authored By” relationship implies, a computer doesn’t.
Having a URI associated with each arrow (each predicate) allows software agents to follow links when they need more information about the properties of a relationship. For example, you could provide guidelines saying, “watch out, neither ‘January 23rd, 1971’ nor ‘Sinatra_MyWay.MP3’ is a plausible Author”.
One common way of doing this is by putting a ‘schema’, in the form of an XML namespace document, at the URI. Obviously, a schema can’t give sci-fi-type Artificial Intelligence to any crawler that visits it, but it can list useful rules like “Authors sometimes have email addresses” to aid data-gathering.
Thankfully, you don’t have to code all this by yourself! A big benefit of using URI-based schemas is that you can piggyback off of previous work on the Web, and refer to terms already defined by others. For instance, if you include concepts like ‘Author’ and ‘Title’ in your metadata, you might as well link to a common schema like Dublin Core to define those terms for you.
Another benefit of tacking URIs onto every relationship expressed within an RDF file is that it eases those awkward moments when people insist on using different vocabularies to describe the same stuff.
Let’s say my metadata uses:
( ThisArticle )( isAuthoredBy )( Jason )
… while most folks prefer …
( ThisArticle )( DC:Creator )( Jason )
With RDF, I can append machine-readable instructions to the (isAuthoredBy) URI which explain, “(isAuthoredBy) is equivalent to (DC:Creator)”. Theoretically, that’s enough for a clever Dublin-Core-aware agent to translate and process my metadata.
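In spirit, the translation step is little more than a lookup. The Python sketch below fakes it with a hard-coded table; the `EQUIVALENT_TO` dictionary is a hypothetical stand-in for whatever an agent would actually learn by following the (isAuthoredBy) URI to its schema.

```python
# Hypothetical equivalence table, standing in for what an agent might learn
# by following the (isAuthoredBy) URI to its schema.
EQUIVALENT_TO = {"isAuthoredBy": "DC:Creator"}


def normalize(triples, equivalents):
    """Rewrite each predicate to its preferred equivalent, if one is declared."""
    return [(s, equivalents.get(p, p), o) for s, p, o in triples]


mine = [("ThisArticle", "isAuthoredBy", "Jason")]
print(normalize(mine, EQUIVALENT_TO))
```

Predicates with no declared equivalent pass through untouched, so an agent can translate what it recognizes without mangling what it doesn’t.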
Before we get too abstract, let’s see some code. Here’s our sketch example, in RDF-XML syntax:
<dc:creator rdf:resource="#Jason" />
Here’s a play-by-play commentary of what’s going on in the code above:
- Get the party started with the opening <rdf:RDF> tag.
- Inside the opening tag, we list URIs for three schemas used in this document. The first, the RDF schema, is always required. The second is the Dublin Core vocabulary, ideal for describing publications. The third is the Friend Of A Friend vocabulary, a set of terms that describe people.
- We start describing our resource with URI (Webmonkey/thisarticle) i.e., this article.
- We say the resource has a title, “Metadata Redux.” Recall that big to-do about needing a URI associated with every relationship? By using the <dc:title> tag, not native to RDF, but added via the Dublin Core namespace, we’ve implicitly named the Dublin Core schema as the URI containing details of what the “dc:title” relationship means.
- Ditto with dc:creator, though instead of giving a value, we reference a more detailed description (‘#Jason’) a few lines down.
- Using the FOAF vocabulary, we describe the resource “Jason,” identifying it as a Person, and then give him an Email and Homepage. Again, the semantics of all these relationships are in the FOAF schema listed up top.
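If you’d like to see the whole shape of such a document, here’s a hedged Python sketch that builds one by following the play-by-play above. It is a reconstruction under assumptions, not the original listing: the article URI is a placeholder (the real address isn’t preserved here), the `mailto:` and `http://` forms for the email and homepage are our guesses, and the choice of `rdf:ID`, `foaf:mbox`, and `foaf:homepage` is simply one common way to express those statements.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
FOAF = "http://xmlns.com/foaf/0.1/"
# Ask ElementTree to use the familiar prefixes when it writes the XML out.
for prefix, uri in (("rdf", RDF), ("dc", DC), ("foaf", FOAF)):
    ET.register_namespace(prefix, uri)

# Placeholder URI: the article's real address isn't preserved in the text.
ARTICLE_URI = "http://example.org/webmonkey/thisarticle"

root = ET.Element(f"{{{RDF}}}RDF")

# Describe the article: its title, plus a pointer to its creator.
article = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": ARTICLE_URI})
ET.SubElement(article, f"{{{DC}}}title").text = "Metadata Redux"
ET.SubElement(article, f"{{{DC}}}creator", {f"{{{RDF}}}resource": "#Jason"})

# Describe the person that "#Jason" refers to, using FOAF terms.
person = ET.SubElement(root, f"{{{FOAF}}}Person", {f"{{{RDF}}}ID": "Jason"})
ET.SubElement(person, f"{{{FOAF}}}mbox",
              {f"{{{RDF}}}resource": "mailto:firstname.lastname@example.org"})
ET.SubElement(person, f"{{{FOAF}}}homepage",
              {f"{{{RDF}}}resource": "http://www.jasoncook.com"})

rdf_xml = ET.tostring(root, encoding="unicode")
print(rdf_xml)
```

Run it and you’ll spot the one line that did survive, the `dc:creator` pointer to `#Jason`, sitting inside the article’s description.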
By the way, a nice thing about coding RDF’s relational model is that it doesn’t much matter what order stuff gets listed in; it’s the (metaphorical) bubbles, boxes, and arrows of relationships that count.
Phew! If that seemed intimidating, know this: RDF-XML code has a reputation for looking gnarly. Some RDF proponents argue that end-users shouldn’t worry about code or syntax, because one day, RDF will be hidden inside tools like Dreamweaver, Word, or MovableType. That’s a debatable defense, but we can already point you to one such no-brainer tool that paints your biographical portrait in RDF, using Friend Of A Friend vocabulary. It’s kinda fun, and you won’t need to type a line of code, promise.
Friend Of A Friend does for humans what SMBmeta does for small businesses: It provides a metadata vocabulary that’s excellent for describing a specific thing — in this case, People.
Another benefit of FOAF is that it’s an application of RDF. Leveraging RDF’s proclivity for expressing relationships, FOAF links your profile in with that of your coworkers, your friends, and other on-line communities. (And, yes, this technology can be utilized to share pictures of cats.)
Like we said earlier, you don’t need to code anything to build a basic FOAF file. A quick visit to the FOAF-A-Matic website can generate one for you, Jetsons-style. After that, just upload it to your website, optionally adding a self-discovery link in your site’s <head>. Browsers like FOAF Explorer and FOAFnaut will help you visualize and hop around the FOAF universe, and they’re handy for validating your code, too.
‘Course, the terminally-curious among you (anybody still sticking around) probably won’t be satisfied without knowing a smidgeon more about what you can do with FOAF. For starters, you can slather on as much metadata from other vocabularies as you wish.
For instance, here’s a FOAFy file which talks about somebody’s location, but in two different ways: First, it uses a property from the FOAF vocabulary called ‘based_near’, which itself uses part of a geographic vocabulary called ‘geo’. Second, it tacks on ‘NearestAirport’ data from a completely different vocabulary called ‘contact’, which in turn relies on a more-specific ‘Airport’ vocabulary. (I point to the *xmlns:geo*, *xmlns:contact*, and *xmlns:airport* namespaces up top, so that RDF crawlers can understand what I’m talking about when I specify latitudinal geographic coordinates, or the three-character Airport codes.)
<geo:Point geo:lat="41.8833" geo:long="12.5"/>
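Only that one line of the file survives here, so take the following Python sketch as a guess at the surrounding shape rather than a copy of the original. It wraps the surviving geo:Point in an assumed foaf:Person and foaf:based_near structure and pulls the coordinates back out; the airport data is left out entirely, and the function name is invented.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
}

# Assumed shape of a FOAF fragment; only the geo:Point line comes from the article.
FOAF_SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <foaf:Person>
    <foaf:based_near>
      <geo:Point geo:lat="41.8833" geo:long="12.5"/>
    </foaf:based_near>
  </foaf:Person>
</rdf:RDF>"""


def based_near(foaf_xml):
    """Return (lat, long) pairs for every geo:Point found under foaf:based_near."""
    root = ET.fromstring(foaf_xml)
    points = []
    for point in root.iterfind(".//foaf:based_near/geo:Point", NS):
        lat = point.get("{%s}lat" % NS["geo"])
        lon = point.get("{%s}long" % NS["geo"])
        if lat is not None and lon is not None:
            points.append((float(lat), float(lon)))
    return points


print(based_near(FOAF_SAMPLE))
```

Because each vocabulary lives in its own namespace, a crawler that only cares about geography can grab the point and ignore everything else in the file.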
RDF’s core extensibility adds a touch of free-market flair: As new schemas and vocabularies become popular, they can be easily added to your FOAF without breaking the file.
But really, what’s the point of all this? Is this not just the mortal sin of vanity, marked up in XML?
Tough call. At the moment, FOAF suffers the same Adoption vs. Support catch-22 that hampers most new technologies. (And most new technologies bite the dust.) Like the early hypertext Web, however, a community of individuals is convinced there’s something intrinsically nifty about this technology, and they’re determinedly tinkering away on it, releasing homespun apps like FOAFexplorer, FOAF: web view, FOAFbot, and others.
Some have proposed “Web of trust,” community profiling, and anti-spam applications for FOAF, but these are, by and large, still experimental.
As is the Semantic Web itself.
*Originally published on Webmonkey.com, August 2003. Page updated Sept. 2008.*