April 07, 2004

Web Categorization Scheme

One thing that the web can't seem to agree on is how to categorize its contents.

I've been building a fairly straightforward link management system. The plan was to build an application that could easily be integrated into a simple site, blog, or complex content management system, OR stand alone. I wanted a way to add, delete and edit links. I also wanted to be able to place those links into categories (not unlike Yahoo or DMOZ).

The first part of this application is done: I can add, delete and edit links.

The second part, the category manager, is designed (on paper), i.e. the logic is sorted out and the interface is sketched, BUT I seem to have hit a roadblock.

At first I thought I would simply create an ad-hoc hierarchical category scheme. A way for users to name their own categories and decide where to place the links within their own categorization scheme. But, after thinking about that for a while I realized that it might not be the best approach. I wanted something a little more robust. What if people wanted to share their link databases? What if people wanted to compare link databases? If everyone uses their own ad-hoc link categories, the application wouldn't lend itself well to sharing of information.
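
To make the sharing problem concrete, here is a rough sketch in Python of two users filing the same links under their own ad-hoc trees. The category names, URLs and function names are all made up for illustration; none of this is from the actual application.

```python
# Hypothetical sketch: each user's scheme maps a category path (a tuple)
# to the set of links filed there. All names and URLs are invented.

def links_by_url(scheme):
    """Invert a {category_path: {urls}} scheme into {url: category_path}."""
    return {url: path for path, urls in scheme.items() for url in urls}

def conflicts(a, b):
    """Return links that both users have but filed under different paths."""
    a_idx, b_idx = links_by_url(a), links_by_url(b)
    return {
        url: (a_idx[url], b_idx[url])
        for url in a_idx.keys() & b_idx.keys()
        if a_idx[url] != b_idx[url]
    }

alice = {
    ("Computers", "Programming", "Python"): {"http://example.org/python"},
    ("Reference", "Libraries"): {"http://example.org/library"},
}
bob = {
    ("Tech", "Languages"): {"http://example.org/python"},
    ("Books",): {"http://example.org/library"},
}

for url, (mine, theirs) in sorted(conflicts(alice, bob).items()):
    print(url, "/".join(mine), "vs", "/".join(theirs))
```

Both users have the same two links, but no category path lines up, so there is nothing to merge on except the URLs themselves.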

So I thought I would try a different approach. I figured that I would simply find the most standard/universal categorization scheme used on the web and build my app based on that. That way, users of the application could easily share their work with others and be fairly sure that their categorization schemes would be compatible.

The problem is that there isn't a standard/universal categorization scheme for the web.

WHAT!?!

That's what I said.

You see, everyone seems to have a different idea of how the web should be categorized. Yahoo came up with their own scheme. DMOZ came up with their own scheme. And everyone else came up with their own schemes.

While many of them are similar - they aren't based on any one standard way of categorizing information.

Then I thought - surely people would have looked to library science - Dewey came up with a scheme over 100 years ago and it's still used all over the world today. Hasn't someone ported that to the web yet?

The answer is yes and no. Dewey Decimal Classification (DDC) is often talked about - and looked to - but rarely implemented. One of the reasons is that it's not in the public domain.... WHAT?!!?

That's what I said. DDC is controlled by a group and in order to use it you must pay a licensing fee. For that fee you get the right to use the name Dewey. You also get access to annual updates of the categories and corresponding numbers.

But didn't Dewey die like a million years ago?

That's what I said.

He did - and technically you can use the scheme without paying a fee - but you can only use or make a derivative work from a version that is 80 years old. Any updates or changes or advances in the system that have been made since then are covered under new copyrights and trademarks and (I would bet) patents.

Okay - so there must be some sort of alternative.

UDC - Universal Decimal Classification. Based on Dewey - it's a more flexible, more (duh) universal system with different rules on how to add contents to categories. But it too is only available for a fee.

Okay - so why not use something like DMOZ or Yahoo as the basis? Because they are ad-hoc systems and can change without notice. Categories get added, changed or dropped on a regular basis. It's not reliable. It's not standardized. It's not practical.

So what are people doing about it?

Well - there is a large movement toward creating a semantic web. The logic is that if people mark up their content with good descriptive metadata - the web organizes itself. The categories are self-evident. The content itself tells you what it is. This allows you to organize the web any way you like. Everything can be sorted by the different facets that describe the content.
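
Here is a small sketch of what "sorted by facets" could look like in practice. It is Python, and the facet names (subject, language, type) and the sample links are assumptions I picked for illustration, not part of any standard.

```python
from collections import defaultdict

# Hypothetical link records. The facet names are invented for this example.
links = [
    {"url": "http://example.org/a", "subject": "library science", "language": "en", "type": "article"},
    {"url": "http://example.org/b", "subject": "library science", "language": "fr", "type": "directory"},
    {"url": "http://example.org/c", "subject": "programming", "language": "en", "type": "article"},
]

def group_by_facet(records, facet):
    """Group URLs by the value each record has for one metadata facet."""
    groups = defaultdict(list)
    for record in records:
        # Records missing the facet land in an explicit "(unknown)" bucket.
        groups[record.get(facet, "(unknown)")].append(record["url"])
    return dict(groups)

print(group_by_facet(links, "subject"))
print(group_by_facet(links, "language"))
```

The same three links get organized two different ways with no fixed category tree at all. The catch is that everyone has to agree on what the facets are called and what values they take.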

GREAT! But who decides how to describe the data? Who decides what constitutes good metadata? Who decides how to mark up your data?

Right back at square one. Everyone could technically come up with their own metadata schemes and apply them to individual documents (links). Or everyone could come up with their own automated way of extracting metadata from documents (like search engines do).

Thankfully there are some standards emerging. RDF and Dublin Core are leading the pack. In fact they have been adopted by tons of people in the business of sorting massive libraries of files distributed across heterogeneous internal networks and different machine types.

RDF describes how you mark up the documents. Dublin Core describes the language you should use.

Problem is that we are a long way off from a time when your typical web user is going to publish anything that complies with these standards.

Okay, so now what?
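
As a rough illustration of how the two fit together, here is a Python sketch that describes one link in RDF/XML using Dublin Core's title and subject elements. The two namespace URIs are the published ones for RDF and the Dublin Core element set; the function name, the example URL and the values are my own placeholders.

```python
import xml.etree.ElementTree as ET

# Published namespace URIs for RDF syntax and the Dublin Core element set.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)

def describe_link(url, title, subject):
    """Build an rdf:RDF element describing one link (hypothetical helper)."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # rdf:about names the resource being described.
    desc = ET.SubElement(root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": url})
    ET.SubElement(desc, f"{{{DC_NS}}}title").text = title
    ET.SubElement(desc, f"{{{DC_NS}}}subject").text = subject
    return root

doc = describe_link("http://example.org/", "An example link", "Library science")
print(ET.tostring(doc, encoding="unicode"))
```

RDF supplies the wrapper (a description "about" a URL), and Dublin Core supplies the words inside it. Note that dc:subject is just free text here, so the question of which categorization scheme fills it in is still open.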

Well - there are still a lot of people who think that there is room for categorization based on schemes like DDC or UDC or whatever. After all, can't you have a hybrid of both? Neither one gets in the way of the other. In fact they are quite complementary.

And until the next generation of the web with robust metadata is available - links can still be organized by a standard categorization scheme. And as the technology advances to automatically mark up documents with robust metadata - it will certainly make it easier to do so, as the categories themselves are a form of metadata (i.e. part of the job will already be done).

Okay - so what am I supposed to do? I'm still searching - because I have to believe that there is an open source derivative version of DDC or UDC floating around on the web. There has got to be a database that exists that can be used as a starting point.

I refuse to re-invent the wheel. Library science has been at this categorization thing for a long time. Why should I be so vain as to think I can come up with something better on my own?

Next step: While thinking about all this and going down this path of zen learning, I've realized that there is a killer app for what I am doing. Think of it as a distributed DMOZ. An open source directory of the web based on a standardized categorization scheme. Everyone is allowed to be an editor - but the democratic nature of the web would decide whose link collections get added to their own - and whose get excluded.

Every person that uses this link management tool automatically provides an RSS feed of their link collection. Anyone can download anyone else's link collection and integrate it with their own if they like. Or you can simply browse a collection of link collections from a central point. Or whatever. The options are limitless.
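
Here is one way the publish-and-merge idea could work, sketched in Python. It writes a collection out as a minimal RSS 2.0 feed (using the standard item-level category element to carry the category), reads one back in, and merges it. The function names, URLs and category codes like "A1" are invented placeholders, and "keep my own entry on a duplicate URL" is just one possible merge rule.

```python
import xml.etree.ElementTree as ET

def to_rss(title, feed_url, links):
    """Serialize (url, title, category) tuples as a minimal RSS 2.0 feed."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link and description are the required channel elements.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = feed_url
    ET.SubElement(channel, "description").text = "Link collection"
    for url, link_title, category in links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = link_title
        ET.SubElement(item, "link").text = url
        ET.SubElement(item, "category").text = category
    return ET.tostring(rss, encoding="unicode")

def from_rss(xml_text):
    """Read (url, title, category) tuples back out of a feed made by to_rss."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext("link"), item.findtext("title"), item.findtext("category"))
        for item in root.iter("item")
    ]

def merge(mine, theirs):
    """Add links from another collection; my entry wins when a URL repeats."""
    seen = {url for url, _, _ in mine}
    return mine + [entry for entry in theirs if entry[0] not in seen]

mine = [("http://example.org/a", "A", "A1")]
their_feed = to_rss(
    "Bob's links",
    "http://example.net/links.rss",
    [("http://example.org/a", "A again", "Z9"), ("http://example.org/b", "B", "B2")],
)
merged = merge(mine, from_rss(their_feed))
print(merged)
```

The merge only stays simple if both collections use the same category codes, which is the whole argument for a standardized scheme.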

I will add supporting links to this document and fix up the spelling errors later - but I thought I would publish this now.

Anyone want to help with this project?

andre

Posted by andre at April 7, 2004 07:28 AM