The (s|S)emantic (w|W)eb

“The semantic web is the future of the web and always will be”

Peter Norvig, speaking at YCombinator Startup School

I’m sick of Semantic Web hype from people who don’t understand what they are talking about. In the past I’ve often said <insert Semantic Web rant here> – now it’s time to write it down.

There are two things people mean when they say the “semantic web”. They might mean the W3C vision of the “Semantic Web” (note the capitalization) of intelligent data, usually in the form of RDF, but sometimes microformats. Most of the time, people who talk about this aren’t really having a technology discussion but are attempting a religious conversion. I’ve been down that particular road to Damascus, and the bright light turned out to be yet another demonstrator system which worked well on a very limited dataset, but couldn’t cope with this thing we call the web.

The other thing people mean by the “semantic web” is the use of algorithms to attempt to extract meaning (semantics) from data. Personally, I think there’s a lot of evidence to show that this approach works well and can cope with real-world data (from the web or elsewhere). For example, the Google search engine (ignoring Google Base) is primarily an algorithmic way of extracting meaning from data and works adequately in many situations. Bayesian filtering on email is another example – while it’s true that email spam remains a huge problem, it’s also true that algorithmic approaches to filtering it have been the best solution we’ve found.
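For anyone who hasn't seen one, here is a minimal sketch of what Bayesian filtering looks like. It is a toy naive Bayes scorer, not how any particular mail filter is actually implemented: the function names, the four training messages and the whitespace tokenizer are all made up for illustration, and it assumes the training data contains at least one message of each label.

```python
import math
from collections import Counter


def train(messages):
    """Count words per label. messages is an iterable of (text, label)
    pairs, where label is 'spam' or 'ham' (hypothetical labels)."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts


def spam_log_odds(text, word_counts, doc_counts):
    """Return the log-odds that text is spam; positive means 'looks like spam'."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    spam_total = sum(word_counts["spam"].values())
    ham_total = sum(word_counts["ham"].values())
    # Start from the prior odds of spam versus ham.
    score = math.log(doc_counts["spam"] / doc_counts["ham"])
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        # Add-one (Laplace) smoothing so a missing word never gives log(0).
        p_spam = (word_counts["spam"][word] + 1) / (spam_total + len(vocab))
        p_ham = (word_counts["ham"][word] + 1) / (ham_total + len(vocab))
        score += math.log(p_spam / p_ham)
    return score


# Invented training data, purely for illustration.
TRAINING = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
model = train(TRAINING)
```

With this toy data, `spam_log_odds("buy cheap pills", *model)` comes out positive and `spam_log_odds("meeting tomorrow", *model)` comes out negative. Nobody marked up the messages: the meaning, such as it is, was extracted from the data itself.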

The problem with this dual meaning is that many people use it to weasel out of addressing challenges. Typically, the conversation will go something like this:

Semantic Web great, solve world hunger, cure the black plague, bring peace and freedom to the world blah blah blah…

But what about spam?

Semantic Web great, trusted data sources automagically discovered, queries can take advantage of these relationships blah blah blah…

But isn’t that hard?

No, it’s what search engines have to do at the moment. The semantic web (note the case change!) will also extract relationships in the same way.

So… we just have to mark up all our data using a strict format, and then we still have to do the thing that is hard about writing a search engine now – spam detection.

Yes, but it’s much easier because the data is much better.

Well, it’s sort of easier to parse, and in RDF form it is more self-descriptive (but more complicated), but that only helps if you trust it already.

Well, that’s easy then – you only use it from trusted sources.

Excellent – let’s create another demo system that works well on limited data but can’t cope with this thing called the web.

Look – I don’t think the RDF data model is bad – in fact, I’m just starting a new project where I’m basing my data model on it. But the problem is that people claim that RDF, microformats and other “Semantic Web” technologies will somehow make extracting information from the web easier. That’s true as far as it goes – extracting information will be easier. But the hard problem – working out what is trustable and useful – is ignored.
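To make the point concrete, here is a rough sketch of the split between the easy part and the hard part. It models RDF-style statements as plain (subject, predicate, object) tuples rather than using a real RDF library, and the source URLs, the `ex:` names and the `match` helper are all invented for this example.

```python
# Each entry pairs a (hypothetical) source with one
# (subject, predicate, object) triple it asserts.
STATEMENTS = [
    ("http://shop.example/data", ("ex:WidgetX", "ex:price", "10")),
    ("http://spam.example/data", ("ex:WidgetX", "ex:price", "1")),
]


def match(statements, subject=None, predicate=None, obj=None):
    """Return (source, triple) pairs matching the pattern; None is a wildcard."""
    results = []
    for source, (s, p, o) in statements:
        if subject in (None, s) and predicate in (None, p) and obj in (None, o):
            results.append((source, (s, p, o)))
    return results


# The easy part: structured data makes the query trivial.
prices = match(STATEMENTS, subject="ex:WidgetX", predicate="ex:price")
values = {triple[2] for _, triple in prices}

# The hard part: two sources disagree, and nothing in the triples
# themselves says which one to believe.
conflicting = len(values) > 1
```

Here `values` ends up as `{"10", "1"}` and `conflicting` is true. The query took a handful of lines; deciding which source deserves to be believed is the part that still needs something like a search engine's spam detection.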

The Semantic Web needs a tagline – I’d suggest something like:

Semantic Web technologies: talking about trying to solve easy problems since 2001.

RDF could have one, too:

RDF: Static Typing for the web – now with added complexity tax.

So that’s my rant over. One day I promise to write something other than rants here – I’ve actually been studying Java versions of Quicksort quite hard, and I’ve got some interesting observations about micro optimizations. One day.. I promise…