The (s|S)emantic (w|W)eb

“The semantic web is the future of the web and always will be”

Peter Norvig, speaking at YCombinator Startup School

I’m sick of Semantic Web hype from people who don’t understand what they are talking about. In the past I’ve often said <insert Semantic Web rant here> – now it’s time to write it down.

There are two things people mean when they say the “semantic web”. They might mean the W3C vision of the “Semantic Web” (note the capitalization) of intelligent data, usually in the form of RDF, but sometimes microformats. Most of the time people who talk about this aren’t really having a technology discussion but are attempting a religious conversion. I’ve been down that particular road to Damascus, and the bright light turned out to be yet another demonstrator system which worked well on a very limited dataset, but couldn’t cope with this thing we call the web.

The other thing people mean by the “semantic web” is the use of algorithms to attempt to extract meaning (semantics) from data. Personally I think there’s a lot of evidence to show that this approach works well and can cope with real world data (from the web or elsewhere). For example, the Google search engine (ignoring Google Base) is primarily an algorithmic way of extracting meaning from data and works adequately in many situations. Bayesian filtering on email is another example – while it’s true that email spam remains a huge problem, it’s also true that algorithmic approaches to filtering it have been the best solution we’ve found.
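To make the Bayesian filtering example concrete, here is a toy sketch of a naive Bayes classifier. It is only an illustration of the general technique: the class name `NaiveBayesFilter`, the tokenizer, and the four training messages are all made up for this post, and no real mail filter is this simple.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    """Toy two-class naive Bayes classifier with add-one smoothing.

    Hypothetical illustration only; assumes both classes have been
    trained at least once before classify() is called.
    """

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the words of one labelled message."""
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def log_score(self, text, label):
        """Log of P(label) times the product of P(word | label)."""
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        score = math.log(self.message_counts[label] / total_messages)
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        return score

    def classify(self, text):
        """Return whichever label has the higher log score."""
        return max(("spam", "ham"), key=lambda label: self.log_score(text, label))


# Made-up training data, far too small to be useful in practice.
spam_filter = NaiveBayesFilter()
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("win money now click here", "spam")
spam_filter.train("meeting notes attached for tomorrow", "ham")
spam_filter.train("lunch tomorrow with the project team", "ham")

print(spam_filter.classify("buy cheap pills"))
print(spam_filter.classify("notes from the team meeting"))
```

Nothing in the messages is marked up or declared trustworthy. The classifier only counts words in examples that somebody has already labelled, which is the sense in which this approach is algorithmic.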

The problem with this dual meaning is that many people use it to weasel out of addressing challenges. Typically, the conversation will go something like this:

Semantic Web great, solve world hunger, cure the black plague, bring peace and freedom to the world blah blah blah…

But what about spam?

Semantic Web great, trusted data sources automagically discovered, queries can take advantage of these relationships blah blah blah…

But isn’t that hard?

No, it’s what search engines have to do at the moment. The semantic web (note the case change!) will also extract relationships in the same way.

So.. we just have to mark up all our data using a strict format, and then we still have to do the thing that is hard about writing a search engine now – spam detection.

Yes, but it’s much easier because the data is much better.

Well, it’s sort of easier to parse, and in RDF form it is more self descriptive (but more complicated), but that only helps if you trust it already.

Well that’s easy then – you only use it from trusted sources.

Excellent – let’s create another demo system that works well on limited data but can’t cope with this thing called the web.

Look – I don’t think the RDF data model is bad – in fact, I’m just starting a new project where I’m basing my data model on it. But the problem is that people claim that RDF, microformats and other “Semantic Web” technologies will somehow make extracting information from the web easier. That’s true insofar as it goes – extracting information will be easier. But the hard problem – working out what is trustable and useful – is ignored.
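A small sketch of the gap between extracting and trusting. It models RDF-style subject/predicate/object statements as plain Python tuples rather than using a real RDF library, and every identifier in it (`ex:AcmeWidget`, `shop.example`, `spam.example`, the `query` helper) is invented for illustration.

```python
from collections import namedtuple

# One subject/predicate/object statement plus the (hypothetical) site
# that published it. Real RDF has no built-in notion of "believable".
Statement = namedtuple("Statement", "subject predicate obj source")

statements = [
    Statement("ex:AcmeWidget", "ex:price", "19.99", "shop.example"),
    Statement("ex:AcmeWidget", "ex:price", "0.01", "spam.example"),
    Statement("ex:AcmeWidget", "ex:rating", "5", "spam.example"),
]


def query(statements, subject, predicate, trusted_sources=None):
    """Return matching objects, optionally only from trusted sources."""
    return [
        s.obj
        for s in statements
        if s.subject == subject
        and s.predicate == predicate
        and (trusted_sources is None or s.source in trusted_sources)
    ]


# Extraction is the easy part: both prices come back, equally well-formed.
print(query(statements, "ex:AcmeWidget", "ex:price"))

# Choosing between them needs a trust list that the data cannot supply.
print(query(statements, "ex:AcmeWidget", "ex:price", {"shop.example"}))
```

The second call only works because somebody hand-picked `shop.example` as trusted, which brings us straight back to the demo system on limited data.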

The Semantic Web needs a tagline – I’d suggest something like:

Semantic Web technologies: talking about trying to solve easy problems since 2001.

RDF could have one, too:

RDF: Static Typing for the web – now with added complexity tax.

So that’s my rant over. One day I promise to write something other than rants here – I’ve actually been studying Java versions of Quicksort quite hard, and I’ve got some interesting observations about micro optimizations. One day.. I promise…

One thought on “The (s|S)emantic (w|W)eb”

Sergey Mikhanov says:

October 20, 2008 at 8:26 am

By lowercasing the Semantic Web you are taking the semantics out of it: what the Google search engine does has nothing to do with the real semantics of the data under analysis. It is (like the Bayesian filtering you put next to it) mere statistical analysis of the text, taking into account the internal markup structure of the HTML and the external structure of the web graph, made of pages connected by links.

I bet you knew this already, but still.

The important thing about the lowercased semantic web is that there were two groups of people dealing with it: W3C committee members, and mere web developers.

The first were saying: “HTML does not contain any semantic structure, only markup, so let’s create a parallel universe where the real semantic structure of every web site is encoded in the corresponding RDF”.

The second answered: “It is only partly true that HTML does not contain any semantic structure; indeed it was developed with structure in mind, while CSS was responsible for how things looked.”

The point is that when a mere web developer uses a ul/li tag combination, it is almost always a list. And when it has a CSS class named ‘address’ assigned to it, the list is almost always intended to represent an address. So, on one hand you may define the CSS for the ‘address’ class to render the HTML representation of the address as you please, and on the other you may extract (tada!) semantics from the ul/li combination (namely, the value of someone’s address). This is what microformats (or the lowercased semantic web, but NOT Google’s search) are all about.
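The ul/li extraction described above can be sketched with Python’s standard `html.parser` module. This is a minimal illustration, not a microformats parser: the class name ‘address’ follows the example in this comment rather than the real hCard vocabulary (which uses names such as `adr` and `street-address`), the `AddressListParser` class and the sample HTML are made up, and nested lists are not handled.

```python
from html.parser import HTMLParser


class AddressListParser(HTMLParser):
    """Collect the text of <li> items inside <ul class="address">.

    Hypothetical helper for illustration; does not handle nested lists.
    """

    def __init__(self):
        super().__init__()
        self.in_address = False
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "ul" and "address" in classes:
            self.in_address = True
        elif tag == "li" and self.in_address:
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_address = False
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        # Only keep text that sits inside an address list item.
        if self.in_item:
            self.items[-1] += data


# Made-up page fragment: one navigation list and one address list.
html = """
<ul class="nav"><li>Home</li></ul>
<ul class="address">
  <li>10 Example Street</li>
  <li>Springfield</li>
</ul>
"""

parser = AddressListParser()
parser.feed(html)
address = [item.strip() for item in parser.items]
print(address)
```

The navigation list is skipped because it lacks the class, so the same markup that drives the CSS also tells the extractor which list is the address.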

The Semantic Web is a thing that got too much publicity. Along with semantic nets, frame hierarchies, and description logics, it tried to deal with the problem of knowledge representation and processing. To date it has been unable to solve any major real-world problem, but that’s OK for an ongoing research initiative! Just as it is OK for a simple technical solution to a complex problem to work almost always (I am talking about the semantic web/microformats now).

BadMagicNumber

My Blog, Take 4
