
The (s|S)emantic (w|W)eb

“The semantic web is the future of the web and always will be”

Peter Norvig, speaking at YCombinator Startup School

I’m sick of Semantic Web hype from people who don’t understand what they are talking about. In the past I’ve often said <insert Semantic Web rant here> – now it’s time to write it down.

There are two things people mean when they say the “semantic web”. They might mean the W3C vision of the “Semantic Web” (note the capitalization) of intelligent data, usually in the form of RDF but sometimes microformats. Most of the time, people who talk about this aren’t really having a technology discussion but are attempting a religious conversion. I’ve been down that particular road to Damascus, and the bright light turned out to be yet another demonstrator system which worked well on a very limited dataset but couldn’t cope with this thing we call the web.

The other thing people mean by the “semantic web” is the use of algorithms to attempt to extract meaning (semantics) from data. Personally, I think there’s a lot of evidence that this approach works well and can cope with real-world data (from the web or elsewhere). For example, the Google search engine (ignoring Google Base) is primarily an algorithmic way of extracting meaning from data, and it works adequately in many situations. Bayesian filtering of email is another example – while it’s true that email spam remains a huge problem, it’s also true that algorithmic approaches to filtering it have been the best solution we’ve found.
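To make that concrete: the trick behind a Bayesian filter is roughly to count how often each word appears in spam versus legitimate mail, then combine the per-word odds. Here’s a minimal sketch of the idea – not any real filter’s code, and all the names are invented:

```java
import java.util.HashMap;
import java.util.Map;

// A minimal sketch of the naive Bayes idea behind spam filters:
// count how often each word appears in spam vs. ham, then combine
// the per-word log-odds. Illustrative only - real filters do much
// smarter tokenization, smoothing and training.
public class BayesFilterSketch {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private long spamWords = 0, hamWords = 0;

    public void train(String message, boolean isSpam) {
        for (String word : message.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            if (isSpam) { spamCounts.merge(word, 1, Integer::sum); spamWords++; }
            else        { hamCounts.merge(word, 1, Integer::sum);  hamWords++; }
        }
    }

    // Positive score means "more spam-like". Add-one smoothing keeps
    // words we've never seen from blowing up the logarithms.
    public double score(String message) {
        double score = 0.0;
        for (String word : message.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            double pSpam = (spamCounts.getOrDefault(word, 0) + 1.0) / (spamWords + 2.0);
            double pHam  = (hamCounts.getOrDefault(word, 0) + 1.0) / (hamWords + 2.0);
            score += Math.log(pSpam / pHam);
        }
        return score;
    }
}
```

Real filters add per-user training and far better tokenization, but the core is exactly this kind of statistics – meaning extracted from data, with no special markup required.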

The problem with this dual meaning is that many people use it to weasel out of addressing challenges. Typically, the conversation will go something like this:

Semantic Web great, solve world hunger, cure the black plague, bring peace and freedom to the world blah blah blah…

But what about spam?

Semantic Web great, trusted data sources automagically discovered, queries can take advantage of these relationships blah blah blah…

But isn’t that hard?

No, it’s what search engines have to do at the moment. The semantic web (note the case change!) will also extract relationships in the same way.

So… we just have to mark up all our data using a strict format, and then we still have to do the thing that’s hard about writing a search engine now – spam detection.

Yes, but it’s much easier because the data is much better.

Well, it’s sort of easier to parse, and in RDF form it is more self-descriptive (but more complicated), but that only helps if you trust it already.

Well, that’s easy then – you only use it from trusted sources.

Excellent – let’s create another demo system that works well on limited data but can’t cope with this thing called the web.

Look – I don’t think the RDF data model is bad – in fact, I’m just starting a new project where I’m basing my data model on it. But the problem is that people claim that RDF, microformats and other “Semantic Web” technologies will somehow make extracting information from the web easier. That’s true as far as it goes – extracting information will be easier. But the hard problem – working out what is trustworthy and useful – is ignored.
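For anyone who hasn’t looked at it: the RDF model boils down to (subject, predicate, object) triples, and a graph is just a set of them. A toy sketch of that shape – the class names are mine, not from any RDF library:

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of the RDF data model: everything is a
// (subject, predicate, object) triple, and a graph is a set of
// them. Names are illustrative, not from any real RDF library.
record Triple(String subject, String predicate, String object) {}

class TripleStore {
    private final List<Triple> triples = new ArrayList<>();

    void add(String s, String p, String o) {
        triples.add(new Triple(s, p, o));
    }

    // Query with nulls as wildcards, e.g. find(null, "dc:creator", null).
    List<Triple> find(String s, String p, String o) {
        return triples.stream()
            .filter(t -> (s == null || t.subject().equals(s))
                      && (p == null || t.predicate().equals(p))
                      && (o == null || t.object().equals(o)))
            .toList();
    }
}
```

That shape is genuinely pleasant to build on, which is why none of my complaints above are about the model itself.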

The Semantic Web needs a tagline – I’d suggest something like:

Semantic Web technologies: talking about trying to solve easy problems since 2001.

RDF could have one, too:

RDF: Static Typing for the web – now with added complexity tax.

So that’s my rant over. One day I promise to write something other than rants here – I’ve actually been studying Java versions of Quicksort quite hard, and I’ve got some interesting observations about micro-optimizations. One day… I promise…

The problem with OpenID is…

The problem with OpenID is branding – people get (very) confused when they get taken off site to login. I’ve watched usability testing of this, and it is truly horrible. Obviously this isn’t unique to OpenID – it applies equally to any federated identity solution (in fact – Shibboleth based federations are even worse than OpenID in this respect).

I think user education will help, but it would be really good to be able to extend OpenID to put a logo on the identity provider’s site, so the user can see they are logging into site “blah” via whatever OpenID provider.
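To be concrete about what I mean, here’s a hypothetical sketch – the “logo” namespace and its fields below are entirely made up, nothing like this exists in the OpenID spec – of a relying party passing its name and logo along with the usual parameters:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// A hypothetical sketch of the kind of extension I mean: the relying
// party passes its display name and a logo URL along with the normal
// OpenID parameters, so the provider can show "logging into X".
// The "logo" namespace and its fields are invented for illustration;
// no such extension exists in the OpenID spec.
public class LogoExtensionSketch {
    public static String authRequestUrl(String providerEndpoint,
                                        String returnTo,
                                        String siteName,
                                        String logoUrl) {
        return providerEndpoint
            + "?openid.mode=checkid_setup"
            + "&openid.return_to=" + enc(returnTo)
            + "&openid.ns.logo=" + enc("http://example.org/openid/logo/1.0")
            + "&openid.logo.site_name=" + enc(siteName)
            + "&openid.logo.url=" + enc(logoUrl);
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

The identity provider could then show “Logging into site blah” next to the logo, rather than an anonymous-looking login form.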

Dataportability: Did anyone ask the users? – Part 2

I got a bit of feedback on my previous post about dataportability. The general gist was that because you can move your contacts from one email system to another (or export them), data portability must be good.

I’m not sure I agree. I think that joining a new social application and automatically finding existing contacts on that system is functionality that is likely to cause problems for users.

Each social application is a different context, and people use them in different ways. Mid last year I expressed my concerns about this on the Social Network Portability group:

Everyone’s heard the stories of how employers are checking out possible employees on Facebook. This system will not only find them on Facebook, but find their user id on that new Playboy social network for college students (http://www.techcrunch.com/2007/08/22/new-playboy-social-network-built-on-ning/). That’s not a good thing to do to people..

danah boyd wrote about similar issues:

I lost control over my Facebook tonight. Or rather, the context got destroyed. For months, I’ve been ignoring most friend requests. Tonight, I gave up and accepted most of them. I have been facing the precise dilemma that I write about in my articles: what constitutes a “friend”? Where’s the line?

….

I know people generally believe that growth is nothing but candy-coated goodness. And while I hate using myself as an example (cuz I ain’t representative), I do feel the need to point out that context management is still unfun, especially for early adopters, just as it has been on every other social network site. It sucks for teens trying to balance mom and friends. It sucks for college students trying to have a social life and not piss off their profs. It sucks for 20-somethings trying to date and balance their boss’s presence.

Back then I was all over using bloom filters as a way of attempting to preserve people’s privacy. I’ve given that up now – it’s a nice hack but it doesn’t really fix anything.
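For the curious, the idea was roughly this: each user publishes a bloom filter of hashed contact identifiers rather than the contacts themselves, so another site can test for a match without being handed the list. A toy sketch (the sizes and hashing are illustrative only):

```java
import java.util.BitSet;

// A minimal sketch of the bloom-filter idea: publish a filter of
// hashed contact identifiers instead of the identifiers themselves.
// Others can test "is alice@example.com probably in this person's
// contacts?" without receiving the contact list. Sizes and hash
// scheme here are illustrative only.
public class ContactBloomFilter {
    private static final int BITS = 1 << 16;  // 65,536 bits
    private static final int HASHES = 4;
    private final BitSet bits = new BitSet(BITS);

    public void add(String contactId) {
        for (int i = 0; i < HASHES; i++) {
            bits.set(indexFor(contactId, i));
        }
    }

    // True means "probably a contact" (false positives are possible);
    // false means "definitely not a contact".
    public boolean mightContain(String contactId) {
        for (int i = 0; i < HASHES; i++) {
            if (!bits.get(indexFor(contactId, i))) return false;
        }
        return true;
    }

    private static int indexFor(String contactId, int seed) {
        // Derive the k bit positions by salting the hash with the seed.
        int h = (contactId + "#" + seed).hashCode();
        return Math.floorMod(h, BITS);
    }
}
```

The catch – and a big part of why it doesn’t really fix anything – is that anyone can still test guessed identifiers against the filter.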

Moving your email contacts between systems is fine for both parties because it’s the same context – email. Being linked to your boss on LinkedIn and having them automatically find you on a dating site you are both a member of is going to put a lot of users off.

Dataportability: Did anyone ask the users?

If you believe what you read on blogs, then dataportability is all peace and light and yayness. Unfortunately it seems someone forgot to tell the rest of the internets.

It seems that many people who want data portability are either not representative of the general population or have agendas involving trying to get as many people as possible onto their site.

This isn’t just a privacy concern: I continue to think that the difference in context between different social applications is a key constraint that people are missing.

I’m increasingly of the view that moving contacts from one application to another should require both parties to agree that they want to be visible to each other in the new context. It isn’t clear to me that this is a viable process, though.
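The mechanics are easy enough – something like the double opt-in sketched below (invented names, purely illustrative). It’s getting both users to actually go through with it that I have doubts about:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A sketch of the double opt-in I have in mind: a contact link only
// becomes visible in the new application once *both* people have
// agreed to carry it over. Names are invented for illustration.
public class PortabilityConsent {
    // Unordered pair of user ids -> the users who have agreed so far.
    private final Map<String, Set<String>> agreements = new HashMap<>();

    public void agree(String user, String otherUser) {
        agreements.computeIfAbsent(pairKey(user, otherUser), k -> new HashSet<>())
                  .add(user);
    }

    // Only a mutually confirmed link is visible in the new context.
    public boolean visible(String userA, String userB) {
        Set<String> agreed = agreements.get(pairKey(userA, userB));
        return agreed != null && agreed.contains(userA) && agreed.contains(userB);
    }

    private static String pairKey(String a, String b) {
        return a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
    }
}
```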

(Something that should be obvious but apparently isn’t: portable APIs for web-based social applications – ie OpenSocial – are extremely good. Moving of personal data – ie data portability – is something I still think is questionable.)

Less than 3 months

Exhibit A:

Facebook will have a huge leak of personal private information. It will turn out to be due to buggy code, which will finally focus some attention on the fact that Facebook’s codebase appears to be really, really bad.

Exhibit B:

The Associated Press reported this afternoon that its reporters were able to use an undisclosed method to access private photos on Facebook, including some from Paris Hilton at the Emmys and others from Facebook founding CEO Mark Zuckerberg’s vacation in November of 2005.

I still think there are going to be worse lapses than this by the end of 2008.

Firefox 3 on Linux

I’ve been using Ubuntu at home on one of my computers for close to a year now. I’ve been pretty happy with it, although Gnome struggled on that machine (a circa 2003 Athlon). Switching to Xfce fixed that, and my one remaining problem was Firefox.

For those who haven’t tried Firefox 2 on Linux, it’s pretty bad. If you leave a JavaScript-heavy site (eg GMail) open, the browser will slowly grind to a halt over the course of a few hours.

I recently upgraded to Firefox 3 (see this video for how to do that), and it’s made a HUGE difference. The one issue I had was that I couldn’t get it to start – I hadn’t realized that the executable was now firefox-3.0 instead of firefox. Makes sense, though.

Why tech predictions are stupid (and a small prediction)

Every year hundreds of tech pundits go and make their predictions for the year – a trend I’m not immune to either. Alan Kay explained the problem with this best: “The best way to predict the future is to invent it”. In a field like computing, it is so easy for a single person to build something new that trying to make predictions is a pointless exercise.

Nonetheless, here’s something that is less of a prediction and more an exercise in deduction and rumor mongering. Sun is planning to launch a direct competitor to Amazon’s EC2 in the near future (not sure when exactly, but 2008 for sure). Note that this is different to the existing Sun Grid product (which will presumably continue).

Shipping software part 2

In shipping software I spoke briefly about me.edu.au (which is still taking a good amount of my time). Recently, though, I’ve been spending a lot of time preparing education.au’s Java-based federated search product (the Distributed Search Manager, now OpenDSM) for release as an open source product. That’s been an interesting experience – the code is pretty old, and was glued together using static references. I had to pull it apart, replace the static references with factories (changing to dependency injection wasn’t realistic, for this release at least) and put it back together. It’s kind of odd working on a project like that – the code almost causes me pain at times, but with a product that is stable and reliable I don’t want to make too many changes just because I don’t like the style.
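The shape of the change was roughly this – all the names are invented, and the real OpenDSM code is of course messier:

```java
import java.util.List;

// The shape of the refactoring, roughly. All names are invented for
// illustration; the real OpenDSM code looks different.
interface Backend {
    List<String> query(String q);
}

// Before: components grabbed a shared static reference directly,
// so nothing could be substituted for testing or configuration.
class StaticBackendHolder {
    static Backend BACKEND; // set once at startup, referenced everywhere
}

class OldSearchManager {
    List<String> search(String query) {
        return StaticBackendHolder.BACKEND.query(query);
    }
}

// After: a factory supplies the dependency. Not full dependency
// injection, but the static coupling is gone and a test can pass
// in a factory that returns a stub backend.
interface BackendFactory {
    Backend create();
}

class SearchManager {
    private final BackendFactory factory;

    SearchManager(BackendFactory factory) {
        this.factory = factory;
    }

    List<String> search(String query) {
        return factory.create().query(query);
    }
}
```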

Some readers of this blog may be interested in OpenDSM, because it allows results from multiple Solr servers (or OpenSearch services) to be federated into a single result set. That’s useful in quite a lot of places.
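Conceptually the federation is simple enough to sketch: fan the query out, then fold the ranked lists back into one. A toy version (nothing like the real OpenDSM API – real merging also has to worry about score normalisation and de-duplication):

```java
import java.util.ArrayList;
import java.util.List;

// A toy sketch of the federation idea, not the OpenDSM API: fan the
// query out to several servers, then round-robin the ranked result
// lists into a single result set.
public class FederatedMergeSketch {
    public static List<String> merge(List<List<String>> rankedLists) {
        List<String> merged = new ArrayList<>();
        int longest = rankedLists.stream().mapToInt(List::size).max().orElse(0);
        for (int rank = 0; rank < longest; rank++) {
            for (List<String> results : rankedLists) {
                if (rank < results.size()) {
                    merged.add(results.get(rank));
                }
            }
        }
        return merged;
    }
}
```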

It’s also the first time I’ve been paid to create open source code as an explicit goal – most of my open source work has been for pragmatic reasons, not as a goal in itself.