Archive for random

The (s|S)emantic (w|W)eb

“The semantic web is the future of the web and always will be”

Peter Norvig, speaking at YCombinator Startup School

I’m sick of Semantic Web hype from people who don’t understand what they are talking about. In the past I’ve often said <insert Semantic Web rant here> - now it’s time to write it down.

There are two things people mean when they say the “semantic web”. They might mean the W3C vision of the “Semantic Web” (note the capitalization) of intelligent data, usually in the form of RDF, but sometimes microformats. Most of the time people who talk about this aren’t really having a technology discussion but are attempting a religious conversion. I’ve been down that particular road to Damascus, and the bright light turned out to be yet another demonstrator system which worked well on a very limited dataset, but couldn’t cope with this thing we call the web.

The other thing people mean by the “semantic web” is the use of algorithms to attempt to extract meaning (semantics) from data. Personally I think there’s a lot of evidence to show that this approach works well and can cope with real world data (from the web or elsewhere). For example, the Google search engine (ignoring Google Base) is primarily an algorithmic way of extracting meaning from data and works adequately in many situations. Bayesian filtering on email is another example - while it’s true that email spam remains a huge problem it’s also true that algorithmic approaches to filtering it have been the best solution we’ve found.

The problem with this dual meaning is that many people use it to weasel out of addressing challenges. Typically, the conversation will go something like this:

Semantic Web great, solve world hunger, cure the black plague, bring peace and freedom to the world blah blah blah…

But what about spam?

Semantic Web great, trusted data sources automagically discovered, queries can take advantage of these relationships blah blah blah…

But isn’t that hard?

No, it’s what search engines have to do at the moment. The semantic web (note the case change!) will also extract relationships in the same way.

So.. we just have to mark up all our data using a strict format, and then we still have to do the thing that is hard about writing a search engine now - spam detection.

Yes, but it’s much easier because the data is much better.

Well, it’s sort of easier to parse, and in RDF form it is more self descriptive (but more complicated), but that only helps if you trust it already.

Well that’s easy then - you only use it from trusted sources

Excellent - let’s create another demo system that works well on limited data but can’t cope with this thing called the web.

Look - I don’t think the RDF data model is bad - in fact, I’m just starting a new project where I’m basing my data model on it. But the problem is that people claim that RDF, microformats and other “Semantic Web” technologies will somehow make extracting information from the web easier. That’s true as far as it goes - extracting information will be easier. But the hard problem - working out what is trustworthy and useful - is ignored.

The Semantic Web needs a tagline - I’d suggest something like:

Semantic Web technologies: talking about trying to solve easy problems since 2001.

RDF could have one, too:

RDF: Static Typing for the web - now with added complexity tax.

So that’s my rant over. One day I promise to write something other than rants here - I’ve actually been studying Java versions of Quicksort quite hard, and I’ve got some interesting observations about micro optimizations. One day.. I promise…


Podcasted

Does it make sense to say that I got podcasted?

Anyway, I did - or at least I had a good conversation when my network connection didn’t drop out. I haven’t listened to it yet - not sure I enjoy listening to myself. But as I said on Twitter - now I’m a legend in my own lunchbox.


Random stuff

It’s Friday afternoon, so here’s some random stuff:

  • We live across the road from a park, and most Saturday mornings some guy rides his bike there to do yoga. He also brings his pet chicken to the park and lets it run around. (This might be normal behavior in San Francisco or somewhere, but in suburban Adelaide it is kinda odd.)
  • Alex is now 2, and doesn’t like sleeping at childcare. Fortunately, they have figured out that letting him sleep with a ladder (yes, a full size, aluminum ladder) will calm him down and get him to sleep.
  • Paul Keating - no matter if you loved him or hated him - had a unique way with words. From yesterday’s Financial Review: “When push came to shove, McGuinness’s journalism did not add up to a row of beans. He held more political, philosophic and economic positions than would have the Kama Sutra had it been a philosophic text”.
  • If you don’t program, and you write about the meaning of programming APIs then your opinion is moot. This also applies if you try and talk about APIs.
  • The Moth is a cool boat, but has come a long way since my circa-1970 tunnel hulled version. It’s kind of weird that they banned tunnel hulls, but freaking hydrofoils are okay…


Missing the real story on Ning statistics

Last week there was a bit of news traffic about some of the content that is on Ning. Whatever…

I think the really interesting story to come from those Quantcast stats (if you trust them) is the Share of Visits. 86% of the visits to Ning come from regular or addicted visitors?! That’s some pretty good stickiness.


Predictions for 2008

So it turns out that it’s 2008 and the thing to do is to do predictions for the next year. Here’s my 2:

  1. Facebook will have a huge leak of personal private information. It will turn out to be due to buggy code, which will finally focus some attention on the fact that Facebook’s codebase appears to be really, really bad.
  2. Someone will realize that recommendations are the next search. Some company will work out how to do for recommendations what Google did for search: i.e., take what is currently an overly commercial medium (e.g., Amazon recommendations) and turn it into a consumer-facing tool which is generally useful. By 2010 what they did will seem obvious, and by 2011 they will be billionaires.

Update - 1 more thing:

OpenSocial will succeed in a big way - not because of support from the big players (Google etc) but because lots of small open source web projects (WordPress, Drupal, Joomla etc) can easily add support and will finally have a standard way of creating cross-platform compatible software.


The Napster (Grokster?) of Facebook

IANAL, but how can Audibie possibly be legal? Since the doctrine of inducement appeared (ref Grokster) I can’t see how the DMCA safe-harbor provisions would save them. Perhaps they are relying on the fact that they don’t host the files themselves - although that didn’t save Grokster or Napster.

It’s interesting to think what Facebook’s liability would be over an application like this. Facebook currently have a copyright policy which passes responsibility for DMCA takedown requests onto the application author. Audibie have posted their takedown procedures, in accordance with the DMCA.

If I was Facebook I’d be pretty worried that might not be enough.


Solr + Hibernate

Solr is good software. Hibernate is good software, and with Hibernate Search it uses Lucene for full text search.

It’s possible to configure Solr to use arbitrary Lucene indexes. I think it would be great if someone (else!) would do the work to configure Solr to work with Hibernate Search.


Quick & Dirty Server Monitoring

Sometimes it’s difficult to set up Nagios for server monitoring. This is what I do instead.

Firstly, for load monitoring:


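#!/bin/bash
# Append a timestamped sample of the 1, 5 and 15 minute load averages to today's log.
# strftime() and systime() are gawk extensions, not POSIX awk.

FILENAME=< absolute path >/monitoring/logs/load-$(date +%Y%m%d).txt

awk '{print strftime("%Y/%m/%d %H:%M:%S", systime()), $1, $2, $3}' /proc/loadavg >> "$FILENAME"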

Run it from cron; I then use another cron script and gnuplot to graph the output.

genloadgraph.sh:



#!/bin/bash
# Usage: genloadgraph.sh [YYYYMMDD]  (defaults to today)
DATE=${1:-$(date +%Y%m%d)}
FILENAME=load-$DATE.txt
# loadplot.p reads a fixed file name, so copy the chosen day's log into place first
cp < absolute path >/monitoring/logs/"$FILENAME" < absolute path >/monitoring/load.txt
gnuplot < absolute path >/monitoring/loadplot.p
rm < absolute path >/monitoring/load.txt

loadplot.p:


set terminal png large size 800,600
set xdata time
set timefmt "%Y/%m/%d %H:%M:%S"
set title "Load"
set format x "%H:%M:%S"
set out '< absolute path >/monitoring/load.png'
plot "< absolute path >/monitoring/load.txt" using 1:3 title '1 min average' with lines, \
     "< absolute path >/monitoring/load.txt" using 1:4 title '5 min average' with lines, \
     "< absolute path >/monitoring/load.txt" using 1:5 title '15 min average' with lines
set output

Gives a graph like this:

Load Graph

It’s possible to do a similar thing for website monitoring:



#!/bin/bash

FILENAME=< absolute path >/monitoring/logs/nicklothian-$(date +%Y%m%d).txt
(time wget -q --delete-after http://nicklothian.com/blog/) 2>&1 | awk '/real/ {print strftime("%Y/%m/%d %H:%M:%S", systime()), $2}' >> $FILENAME
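
One wrinkle if you want to graph the response times the same way as the load: bash's `time` keyword reports durations like `0m1.234s` by default, which gnuplot won't read as a number. Here's a rough sketch of a filter that rewrites the last field of each log line as plain seconds. The function name `to_seconds` is made up for this example, and it assumes the default `time` output format (no custom `TIMEFORMAT`).

```shell
#!/bin/bash
# Hypothetical helper, not part of the scripts above.
# Reads log lines such as "2008/01/25 14:00:01 0m1.234s" on stdin and
# prints them with the last field converted to seconds ("1.234").
# Assumes bash's default `time` format of <minutes>m<seconds>s.
to_seconds() {
  awk '{
    split($NF, parts, "m")          # parts[1] = minutes, parts[2] = seconds plus trailing "s"
    sub(/s$/, "", parts[2])         # drop the trailing "s"
    $NF = sprintf("%.3f", parts[1] * 60 + parts[2])
    print
  }'
}

# Example use (paths are placeholders, as in the scripts above):
#   to_seconds < nicklothian-YYYYMMDD.txt > response.txt
```

Since the timestamp takes up the first two columns, the converted value ends up in column 3, so a plot script along the lines of loadplot.p would use `using 1:3` on the converted file.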


Recommendations for Australian Contractor service companies?

I’m currently employed as a contractor, which means I need to have my own company (I operate as a sole trader). For a variety of reasons this sucks, and I’m interested in any recommendations for companies which act as contractor shell companies (I’m not sure what the proper terminology is). What I’m looking for is a company which takes me on as an employee and receives my contract rate from the company I actually work for. The shell company then handles all the tax obligations, etc.

I’ve heard of a few companies in Australia which do this, but the only one I remembered to save is Entity Solutions. Anyone got any other recommendations (or experience with them)?


Run your own Jabber server and federate with GTalk

So it turns out that my hosting provider offers the ability to run your own Jabber server. Here’s what I did to get this working.
