All posts by Nick Lothian

The "Bloglines Index is worth lots" meme

John Battelle (and others) seem to think that Bloglines' index of 280 million blog posts is one of its most valuable assets.

While there's no doubt that that much data is worth something, I doubt any search company would find it very valuable. After all:

  1. They already have the data in their main index (in the form of the pages it came from) and (more importantly)
  2. Blogging (and related applications) is concerned with the most recent, up-to-date data possible.

Old blog posts do add depth, but nothing like the amount a real search engine's index can add.

On the other hand, the detailed data about a million users' interests, and information about who created what content… that is something worth paying for.

Classifier4J 0.6 released

I've just released Classifier4J 0.6. This new release includes a rather nice (I think) new classifier (the VectorClassifier) based on the vector space search algorithm. This particular classifier is fast, doesn't require training for non-matches, and is very suitable for sorting data into various categories.
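To give a feel for how this kind of classifier works, here is a minimal from-scratch sketch of vector-space classification. The class and method names below are my own illustration, not the actual Classifier4J API: the category and the document are each represented as term-frequency vectors, and the match score is the cosine of the angle between them.

import java.util.HashMap;
import java.util.Map;

// Illustration only: not the actual Classifier4J API.
public class VectorSpaceSketch {

    // Build a term-frequency vector from the words in some text.
    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> vector = new HashMap<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                vector.merge(term, 1, Integer::sum);
            }
        }
        return vector;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|); 0.0 if either vector is empty.
    static double cosineSimilarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : a.values()) normA += (double) v * v;
        for (int v : b.values()) normB += (double) v * v;
        return (normA == 0.0 || normB == 0.0)
                ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // The category is "trained" with positive examples only.
        Map<String, Integer> javaCategory =
                termVector("java virtual machine bytecode classpath servlet");
        double score = cosineSimilarity(javaCategory,
                termVector("the servlet loads a class from the classpath"));
        System.out.println("match score: " + score); // higher = closer match
    }
}

Because a category is built from positive examples alone, there is no non-match corpus to collect, which is where the speed and ease-of-training advantages come from.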

If you've looked at Classifier4J in the past and run into performance problems with the Bayesian algorithm I'd be interested in your feedback on this new algorithm.

Details are available from the Classifier4J website.

Re: More on "XML over HTTP"

Ted Neward didn't seem to like my All Web Services should be run over HTTPS post. He rightly points out that HTTPS isn't a complete solution for web service security.

I can't be as good a writer as Ted is, because he's totally missed the point of my original post. I'm not recommending using web services over HTTPS because of the transport security benefits (although those exist), but because it stops tampering by well-intentioned but badly designed intermediaries. Ted later points out that

An “intermediary” that wants to act on the payload isn't really an intermediary anymore, but a processing node in its own right that participates in a workflow chain. An intermediary certainly has the right and responsibility to affect the message headers, but not the payload itself. To say that SSL provides the “benefit” of preventing well-meaning intermediaries from doing this is to hide the ill-behaved nature of the intermediary itself, and doesn't properly address the problem.

This is true, of course. Unfortunately, in the real world clients have a habit of doing things like (as a totally imaginary example) installing badly behaved firewalls and then insisting that software that breaks because of them be “fixed”, even AFTER Check Point admits the faults in their hardware.

Ted also says:

I'm sorry you got sold the bill of goods that said that “XML over HTTP” was supposed to be easy–it's only easy so long as you did simple things with it.

Actually, that's the point. This was pretty much as simple as it could possibly be: .NET calling an RPC-style, Java-based web service. It really should have “just worked”. The only complexity in the system (.NET <-> Java type mappings) wasn't what caused it to break.

I'm convinced that this failure mode (people doing stupid things) breaks more enterprise systems than the things that are typically planned for (hardware failure etc.).

Disambiguation in Folksonomies

Tim Bray's post on Technorati Tags highlighted that it wasn't just me that thought disambiguation in folksonomies is important.

I spent a while thinking about it, and I came up with a half-baked scheme where users could qualify their tags with a parent tag, together with an elaborate justification for why this was okay to do.

Then I realised that the disambiguation data is already there in the form of additional tags on the same data item.

In Del.icio.us you can tag a page with more than one tag. If the user interface presented each one of these tags as
related concepts in a dynamically constructed hierarchy then it would make browsing simple.
Query-based access could be done by simple Boolean joins.

In Tim Bray's example, he wants to disambiguate Petroleum+Geology->Drills and Military+Training->Drills by using different
classification schemes. My current job is in the digital library area, so I, too, feel the pull of well-defined classification
schemes.

However, I think a better (in the Web 2.0, loosely bound sense) way to disambiguate would be to
have a service that returns data for a query like:

(Petroleum+Geology AND Drills)

This query would return all records that have been tagged with both
Petroleum+Geology AND Drills tags.

But what about Drills records that should be Petroleum+Geology records but haven't been tagged as such?
Well, now you get a choice. You can have highly specific data that you know is correct by ignoring those records, or else
you can offer lower-quality data but still try to remove data you know to be irrelevant:

Drills NOT (Military+Training AND Drills)
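To make those Boolean joins concrete, here is a small sketch of how both queries could be answered with plain set operations over tagged bookmarks. Everything here (class name, URLs, sample data) is invented for illustration; it isn't a real Del.icio.us API.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustration only: tag queries as plain set operations.
public class TagQuerySketch {

    // url -> tags on that bookmark (sample data invented for this sketch)
    static final Map<String, Set<String>> BOOKMARKS = Map.of(
            "http://example.com/oil-well-boring", Set.of("Petroleum+Geology", "Drills"),
            "http://example.com/parade-ground", Set.of("Military+Training", "Drills"),
            "http://example.com/hardware-store", Set.of("Drills"));

    // All urls tagged with every one of the given tags (the AND join).
    static Set<String> taggedWithAll(String... tags) {
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Set<String>> entry : BOOKMARKS.entrySet()) {
            if (entry.getValue().containsAll(Set.of(tags))) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // (Petroleum+Geology AND Drills): high precision, misses under-tagged records.
        System.out.println(taggedWithAll("Petroleum+Geology", "Drills"));

        // Drills NOT (Military+Training AND Drills): broader recall, but records
        // known to be about the other sense are removed.
        Set<String> drills = taggedWithAll("Drills");
        drills.removeAll(taggedWithAll("Military+Training", "Drills"));
        System.out.println(drills);
    }
}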

Browsing would use the same kind of queries. For instance, look at the Del.icio.us
programming tag. Currently the first item is tagged as
python, strings and programming. Obviously these concepts must be linked in
some way, so they should be presented as such in the user interface:

-->programming
programming > python
programming > strings

Now, if you go to programming > python you will get a dynamic page constructed using the query:

programming AND python

This is useful because it will remove all pages about python snakes from view.

The user interface would now change again:

programming
-->programming > python
programming > python > strings

Going to the programming > python > strings page will get you the data from the query:

programming AND python AND strings
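Here is a matching sketch of the browsing side (again, names and sample data are my own invention): the drill-down links under a node are just the other tags that co-occur on items matching every tag in the current path.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustration only: drill-down links derived from co-occurring tags.
public class TagBrowseSketch {

    // url -> tags (sample data invented for this sketch)
    static final Map<String, Set<String>> BOOKMARKS = Map.of(
            "http://example.com/py-strings", Set.of("programming", "python", "strings"),
            "http://example.com/py-web", Set.of("programming", "python", "web"),
            "http://example.com/snake-care", Set.of("python", "reptiles"));

    // The children of a node are the other tags found on items matching
    // every tag in the current path.
    static Set<String> childTags(Set<String> path) {
        Set<String> children = new HashSet<>();
        for (Set<String> tags : BOOKMARKS.values()) {
            if (tags.containsAll(path)) {
                children.addAll(tags);
            }
        }
        children.removeAll(path); // don't offer the current path again
        return children;
    }

    public static void main(String[] args) {
        // At programming > python the snake page drops out, and the remaining
        // co-occurring tags become the next drill-down links.
        System.out.println(childTags(Set.of("programming", "python")));
        // prints [strings, web] (in some order)
    }
}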

I think this idea would provide quite a useful advance over the current tagging mechanisms used in folksonomies.
Now I just need to build this thing…

UPDATED: As it happens, Del.icio.us already has support for querying on the intersection of tags using the tag1+tag2 syntax. del.icio.us/tag/programming+python is the python programming tag.

Technorati Tags

Technorati Tags are cool, but I wonder how they will disambiguate (is that a word?) them.

For instance, the Java tag shows pictures of coffee, has blogs about programming, and is only saved from some fairly heated political discussions by the fact that there are few Indonesian bloggers.

There are (clustering) algorithms that will solve this for you, but Technorati doesn't appear to be using them at the moment.
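As one hedged illustration of the kind of algorithm I mean, here is a simple single-pass "leader" clustering over term-frequency vectors that splits the posts under an ambiguous tag into rough senses. This is my own sketch with made-up sample data; I have no idea what Technorati would actually use, and a real system would want stop-word removal and a tuned threshold.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: single-pass leader clustering of posts under one tag.
public class TagClusterSketch {

    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) v.merge(t, 1, Integer::sum);
        }
        return v;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        for (int v : a.values()) na += (double) v * v;
        for (int v : b.values()) nb += (double) v * v;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    // Assign each post to the first cluster whose leader is similar enough,
    // otherwise start a new cluster around it.
    static List<List<String>> cluster(List<String> posts, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        List<Map<String, Integer>> leaders = new ArrayList<>();
        for (String post : posts) {
            Map<String, Integer> v = termVector(post);
            boolean placed = false;
            for (int i = 0; i < leaders.size() && !placed; i++) {
                if (cosine(v, leaders.get(i)) >= threshold) {
                    clusters.get(i).add(post);
                    placed = true;
                }
            }
            if (!placed) {
                leaders.add(v);
                clusters.add(new ArrayList<>(List.of(post)));
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> javaTagged = List.of(
                "the jvm compiles bytecode for the java virtual machine",
                "garbage collection tuning on the java virtual machine",
                "espresso beans roasted for a strong cup of coffee",
                "coffee beans and a good cup in the morning");
        // The programming posts and the coffee posts land in separate clusters.
        System.out.println(cluster(javaTagged, 0.3));
    }
}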

The Java Co-Processor

Azul Systems is set to release a “Network attached processor” for Java applications.

The key to network attached processing is Azul virtual machine proxy technology. This patent-pending technology, initially targeted at Java and J2EE platform-based applications, transparently redirects application workload to the compute pool. No changes are required to applications, or the existing infrastructure configuration. The Azul technology works with J2EE platform products including BEA® WebLogic® and IBM® WebSphere® application servers. Compute pool appliances are simply connected to the network and Azul software is installed on the application hosts. Suddenly every application has access to a virtually unlimited set of compute resources.

Each compute pool consists of two or more redundant compute appliances—devices designed solely to run massive amounts of virtual machine-based workloads. Each appliance has up to 384 coherent processor cores and 256 gigabytes of memory packed in a purpose-built design that delivers the benefits of symmetric multiprocessing with tremendous economic benefits. The massive SMP capacity of these appliances enables applications to dynamically scale, responding to varying workload and spikes without the pain of having to reconfigure or provision application tier servers. The targeted design provides small unit size, high rack density, low environmental costs, and simple administration.

Azul Systems

According to The Register, Azul has a custom multicore processor, which contains 24 cores.