Archive for February, 2005

Free IntelliJ licences for all Apache committers

As part of their Open Source Licensing initiative, JetBrains have donated IntelliJ licences to all Apache committers. Details at http://www.jetbrains.com/idea/opensource/asf.html.

Thanks to JetBrains and Henri Yandell for setting this up.

Speaking of information visualization…

Network analysis of the Flickr population, based on data collected on January 8th, 2005, and some additional analyses: FlickrLand

Re: RSS Aggregators are the killer app

Ted talks about how he expects RSS aggregators to start chewing CPU time.

I've done some experiments in this area, and Bayesian classification on 4000 items a day would currently be an interesting performance tuning exercise. In my experience it isn't CPU bound, though - it's I/O bound.

I have a few ideas about things that might perform better than Bayesian classification anyway, but these techniques (as well as things like Latent Semantic Indexing) will be more CPU-hungry.

Every time I think about trying to do LSI (or even Vector Space Search) on a couple of million items, I start looking at the vector processor units on modern video cards and start drooling. Forget the CPU: offload that processing to the GPU. There will still be problems with disk and memory I/O, but the processing power is there.

(A couple of times I've actually begun investigating this. It would be an excellent project to add GPU co-processing to Classifier4J and/or Lucene. JOGL may be the best way to do it.)

GPGPU.org is a decent site for more stuff about this.

Automatic Debugging

Ask Igor - an automated debugger.

Ask Igor

I haven't tried it, but it looks pretty impressive. It is based on the techniques outlined in the paper "Isolating Cause-Effect Chains from Computer Programs", which won the ACM SIGSOFT Distinguished Paper Award.

More details from the Delta Debugging page.

Java memory usage on Linux

Well that's just embarrassing.

Every time I think that Java is making some gains on the client side, something like that comes up and I realize that it is just that computers are getting faster. My rule with client-side Java applications is still to assume they will suck and to get the very occasional pleasant surprise.

The "Bloglines Index is worth lots" meme

John Battelle (and others) seem to think that Bloglines' index of 280 million blog posts is one of its most valuable assets.

While there's no doubt that that much data is worth something, I doubt any search company would find it very valuable. After all:

  1. They already have the data in their main index (in the form of the pages it came from) and (more importantly)
  2. Blogging (and related applications) is concerned with the most recent, up to date data possible.

Old blog posts do add depth, but nothing like the amount a real search engine's index can add.

On the other hand, the detailed data about a million users' interests, and information about who created what content... that is something worth paying for.

Classifier4J 0.6 released

I've just released Classifier4J 0.6. This new release includes a rather nice (I think) new classifier (the VectorClassifier) based on the vector space search algorithm. This particular classifier is fast, doesn't require training for non-matches and is very suitable for sorting data into various categories.
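For readers who haven't met vector space search before, here is a minimal sketch of the general idea: each category is represented by a term-frequency vector built from example text, and new text goes to the category whose vector has the highest cosine similarity. This is not Classifier4J's implementation or API. The class and method names (VectorSpaceSketch, termFrequencies, cosineSimilarity, bestCategory) are made up for illustration, and the tokenizer is deliberately naive.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of vector space classification.
// None of these names come from Classifier4J.
public class VectorSpaceSketch {

    // Build a term-frequency vector: lower-case the text and split on
    // anything that is not a letter or digit (a deliberately naive tokenizer).
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a term-frequency vector.
    private static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine of the angle between two term vectors: 1.0 means the same
    // direction, 0.0 means no terms in common (or an empty vector).
    public static double cosineSimilarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double lengths = norm(a) * norm(b);
        return lengths == 0.0 ? 0.0 : dot / lengths;
    }

    // Return the category whose example text is most similar to the input,
    // or null when the input shares no terms with any category.
    public static String bestCategory(Map<String, String> categoryExamples, String text) {
        Map<String, Integer> query = termFrequencies(text);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, String> entry : categoryExamples.entrySet()) {
            double score = cosineSimilarity(termFrequencies(entry.getValue()), query);
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> categories = new HashMap<>();
        categories.put("java", "java jvm classloader garbage collector");
        categories.put("search", "lucene index query search ranking");
        System.out.println(bestCategory(categories, "tuning a lucene search index"));
    }
}
```

In this sketch each category only needs examples of what does belong to it. Text that shares no terms with any category simply scores zero and gets no category, so nothing has to be trained on non-matches.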

If you've looked at Classifier4J in the past and run into performance problems with the Bayesian algorithm I'd be interested in your feedback on this new algorithm.

Details are available from the Classifier4J website.

Re: More on "XML over HTTP"

Ted Neward didn't seem to like my All Web Services should be run over HTTPS post. He rightly points out that HTTPS isn't a complete solution for web service security.

I can't be as good a writer as Ted is, because he's totally missed the point of my original post. I'm not recommending using web services over HTTPS because of the transport security benefits (although these exist) but because it stops tampering by well-intentioned but badly designed intermediaries. Ted later points out that:

An “intermediary” that wants to act on the payload isn't really an intermediary anymore, but a processing node in its own right that participates in a workflow chain. An intermediary certainly has the right and responsibility to affect the message headers, but not the payload itself. To say that SSL provides the “benefit” of preventing well-meaning intermediaries from doing this is to hide the ill-behaved nature of the intermediary itself, and doesn't properly address the problem.

This is true, of course. Unfortunately, in the real world clients have a habit of doing things like (as a totally imaginary example) installing badly behaved firewalls and then insisting that software that breaks because of them be “fixed” even AFTER Checkpoint admits the faults in their hardware.

Ted also says:

I'm sorry you got sold the bill of goods that said that “XML over HTTP” was supposed to be easy–it's only easy so long as you did simple things with it.

Actually, that's the point. This was pretty much as simple as it could possibly be: .NET calling an RPC-style, Java-based web service. It really should have “just worked”. The only complexity in the system (.NET <-> Java type mappings) wasn't what caused it to break.

I'm convinced that this failure mode (people doing stupid things) breaks more enterprise systems than things that are typically planned for (hardware failure etc).

Java Syndication Library Benchmarks

I've published a few benchmarks for the Rome and Feedparser syndication libraries.

There are no real surprises in the numbers - both libraries give fairly comparable performance. Rome is quicker for small files and Feedparser is quicker for large files.
