Stupid Google Operating System Meme

Why do people insist on thinking Google is writing an operating system? Surely people realise
Google already has one, which allows it to build innovative but
useful things on top of it.

It seems to me that Google's strategy – as far as they have one – is to commoditize the entire software stack
using web applications, and then do the best possible implementation. Then they make money by knowing more about
their users than anyone else. Bearing this in mind I just don't see where shipping an operating system to consumers
makes sense. They do need to make sure that users have a decent browser
on as many systems as possible, though – hence the money they are putting into Firefox.

So why would Google hire people like Mark Lucovsky? Well,
he's a smart guy, so Google would want him, and they do have a fairly big cluster of computers that need operating systems, so
he'd want to go there. Plus – and everyone seems to be ignoring this –
he architected Hailstorm:

“I had these ideas that the way to really bootstrap Web services was to
come up with a model where data was the central pivot point, and we came
up with an architecture for connecting people and applications with information”

Perhaps Google might be just a little bit interested in doing something like that?

It's not like he's the first operating system person they've hired, anyway.

Using JDK 1.5 to optimize for modern CPUs

Modern CPUs have many features that increase multithreaded performance
(e.g. hyperthreading on Pentium 4s and similar features on recent PowerPC chips).
Over the next year the trend towards multi-core CPUs from AMD, Intel and Sun will accelerate multithreaded performance, while single-threaded performance will begin to level off.

Java has always had excellent threading support, but JDK 1.5 introduces a whole new set of concurrency libraries which make multithreaded programming easier. In theory these libraries should mesh well with modern CPUs, since (on a hyperthreading CPU) each hardware thread appears to the operating system as an extra CPU.

I've written a program to investigate multithreaded performance under Java. Does hyperthreading help multithreaded Java programs, or is the VM unable to use it properly?

My program is pretty simple – it generates random numbers and converts them to strings inside a loop a set number of times. This loop is executed four times – twice by a single thread, and then twice by two threads simultaneously. We would hope to see the multithreaded version run quicker on a hyperthreaded CPU.
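A rough sketch of the kind of test described above might look like the following. It is not the original program: the class name, method names, seeds and iteration count are all made up for illustration, and it does no JIT warm-up, so treat any timings it prints as approximate.

```java
import java.util.Random;

// Hypothetical stand-in for the test described above, not the original source.
class ThreadTestSketch {

    // Generate `iterations` random ints, convert each to a String, and
    // return the total number of characters produced.
    static long work(int iterations, long seed) {
        Random random = new Random(seed);
        long totalLength = 0;
        for (int i = 0; i < iterations; i++) {
            totalLength += Integer.toString(random.nextInt()).length();
        }
        return totalLength;
    }

    // Run the loop twice, back to back, on the calling thread.
    static long timeSingleThreaded(int iterations) {
        long start = System.currentTimeMillis();
        work(iterations, 1L);
        work(iterations, 2L);
        return System.currentTimeMillis() - start;
    }

    // Run the loop twice at the same time, once on each of two threads.
    static long timeDualThreaded(final int iterations) {
        Thread first = new Thread(new Runnable() {
            public void run() { work(iterations, 1L); }
        });
        Thread second = new Thread(new Runnable() {
            public void run() { work(iterations, 2L); }
        });
        long start = System.currentTimeMillis();
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        int iterations = 5000000; // arbitrary; large enough to take measurable time
        System.out.println("Single Thread Time = " + timeSingleThreaded(iterations) + " ms.");
        System.out.println("Dual Thread Time = " + timeDualThreaded(iterations) + " ms.");
    }
}
```

Both timings cover the same total amount of work (two passes through the loop), so the two numbers can be compared directly.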

Results:

2984.2 MHz Intel Pentium 4 3 GHz (Hyperthreading On):

Java Environment
    Sun Microsystems Inc. Java HotSpot(TM) Client VM 1.5.0_01-b08
Native Environment
    Windows XP 5.1 on x86
    2 CPU(s) detected
Please wait. Running Tests..
Single Threaded Test completed.
Dual Threaded Test completed.

Results
----------------------------------
Single Thread Time = 49241 ms.
Dual Thread Time = 35776 ms.

2984.2 MHz Intel Pentium 4 3 GHz (Hyperthreading Off):

Results
----------------------------------
Single Thread Time = 46101 ms.
Dual Thread Time = 50646 ms.

1668.8 MHz AMD Athlon(tm) XP 2000+:

Java Environment
    Sun Microsystems Inc. Java HotSpot(TM) Client VM 1.5.0_01-b08
Native Environment
    Windows 2000 5.0 on x86
    1 CPU(s) detected
Please wait. Running Tests..
Single Threaded Test completed.
Dual Threaded Test completed.

Results
----------------------------------
Single Thread Time = 54844 ms.
Dual Thread Time = 66937 ms. 

As you can see, hyperthreading really does work: with it on, the dual-threaded run was 27% quicker than the single-threaded run. However, we shouldn't just fire off threads everywhere possible, because multithreaded code runs significantly slower than single-threaded code on conventional CPUs (9.8% slower on the P4 with hyperthreading off and 22% slower on the Athlon).

Code similar to the following may be a suitable strategy:


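import java.lang.management.*;   // ManagementFactory, OperatingSystemMXBean
import java.util.concurrent.*;   // ExecutorService, Executors

// One worker per CPU the JVM can see (a hyperthreaded P4 reports two)
OperatingSystemMXBean osMBean = ManagementFactory.getOperatingSystemMXBean();
int numThreads = osMBean.getAvailableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(numThreads);
while (tasksNotExecuting()) {                // placeholder: is there work left to queue?
   executor.submit(someLongRunningTask());   // placeholder: returns a Runnable or Callable
}
executor.shutdown();                         // finish queued tasks, then stop the workers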
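For something that compiles on its own, here is a minimal sketch of the same idea. The class name, sumTo and runAll are hypothetical names invented for the example, and the summing task is only a stand-in for real long-running work. It uses Runtime.availableProcessors(), which the JDK documents as equivalent to the MXBean call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PoolSizingSketch {

    // Hypothetical stand-in for a long-running task: sums 1..n.
    static Callable<Long> sumTo(final int n) {
        return new Callable<Long>() {
            public Long call() {
                long sum = 0;
                for (int i = 1; i <= n; i++) {
                    sum += i;
                }
                return sum;
            }
        };
    }

    // Run every task on a pool with one thread per available processor
    // and return the results in the same order as the tasks.
    static List<Long> runAll(List<Callable<Long>> tasks) {
        int numThreads = Runtime.getRuntime().availableProcessors();
        ExecutorService executor = Executors.newFixedThreadPool(numThreads);
        try {
            List<Long> results = new ArrayList<Long>();
            for (Future<Long> future : executor.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e);
        } finally {
            executor.shutdown(); // always release the worker threads
        }
    }

    public static void main(String[] args) {
        List<Callable<Long>> tasks = Arrays.asList(sumTo(10), sumTo(100));
        System.out.println(runAll(tasks)); // prints [55, 5050]
    }
}
```

On a single-CPU machine the pool ends up with one thread, so the tasks run one after another and avoid the slowdown measured above.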

I'd be interested in other people's results with the same tests. The more the better, but I'm especially interested in some more exotic environments – POWER5-based servers, multi-CPU machines and G5 Macs.

My program is available as a jar (run using java -jar threadtest.jar under Java 5). I use CPUChk to get the CPU id. Please record your results in the comments, with the CPU of the machine used to generate them and (if possible) the hyperthreading status (it can usually be enabled and disabled in the BIOS).

Please send 400 Bad Request and don't drop connection

Christopher Baus suggests that HTTP Servers should not send 400 Bad Request but should drop connections instead.

As an HTTP client developer, let me beg people not to do this. While bad requests are fairly rare, they do still occur occasionally, and each one is a nightmare to debug.

For instance, Dave Johnson ran into a fairly typical problem last week. A particular server company had configured their software so it blocked any requests from software that had the word “Java” in the user agent. This took some effort to debug, but would have been just about impossible if the connections were being dropped.

Unfortunately, software that intercepts, processes and sometimes modifies requests and responses like this is becoming increasingly common. While such software seems to be a good idea, and appears to work okay when you browse the website with a common web browser, it often breaks things in non-obvious ways.

The deeper I get into the internet software stack the more amazed I am that anything actually works at all. You'd think that TCP/IP->HTTP->XML/HTML is so common that all the bugs would be ironed out by now – but that isn't true. It is full of edge cases and unexplored scenarios where things just break – or at least no one knows the correct way to do things.

Anyway – please don't go and create ad hoc modifications to the HTTP spec like this suggestion (although it is fine to modify the error message so it doesn't give too much information away).
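To make that concrete, here is a minimal sketch of a server-side check that answers a malformed request line with a terse 400 instead of dropping the connection. Everything in it (the class, the respondTo helper, the deliberately crude request-line pattern) is hypothetical and only for illustration. A real server would parse the whole request, not just the first line.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class BadRequestSketch {

    // Very rough shape of an HTTP/1.x request line: METHOD SP target SP HTTP/x.y
    private static final Pattern REQUEST_LINE =
            Pattern.compile("[A-Z]+ \\S+ HTTP/\\d\\.\\d");

    // Build a complete plain-text response with a correct Content-Length.
    static String response(String statusLine, String body) {
        int length = body.getBytes(StandardCharsets.US_ASCII).length;
        return statusLine + "\r\n"
                + "Content-Type: text/plain\r\n"
                + "Content-Length: " + length + "\r\n"
                + "Connection: close\r\n"
                + "\r\n"
                + body;
    }

    // Answer a malformed request line with 400 and a body that says what
    // happened without revealing anything about the server's internals.
    static String respondTo(String requestLine) {
        if (requestLine == null || !REQUEST_LINE.matcher(requestLine).matches()) {
            return response("HTTP/1.1 400 Bad Request", "Bad Request\n");
        }
        return response("HTTP/1.1 200 OK", "OK\n");
    }

    public static void main(String[] args) {
        System.out.print(respondTo("this is not http"));
    }
}
```

The client still gets a status line it can log and act on, which is the whole point: a dropped connection looks the same as a network fault, while a 400 says the request itself was the problem.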

Re: RSS Aggregators are the killer app

Ted talks about how he expects RSS aggregators to start chewing CPU time.

I've done some experiments in this area, and Bayesian classification on 4000 items a day would currently be an interesting performance tuning exercise. In my experience it isn't CPU bound, though – it's I/O bound.

I have a few ideas about things that might perform better than Bayesian classification anyway, but these techniques (as well as things like Latent Semantic Indexing) will be more CPU hungry.

Every time I think about trying to do LSI (or even Vector Space Search) on a couple of million items I start looking at the vector processor units on modern video cards and start drooling. Forget the CPU – offload that processing to the GPU. There will still be problems with disk and memory I/O, but the processing power is there.

(A couple of times I've actually begun investigating this. It would be an excellent project to add GPU co-processing to Classifier4J and/or Lucene. JOGL may be the best way to do it.)

GPGPU.org is a decent site for more stuff about this.

The "Bloglines Index is worth lots" meme

John Battelle (and others) seem to think that Bloglines' index of 280 million blog posts is one of its most valuable assets.

While there's no doubt that that much data is worth something, I doubt any search company would find it very valuable. After all:

  1. They already have the data in their main index (in the form of the pages it came from) and (more importantly)
  2. Blogging (and related applications) is concerned with the most recent, up to date data possible.

Old blog posts do add depth, but nothing like the amount a real search engine's index can add.

On the other hand, the detailed data about a million users' interests, and information about who created what content... that is something worth paying for.

Classifier4J 0.6 released

I've just released Classifier4J 0.6. This new release includes a rather nice (I think) new classifier (the VectorClassifier) based on the vector space search algorithm. This particular classifier is fast, doesn't require training for non-matches and is very suitable for sorting data into various categories.
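For anyone unfamiliar with the vector space approach, the general idea can be sketched as below. This is not Classifier4J's code or API: the class and method names are invented for the example, it uses raw term counts with cosine similarity, and it assumes a category is described by a single sample text.

```java
import java.util.HashMap;
import java.util.Map;

class VectorSpaceSketch {

    // Turn text into a term vector: lower-cased word -> number of occurrences.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.length() == 0) {
                continue; // split() can yield an empty leading token
            }
            Integer old = counts.get(word);
            counts.put(word, old == null ? 1 : old + 1);
        }
        return counts;
    }

    private static double sumOfSquares(Map<String, Integer> vector) {
        double sum = 0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return sum;
    }

    // Cosine of the angle between two term vectors: 0 when they share no
    // terms, close to 1 when they point the same way.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double norm = Math.sqrt(sumOfSquares(a)) * Math.sqrt(sumOfSquares(b));
        return norm == 0 ? 0 : dot / norm;
    }

    // Return the category whose sample text is most similar to `text`,
    // or null if no category shares any term with it.
    static String bestCategory(Map<String, String> sampleTextByCategory, String text) {
        Map<String, Integer> query = termFrequencies(text);
        String best = null;
        double bestScore = 0;
        for (Map.Entry<String, String> e : sampleTextByCategory.entrySet()) {
            double score = cosine(termFrequencies(e.getValue()), query);
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Each category only needs examples of what belongs in it, and anything that scores low against every category is simply left unmatched. That is why this style of classifier needs no training data for non-matches.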

If you've looked at Classifier4J in the past and run into performance problems with the Bayesian algorithm, I'd be interested in your feedback on this new algorithm.

Details are available from the Classifier4J website