All posts by Nick Lothian

Personalized search using user's local files

Projects on display during a Microsoft Research event yesterday included a method for personalizing Web search results based on the contents of the files on an individual user's computer hard drive.

The project reflects a broader push in the industry to improve the relevance of Web search results by tailoring them to the person doing the searching. But other programs, including Google's personalized search engine, have approached the challenge by having users create profiles to define their preferences.

http://seattlepi.nwsource.com/business/214288_techfest03.html
via Greg Linden

If I wanted to improve personalized search I'd use the user's email (GMail, Hotmail or Yahoo Mail) to discover their interests. I'd use any organisation the user has already done with their mail folders to cluster the search results, and I'd use their contact lists to highlight results from people they know.

Ask.com (to whom I'd give the prize for “second best search engine after Google”) don't have an email offering, so they couldn't compete in that way. They do, however, have Bloglines, which would allow significant personalization. Yahoo (with MyYahoo) and MSN (with something – is it called MyMSN?) both have integrated blog readers, which would also allow discovery of the user's interests.

If I were building a new search engine and I wanted to compete on the basis of personalized search, the first thing I'd do is make it easy to subscribe to search results. A search subscription is a great indicator of interest! Unfortunately there are some technical difficulties involved in identifying who is subscribed to which search (most client-side aggregators don't share cookies with browsers), but there are ways around this (unique URLs for each subscription, which only logged-in users can create).
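To make that concrete, here's a rough sketch of the unique-URL approach (all the names here – SearchSubscriptions, search.example.com – are made up for illustration): when a logged-in user subscribes to a search, mint a feed URL containing a random token and remember which user and query it belongs to. Every poll of that URL then tells you exactly who is interested in what, no cookies required.

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: ties a unique feed URL to a (user, query) pair so aggregator
// requests can be attributed to a user without relying on browser cookies.
public class SearchSubscriptions {

    private final Map<String, String[]> tokenToUserAndQuery =
            new ConcurrentHashMap<String, String[]>();

    // Called when a logged-in user clicks "subscribe to this search"
    public String createSubscription(String userId, String query) {
        String token = UUID.randomUUID().toString();
        tokenToUserAndQuery.put(token, new String[] { userId, query });
        return "http://search.example.com/feeds/" + token + ".rss";
    }

    // Called when an aggregator polls the feed URL; returns [userId, query] or null
    public String[] lookup(String token) {
        return tokenToUserAndQuery.get(token);
    }
}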

PubSub.com is currently the closest thing around to an ultimate database of users' search preferences, though.

Move the code to the data

Kevin Schofield points to some Jim Gray papers, including a new (Jan 2005) one that I hadn't read.

The paper discusses the challenges of working with multi-petabyte scientific datasets. There are some interesting approaches discussed (including Google's MapReduce, of course).

However, I often wonder why this little gem from Dan Creswell never got more attention. He has written extensions to JavaSpaces that allow the code (think queries & processing instructions) to migrate to the data, instead of the code downloading the data and processing it locally. In a prototype system:


I then ran a test with each version submitting ten objects and then removing them from the queue. The code-uploading version was nearly 7 times as fast and I'm certain that as concurrency increases, the performance gap will get greater still as contention increases

Code Downloading for Improved Performance

I'd love to have time to look at this more, but I imagine it's an approach that would work well.
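The general shape of the idea (this is just a hypothetical sketch, not Dan Creswell's actual API) is that instead of pulling the records across the network, you serialize a small task object, send it to the node that holds the data, and ship only the result back:

import java.io.Serializable;
import java.util.List;

// Hypothetical record type, for illustration only
class Record implements Serializable {
    double value;
}

// A unit of work that travels to the data. Serializable so it can be
// shipped to whichever node holds the records.
interface DataTask<T> extends Serializable {
    T run(List<Record> localRecords);
}

interface DataNode {
    // The node runs the task against its local records and returns only
    // the (usually much smaller) result.
    <T> T execute(DataTask<T> task);
}

// Example: compute a sum where the data lives, instead of downloading every record
class SumTask implements DataTask<Double> {
    public Double run(List<Record> localRecords) {
        double total = 0;
        for (Record r : localRecords) {
            total += r.value;
        }
        return total;
    }
}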

Stupid Google Operating System Meme

Why do people insist on thinking Google is writing an operating system? Surely people realise they already have one, which allows them to build innovative but useful things on top of it.

It seems to me that Google's strategy – as far as they have one – is to commoditize the entire software stack using web applications, and then do the best possible implementation. Then they make money by knowing more about their users than anyone else. Bearing this in mind, I just don't see where shipping an operating system to consumers makes sense. They do need to make sure that users have a decent browser on as many systems as possible, though – hence the money they are putting into Firefox.

So why would Google hire people like Mark Lucovsky? Well... he's a smart guy, so Google would want him, and they do have a fairly big cluster of computers that need operating systems, so he'd want to go there. Plus – and everyone seems to be ignoring this – he architected Hailstorm:

“I had these ideas that the way to really bootstrap Web services was to come up with a model where data was the central pivot point, and we came up with an architecture for connecting people and applications with information”

Perhaps Google might be just a little bit interested in doing something like that?

It's not like he's the first operating system person they've hired, anyway.

Using JDK 1.5 to optimize for modern CPUs

Modern CPUs have many features that increase multi-threaded performance (eg: hyperthreading on Pentium 4s and similar features on recent PowerPC chips). Over the next year the trend towards multi-core CPUs from AMD, Intel and Sun will accelerate multithreaded performance, while single-threaded performance will begin to level off.

Java has always had excellent threading support, but JDK 1.5 introduces a whole new set of concurrency libraries (java.util.concurrent) which make multi-threaded programming easier. In theory these libraries should mesh well with modern CPUs, since on a hyperthreading CPU each logical processor appears to the operating system as an extra CPU.

I've written a program to investigate multithreaded performance under Java. Does hyperthreading help multithreaded Java programs, or is the VM unable to use it properly?

My program is pretty simple – it generates random numbers and converts them to strings inside a loop that runs a set number of times. This loop is executed four times – twice by a single thread, and then twice by two threads simultaneously. We would hope to see the multithreaded version run quicker on a hyperthreaded CPU.
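For reference, the test is roughly equivalent to the sketch below (this is an approximation of what threadtest.jar does, not its actual source, and ITERATIONS is an arbitrary number):

import java.util.Random;

// Rough approximation of the benchmark: generate random numbers and convert
// them to strings in a loop, timed single-threaded and then with two threads.
public class ThreadTestSketch {

    // Arbitrary workload size - tune it so a run takes tens of seconds
    static final int ITERATIONS = 20000000;

    // The workload; the checksum is returned so the JIT can't optimise the loop away
    static int work() {
        Random random = new Random();
        int checksum = 0;
        for (int i = 0; i < ITERATIONS; i++) {
            checksum += Double.toString(random.nextDouble()).length();
        }
        return checksum;
    }

    // Run the workload once on each of threadCount threads and time the whole batch
    static long time(int threadCount) throws InterruptedException {
        long start = System.currentTimeMillis();
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    work();
                }
            });
            threads[i].start();
        }
        for (int i = 0; i < threadCount; i++) {
            threads[i].join();
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        // The loop runs four times in total: twice on one thread,
        // then twice across two threads running simultaneously
        System.out.println("Single Thread Time = " + (time(1) + time(1)) + " ms.");
        System.out.println("Dual Thread Time = " + time(2) + " ms.");
    }
}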

Results:

2984.2 MHz Intel Pentium 4 3 GHz (Hyperthreading On):

Java Environment
    Sun Microsystems Inc. Java HotSpot(TM) Client VM 1.5.0_01-b08
Native Environment
    Windows XP 5.1 on x86
    2 CPU(s) detected
Please wait. Running Tests..
Single Threaded Test completed.
Dual Threaded Test completed.

Results
----------------------------------
Single Thread Time = 49241 ms.
Dual Thread Time = 35776 ms.

2984.2 MHz Intel Pentium 4 3 GHz (Hyperthreading Off):

Results
----------------------------------
Single Thread Time = 46101 ms.
Dual Thread Time = 50646 ms.

1668.8 MHz AMD Athlon(tm) XP 2000+:

Java Environment
    Sun Microsystems Inc. Java HotSpot(TM) Client VM 1.5.0_01-b08
Native Environment
    Windows 2000 5.0 on x86
    1 CPU(s) detected
Please wait. Running Tests..
Single Threaded Test completed.
Dual Threaded Test completed.

Results
----------------------------------
Single Thread Time = 54844 ms.
Dual Thread Time = 66937 ms. 

As you can see, hyperthreading really does work (the dual-threaded run was 27% quicker). However, we shouldn't just fire off threads everywhere possible, because on conventional CPUs multithreaded code will run significantly slower than single-threaded code (9.8% slower on the P4 with hyperthreading off, and 22% slower on the Athlon).

Code similar to the following may be a suitable strategy:


// osMBean is the java.lang.management.OperatingSystemMXBean, obtained from
// ManagementFactory.getOperatingSystemMXBean(); hyperthreading shows up as extra CPUs
int numThreads = osMBean.getAvailableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(numThreads);
while (tasksNotExecuting()) {
   // someLongRunningTask() is a placeholder; it should return a Runnable or Callable
   executor.submit(someLongRunningTask());
}

I'd be interested in other people's results with the same tests. The more the better, but I'm especially interested in some more exotic environments – PowerPC5-based servers, multi-CPU machines and G5 Macs.

My program is available as a jar (run it using java -jar threadtest.jar under Java 5). I use CPUChk to get the CPU id. Please record your results in the comments, along with the CPU used to generate them and (if possible) the hyperthreading status (it can usually be enabled and disabled in the BIOS).

Please send 400 Bad Request and don't drop connection

Christopher Baus suggests that HTTP servers should not send 400 Bad Request but should drop connections instead.

As an HTTP client developer, let me beg people not to do this. While bad requests are fairly rare, they still occur occasionally, and each one is a nightmare to debug.

For instance, Dave Johnson ran into a fairly typical problem last week. A particular server company had configured their software to block any request from software with the word “Java” in the user agent. This took some effort to debug, but it would have been just about impossible if the connections were being dropped.

Unfortunately, software that intercepts, processes and sometimes modifies requests and responses like this is becoming increasingly common. While it seems like a good idea, and appears to work okay when you browse the website with a common web browser, it often breaks things in non-obvious ways.

The deeper I get into the internet software stack, the more amazed I am that anything actually works at all. You'd think that TCP/IP->HTTP->XML/HTML is so common that all the bugs would have been ironed out by now – but that isn't true. It is full of edge cases and unexplored scenarios where things just break – or at least no one knows the correct way to do things.

Anyway – please don't go and create ad hoc modifications to the HTTP spec like this suggestion (although it is fine to modify the error message so it doesn't give too much information away).
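If you really do want to reject suspect requests, a small sketch of the polite behaviour – here as a hypothetical servlet filter, with BadRequestFilter and isAcceptable as made-up names – is to send the 400 with a terse message rather than hang up:

import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch: reject unacceptable requests with a 400 instead of dropping the connection
public class BadRequestFilter implements Filter {

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        if (!isAcceptable(request)) {
            // A 400 gives the client developer somewhere to start debugging;
            // keep the message terse so it doesn't leak server internals.
            response.sendError(HttpServletResponse.SC_BAD_REQUEST, "Bad Request");
            return;
        }
        chain.doFilter(req, res);
    }

    // Placeholder for whatever validation the operator wants to apply
    private boolean isAcceptable(HttpServletRequest request) {
        return request.getHeader("Host") != null;
    }

    public void init(FilterConfig config) {}

    public void destroy() {}
}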

Re: RSS Aggregators are the killer app

Ted talks about how he expects RSS aggregators to start chewing CPU time.

I've done some experiments in this area, and Bayesian classification on 4000 items a day would currently be an interesting performance tuning exercise. In my experience it isn't CPU bound, though – it's I/O bound.

I have a few ideas about techniques that might perform better than Bayesian classification anyway, but those techniques (as well as things like Latent Semantic Indexing) will be even more CPU hungry.

Every time I think about trying to do LSI (or even Vector Space Search) on a couple of million items I start looking at the vector processor units on modern video cards and start drooling. Forget the CPU – offload that processing to the GPU. There will still be problems with disk and memory I/O, but the processing power is there.

(A couple of times I've actually begun investigating this. It would be an excellent project to add GPU co-processing to Classifier4J and/or Lucene. JOGL may be the best way to do it.)
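For context, the inner loop you'd be offloading is tiny – essentially dot products over document vectors. Here's a plain-Java sketch of the vector space scoring (using dense vectors for simplicity; real term vectors would be sparse):

// Cosine similarity between a query vector and each document vector - one
// independent calculation per document, which is exactly what GPUs are good at.
public class VectorSpaceSketch {

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Score every document against the query
    static double[] scoreAll(double[] query, double[][] documents) {
        double[] scores = new double[documents.length];
        for (int i = 0; i < documents.length; i++) {
            scores[i] = cosine(query, documents[i]);
        }
        return scores;
    }
}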

GPGPU.org is a decent site for more stuff about this.