Ted talks about how he expects RSS aggregators to start chewing CPU time.
I've done some experiments in this area, and Bayesian classification on 4000 items a day would currently be an interesting performance tuning exercise. In my experience it isn't CPU bound, though – it's I/O bound.
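For a sense of the per-item work involved, here's a minimal word-frequency Bayesian classifier sketch. It is not Classifier4J's actual API – the class and method names are illustrative only – but it shows why the CPU cost per item is modest: just tokenization and a few hash lookups per word.

```java
import java.util.*;

// Minimal naive Bayes text classifier sketch (illustrative, not Classifier4J).
public class TinyBayes {
    private final Map<String, Integer> matchCounts = new HashMap<>();
    private final Map<String, Integer> nonMatchCounts = new HashMap<>();
    private int matchTotal = 0, nonMatchTotal = 0;

    private static String[] tokens(String text) {
        return text.toLowerCase().split("\\W+");
    }

    public void teachMatch(String text) {
        for (String w : tokens(text)) {
            matchCounts.merge(w, 1, Integer::sum);
            matchTotal++;
        }
    }

    public void teachNonMatch(String text) {
        for (String w : tokens(text)) {
            nonMatchCounts.merge(w, 1, Integer::sum);
            nonMatchTotal++;
        }
    }

    // Returns P(match | text), assuming equal priors, with add-one smoothing.
    public double classify(String text) {
        double logMatch = 0, logNonMatch = 0;
        for (String w : tokens(text)) {
            logMatch    += Math.log((matchCounts.getOrDefault(w, 0) + 1.0) / (matchTotal + 2.0));
            logNonMatch += Math.log((nonMatchCounts.getOrDefault(w, 0) + 1.0) / (nonMatchTotal + 2.0));
        }
        // Convert log scores back to a probability.
        double m = Math.exp(logMatch), n = Math.exp(logNonMatch);
        return m / (m + n);
    }
}
```

Training and classifying both walk the item text once, so the expensive part at 4000 items a day is fetching and storing the items, not the arithmetic – which matches the I/O-bound behaviour above.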
I have a few ideas about techniques that might perform better than Bayesian classification anyway, but these techniques (as well as things like Latent Semantic Indexing) will be more CPU hungry.
Every time I think about trying to do LSI (or even Vector Space Search) on a couple of million items I start looking at the vector processor units on modern video cards and start drooling. Forget the CPU – offload that processing to the GPU. There will still be problems with disk and memory I/O, but the processing power is there.
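To make the Vector Space Search idea concrete, here's a CPU-side sketch (plain Java, no GPU code): documents become term-frequency vectors, and relatedness is the cosine of the angle between them. The dot products and norms are exactly the kind of dense, regular arithmetic that vector units are built for.

```java
import java.util.*;

// Vector Space Search sketch: term-frequency vectors plus cosine similarity.
public class VectorSpace {
    // Build a sparse term-frequency vector from raw text.
    public static Map<String, Integer> termVector(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            v.merge(w, 1, Integer::sum);
        }
        return v;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|), in [0, 1] for TF vectors.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double na = 0, nb = 0;
        for (int x : a.values()) na += (double) x * x;
        for (int x : b.values()) nb += (double) x * x;
        if (na == 0 || nb == 0) return 0;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

Scoring one query against a couple of million of these vectors is embarrassingly parallel, which is why pushing the inner loop onto a GPU is so tempting – the hard part, as noted above, is keeping the vectors fed through disk and memory.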
(A couple of times I've actually begun investigating this. It would be an excellent project to add GPU co-processing to Classifier4J and/or Lucene. JOGL may be the best way to do it.)
GPGPU.org is a decent site for more stuff about this.