I'm sitting here watching Classifier4J run through a sample of the saved blogs from Javablogs.
Given the extremely limited set of data I trained it on, I think it's doing reasonably well. For instance, this post
from Ted Neward is (accurately) classified as about Java (rated 0.99), while this (for instance) is not (rated 0.01).
It's not perfect – it missed this for instance (rated 0.01),
but all in all I'm pretty happy with how accurately it performed.
I need to work on performance, though – I'm keeping all my word ratings in a MySQL database, and doing a DB lookup for every word is killing it.
(later…) Maybe I'll write a word rating datasource backed by a weak hash map that will cache the most common words.
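
Here is a rough sketch of what that caching datasource might look like. To be clear about the assumptions: `WordRatingSource` and `CachingWordRatingSource` are names I made up for illustration, not Classifier4J's real interfaces, and the slow database lookup is whatever you pass in as the delegate. I've also swapped the weak hash map for a small LRU cache built on `LinkedHashMap`, since a `WeakHashMap` drops entries based on whether its keys are still referenced, not on how often a word is looked up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface for illustration; not Classifier4J's actual API.
interface WordRatingSource {
    double ratingFor(String word);
}

// Wraps a slow source (e.g. one doing a DB lookup per word) and keeps
// the most recently used ratings in memory.
class CachingWordRatingSource implements WordRatingSource {
    private final WordRatingSource delegate;
    private final Map<String, Double> cache;

    CachingWordRatingSource(WordRatingSource delegate, final int maxEntries) {
        this.delegate = delegate;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, Double>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Double> eldest) {
                // Evict the least recently used word once the cache is over capacity.
                return size() > maxEntries;
            }
        };
    }

    @Override
    public synchronized double ratingFor(String word) {
        Double cached = cache.get(word);
        if (cached == null) {
            // Cache miss: fall back to the slow source and remember the answer.
            cached = delegate.ratingFor(word);
            cache.put(word, cached);
        }
        return cached;
    }
}
```

Common words get looked up constantly, so they should stay near the "recently used" end of the map and rarely hit the database, while rare words fall out once the cache fills up. The right value for `maxEntries` is a guess until it's measured against real data.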