Classifier4J

My current (non-work) project is a Bayesian classifier for Java, which I have named
Classifier4J (imaginative name, I know). It was inspired by Mike's post about using
something similar for JavaBlogs.

My first rough cut seems to work pretty well. With a corpus of 7842 words from non-Java sources
and 4752 words from Java sources (classified by me), it will tell me that:

The New Input/Output (NIO) libraries introduced in Java 2 Platform, Standard Edition (J2SE) 1.4 address this problem. NIO uses a buffer-oriented model. That is, NIO deals with data primarily in large blocks. This eliminates the overhead caused by the stream model and even makes use of OS-level facilities, where possible, to maximize throughput.

has a 99% chance of being “about Java”, and

As governments scramble to contain SARS, the World Health Organisation said it was extending the scope of its April 2 travel alert to include Beijing and the northern Chinese province of Shanxi together with Toronto, the epicentre of the SARS outbreak in Canada

has a 1% chance of being “about Java”.
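
For anyone wondering how a score like that can be produced, here is a minimal sketch of the general idea. It is not the Classifier4J code or its API: the class name TinyBayesSketch, its methods, the tokenizer, and the crude add-one smoothing are all assumptions made up for illustration. It counts how often each word appears in "Java" and "non-Java" training text, then combines the per-word evidence into a single probability, assuming equal priors and ignoring words it has never seen.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the Classifier4J implementation.
class TinyBayesSketch {
    private final Map<String, Integer> matchCounts = new HashMap<>();
    private final Map<String, Integer> nonMatchCounts = new HashMap<>();
    private int matchTotal = 0;     // total words seen in "about Java" text
    private int nonMatchTotal = 0;  // total words seen in other text

    // Record every word of a hand-classified piece of text.
    void train(String text, boolean aboutJava) {
        for (String word : tokenize(text)) {
            if (word.isEmpty()) {
                continue;
            }
            if (aboutJava) {
                matchCounts.merge(word, 1, Integer::sum);
                matchTotal++;
            } else {
                nonMatchCounts.merge(word, 1, Integer::sum);
                nonMatchTotal++;
            }
        }
    }

    // Estimate the probability that the text is "about Java".
    double probability(String text) {
        double logOdds = 0.0; // assumes equal priors for both classes
        for (String word : tokenize(text)) {
            int inMatch = matchCounts.getOrDefault(word, 0);
            int inNonMatch = nonMatchCounts.getOrDefault(word, 0);
            if (word.isEmpty() || (inMatch == 0 && inNonMatch == 0)) {
                continue; // words never seen in training carry no evidence
            }
            // Crude add-one smoothing so a single unseen count never gives 0.
            double pGivenMatch = (inMatch + 1.0) / (matchTotal + 2.0);
            double pGivenNonMatch = (inNonMatch + 1.0) / (nonMatchTotal + 2.0);
            logOdds += Math.log(pGivenMatch / pGivenNonMatch);
        }
        // Convert the summed log-odds back into a probability.
        return 1.0 / (1.0 + Math.exp(-logOdds));
    }

    // Lower-case the text and split on anything that is not a letter or digit.
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    public static void main(String[] args) {
        TinyBayesSketch classifier = new TinyBayesSketch();
        classifier.train("Java NIO uses buffers and streams for input and output", true);
        classifier.train("The World Health Organisation issued a travel alert for SARS", false);
        System.out.println(classifier.probability("NIO buffers in Java"));
        System.out.println(classifier.probability("SARS travel alert"));
    }
}
```

With a toy corpus like the one in main, the first score comes out above 0.5 and the second below it. Summing log-odds rather than multiplying raw probabilities is a design choice that avoids underflow on long texts.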

Doing something like a Java blog filter isn't as easy as a spam filter,
however, since the definition of “about Java” is somewhat hazy. For instance, if someone is writing about extreme programming,
it won't always show up as being relevant to Java (and maybe it shouldn't?).

I hope to release Classifier4J on SourceForge real soon now (don't bother looking there
at the moment; there is nothing there).
