All posts by Nick Lothian

Classifier4J

My current (non-work) project is a Bayesian classifier for Java which I have named
Classifier4J (imaginative name, I know). It was inspired by Mike's
post about using something
similar for JavaBlogs.

My first rough cut seems to work pretty well. With a corpus of 7842 words from non-java sources
and 4752 from java sources (classified by me), it will tell me that:

The New Input/Output (NIO) libraries introduced in Java 2 Platform, Standard Edition (J2SE) 1.4 address this problem. NIO uses a buffer-oriented model. That is, NIO deals with data primarily in large blocks. This eliminates the overhead caused by the stream model and even makes use of OS-level facilities, where possible, to maximize throughput.

has a 99% chance of being “about Java”, and

As governments scramble to contain SARS, the World Health Organisation said it was extending the scope of its April 2 travel alert to include Beijing and the northern Chinese province of Shanxi together with Toronto, the epicentre of the SARS outbreak in Canada

has a 1% chance of being “about Java”.
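Under the hood, a Bayesian classifier works by combining the probabilities of the individual words in the input. Here is a minimal sketch of that combination step – this is just the standard naive Bayes combination formula, not Classifier4J's actual code:

```java
public class BayesCombiner {

    /**
     * Combines per-word probabilities into an overall score using the
     * standard naive Bayes combination:
     * P = (p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn))
     */
    public static double combine(double[] wordProbabilities) {
        double product = 1.0;
        double inverseProduct = 1.0;
        for (double p : wordProbabilities) {
            product *= p;
            inverseProduct *= (1.0 - p);
        }
        return product / (product + inverseProduct);
    }
}
```

A handful of strongly Java-flavoured words (say 0.9, 0.9, 0.6) combine to around 0.99, while a couple of off-topic words (0.1, 0.2) combine to under 0.03 – which matches the sort of 99%/1% results above.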

Doing something like a Java blog filter isn't as easy as a spam filter
however, since the definition for “about Java” is somewhat hazy. For instance, if someone is writing about extreme programming
it won't always show up as being relevant to Java (and maybe it shouldn't?).

I hope to release Classifier4J on SourceForge real soon now… (don't bother looking there
at the moment; there is nothing there yet).

commons-logging

It always surprises me how many logging packages are available for Java. There is the ubiquitous Log4J, JDK 1.4's logging API, the Avalon LogKit, IBM alphaWorks' LoggingToolkit4J, and I'm sure there are many more I've missed.

For a programmer writing libraries, this gets pretty difficult. You want to provide logging, but you also want your logging to integrate into the application's logging system. The best way of doing this is to use a log wrapper like Jakarta Commons Logging. If you are like me, this is one of those wonderful little projects that you see the .jars for but never have time to investigate. However, at work a while ago I needed to integrate Velocity into an application, and so I needed to figure out its logging.

Commons Logging really does make it very, very easy to get configurable logging for your application. It can be as simple as this:

	import org.apache.commons.logging.Log;
	import org.apache.commons.logging.LogFactory;

	public class Blahh {
		private Log log = LogFactory.getLog(this.getClass());

		public void someMethod() {
			log.debug("Some method called");
		}
	}

Even I (who admit to using System.out.println() on occasion) have to admit that is pretty damn nice.
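Part of what makes this nice is that the actual logging implementation is chosen at deployment time rather than in code. If auto-discovery picks the wrong one, a commons-logging.properties file on the classpath can pin the implementation explicitly. (Note: the implementation class name has varied between commons-logging releases, so check the docs for the version you have.)

```properties
# commons-logging.properties (anywhere on the classpath)
# Pin the Log4J implementation instead of relying on auto-discovery.
org.apache.commons.logging.Log=org.apache.commons.logging.impl.Log4JLogger
```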

Someday, I should investigate Monolog which does a similar job.

commons-cli

Jakarta's commons-cli package is designed to make working with command line arguments from Java easy. I'm using it for some of my Classifier4J demos. I have to admit that while the documentation is pretty good, I had some trouble getting going with it. The main problem I found was that the documentation uses the PosixParser to parse the command line, but I couldn't figure out how to pass the arguments to the program to make it accept them. Once I switched to using the GnuParser or the BasicParser it all worked as expected.
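For what it's worth, this is roughly the shape of what ended up working for me. Treat it as a sketch against the commons-cli 1.x API; the option names are just for illustration:

```java
import org.apache.commons.cli.*;

public class CliDemo {

    private static final Options OPTIONS = new Options()
            .addOption("f", "file", true, "file to classify")
            .addOption("h", "help", false, "print this message");

    /** Returns the value of -f/--file, or null if it was not supplied. */
    public static String parseFile(String[] args) throws ParseException {
        // GnuParser worked where PosixParser didn't for me;
        // BasicParser is the simplest of the three.
        CommandLineParser parser = new GnuParser();
        CommandLine cmd = parser.parse(OPTIONS, args);
        return cmd.getOptionValue("f");
    }

    public static void main(String[] args) throws ParseException {
        String file = parseFile(args);
        if (file == null) {
            new HelpFormatter().printHelp("classify", OPTIONS);
        } else {
            System.out.println("Classifying " + file);
        }
    }
}
```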

That brings up an interesting point – documentation is difficult to write because the person writing it often makes (possibly incorrect) assumptions about what the reader needs to know before the documentation will help them.

Axion Take 2

After my comments about Axion, I got an email from
Rod Waldhoff, one of the Axion
developers asking me to join the axion-dev
list to discuss my issues.

I posted a summary of what I was seeing,
and some speculation about possible causes. Within 24 hours a fix
was in CVS. That's service! What is better is, as Rod noted:


inserting   5,000 rows:  ~3,589 rows/sec
inserting  10,000 rows:  ~6,309 rows/sec
inserting  50,000 rows:  ~9,498 rows/sec
inserting 100,000 rows: ~10,892 rows/sec
inserting 200,000 rows: ~11,300 rows/sec

Not only does this show 33 times the throughput in the 5,000 row test, but
the throughput gets significantly better as the number of inserts per
transaction increases (approaching some limit, of course).  The 200,000 row
tests show more than 3 times the throughput of the 5,000 row tests, in
contrast to the "before" results Nick experienced (a ~25% decrease in
throughput between 4,500 and 10,000 rows).

All I need to do is have a play.

I have actually done some benchmarking of the insert speed of various embedded Java databases. I'll try and write that up sometime this
weekend. From the numbers Rod was getting, Axion should be very competitive.
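The reason rows-per-transaction matters so much is that committing after every insert forces a comparatively slow flush each time. The pattern behind numbers like Rod's looks something like this generic JDBC sketch – the table name is a placeholder, and nothing here is Axion-specific:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchInsert {

    /**
     * Inserts all the words in a single transaction using a batched
     * PreparedStatement. Committing once per batch, rather than once
     * per row, is what lets throughput climb as the batch size grows.
     */
    public static void insertWords(Connection conn, String[] words)
            throws SQLException {
        conn.setAutoCommit(false); // one transaction for the whole batch
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO word_ratings (word) VALUES (?)");
        try {
            for (String word : words) {
                ps.setString(1, word);
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            ps.close();
        }
    }
}
```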

Axion

Axion is an embedded Java database – similar to HypersonicSQL or Mckoi – but possibly not as mature or as well known.

I was planning to use it for Classifier4J demos because of the license of Mckoi and
because I think the “log the SQL” architecture of HypersonicSQL is weird, but I've run into an interesting performance issue. Basically, insert performance for persistent tables is atrocious. In some basic tests I averaged 40 inserts per
second over roughly 4500 rows (JDK 1.4.1, Athlon 2000+, 512MB RAM). The most concerning thing was that as more rows were inserted, performance got worse – when I inserted 10000 rows I only got 30 rows per second.

First Real Test of Classifier4J

I'm sitting here watching Classifier4J run through a sample of the saved blogs from Javablogs.

Given the extremely limited set of data I trained it on, I think it's doing fairly reasonably. For instance, this post
from Ted Neward is (accurately) classified as about Java (rated 0.99), while this (for instance) is not (rated 0.01).

It's not perfect – it missed this, for instance (rated 0.01),
but all in all I'm pretty happy with how accurately it performed.

I need to work on performance, though – I'm keeping all my word ratings in a MySQL database, and doing a DB lookup for every word is killing it.

(later…) maybe I'll write a weak hash map word rating datasource that will cache the most common words.
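That caching idea might look something like the sketch below. The WordsDataSource interface name here is my own placeholder, not necessarily what Classifier4J will ship; a WeakHashMap lets frequently-seen words skip the database round trip while still leaving the garbage collector free to evict entries under memory pressure.

```java
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical interface for looking up a word's probability rating. */
interface WordsDataSource {
    double getWordProbability(String word);
}

/**
 * Caches ratings from a slow backing source (e.g. a MySQL lookup) in a
 * WeakHashMap, so repeated words are answered from memory.
 */
class CachingWordsDataSource implements WordsDataSource {
    private final WordsDataSource backing;
    private final Map<String, Double> cache = new WeakHashMap<String, Double>();

    CachingWordsDataSource(WordsDataSource backing) {
        this.backing = backing;
    }

    public double getWordProbability(String word) {
        Double cached = cache.get(word);
        if (cached == null) {
            cached = backing.getWordProbability(word); // slow path: hit the DB
            cache.put(word, cached);
        }
        return cached;
    }
}
```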

Release of Classifier4J

I'm very close to the initial release of Classifier4J.
I've decided that it is more important
to get the code out there so people can have a play rather than supply lots
of additional demos, tools, documentation
or additional functionality. It's somewhat against my nature to release
something I'm not 100% happy with, but
Release Early, Release Often.

I think that since the API is so simple (for the most basic use you need
to use 1 method on 1 interface) I can get
away with minimal documentation, but I would like to release more demos
and/or tools. For 0.1 I'll probably release a
training tool that loads “match” words from text files in one directory,
“non-matches” from another and does little else.
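For the most basic use, the idea is that calling code only ever touches one method on one interface. Something like the following toy sketch – the names IClassifier and KeywordClassifier here are illustrative guesses on my part, not the released API:

```java
/**
 * Hypothetical minimal classifier interface: one method, returning a
 * probability between 0.0 and 1.0 that the input matches.
 */
interface IClassifier {
    double classify(String input);
}

/** Toy implementation: scores by the presence of a single keyword. */
class KeywordClassifier implements IClassifier {
    private final String keyword;

    KeywordClassifier(String keyword) {
        this.keyword = keyword.toLowerCase();
    }

    public double classify(String input) {
        return input.toLowerCase().contains(keyword) ? 0.99 : 0.01;
    }
}
```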

About Me

I'm Nick, 28, from Adelaide, Australia. I'm married (for nearly 6 months) to a
wonderful wife.

I work as a Java developer for a mid-sized financial software company here in
Adelaide. I've been doing Java for 3 years, and before that I did Delphi (which I still
have a soft spot for).