All posts by Nick Lothian

Using Maven

I mavenised Classifier4J tonight. I wasn't planning to, but I decided I
needed an automatic way of generating javadocs, and I couldn't remember the Ant javadoc syntax. Then I decided I should
auto-generate a website, too… so one thing led to another and I ended up trying Maven.

I've had some bad experiences with Maven in the past – basically due to the lack of documentation in early versions. I couldn't get a few projects to build, and couldn't figure out where to even start investigating the problem. I found that very frustrating and gave up on them in disgust.

Tonight's experience was very different. The documentation is decent enough that I managed to get my project to build the
second time I tried. It was unsuccessful the first time as the latest version of commons-logging.jar wasn't downloading – it turned out it wasn't available in the Maven repository. Once I figured that out it was pretty easy to change the version required.

I guess the bad news is that the documentation still doesn't help much with customisation. For instance, I don't agree with a lot of the default settings for the checkstyle report which is run by default. I can't figure out how to either turn the report off, or modify the settings, and the only documentation I could find on this wasn't very helpful. On the other hand it is 12:18, and I should go to bed – so I guess that could have something to do with it, too.

LDAPd – A Java LDAP Server

LDAPd looks pretty cool. It is a Java based LDAP
implementation, based on SEDA
(which I liked when I first read the paper, but I've never actually seen used anywhere) and
Apache Phoenix, which strikes fear
into my heart.

Actually, Phoenix (and Avalon as a whole) always seems very cool, but there is more politics
per line of code than probably any project I've seen (including legendary flamefests like
EMACS vs XEMACS and the OpenBSD split from NetBSD).

Don't let that put you off, though! The code itself seems to be pretty decent, and an
LDAP server would be useful for quite a lot of things.

Java Vector Space Search and Latent Semantic Indexing

Ted Leung pointed at Latent Semantic Indexing today, which got me reading some papers. The patent situation is
unfortunate, because it is a pretty nice technique. Then, thanks to Technorati, I found this
which led to Building a Vector Space Search Engine in Perl.

Now this isn't quite latent semantic indexing, but it uses some of the same techniques.
I'm not sure what the patent situation is – this seems fairly trivial, but who knows? Either
way, this technique is really, really good for those times where you want to categorise
text into a number of potential categories, mainly because it isn't too resource intensive.
Compared to Bayesian classification, it appears that the algorithm should be much, much
quicker than even the best Bayesian implementation.

I figured that anything they can do in Perl, I can do in Java, so I'd like to present
my very ugly Java version.
This isn't nice code: it doesn't do stemming (note to self –
look at using Lucene's stemming code) and it uses doubles instead of BigDecimals (and/or BitSets),
but it appears to work. I haven't done vector math for a long time, so I might have screwed that up
somewhere, too.

However, it's something I'm going to look at more in the future. I'll probably
build a classifier for Classifier4J based
on it, and compare it to my Bayesian classifier.
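
For anyone who hasn't met the technique, here is a minimal sketch of the general vector space idea. It is not the Java version linked above: the class name, the method names and the sample strings are made up for illustration, and it assumes raw term counts compared by cosine similarity, with no stemming or stop-word removal.

```java
import java.util.HashMap;
import java.util.Map;

class VectorSpaceSketch {

    /** Builds a sparse term-frequency vector: lower-cased word -> occurrence count. */
    static Map<String, Double> termVector(String text) {
        Map<String, Double> vector = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                vector.merge(token, 1.0, Double::sum);
            }
        }
        return vector;
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Cosine of the angle between two sparse vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Double> query = termVector("java search engine");
        Map<String, Double> docA = termVector("Lucene is a search engine written in Java");
        Map<String, Double> docB = termVector("Maven builds and documents projects");
        // docA shares three terms with the query; docB shares none, so it scores 0.0.
        System.out.println("docA: " + cosine(query, docA));
        System.out.println("docB: " + cosine(query, docB));
    }
}
```

To use something like this for categorisation, you would build one vector per category from sample text and pick the category whose vector scores highest against the new document. Each comparison is just a dot product over the shared terms, which is why the approach stays cheap.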

1.5 Million Words in JavaBlogs?

The 1.5 million “words”
I claimed JavaBlogs has need to be clarified
slightly. They are “words” as defined by running String.split("\\W")
on all the posts archived. The \W regular expression is defined as “a
non-word character” – any character that is not in “a-zA-Z_0-9”. For normal
English sentences from a book that is probably a reasonable definition –
however when used on blogs where there is a large number of URLs it doesn't
quite work. For instance, we suddenly find that “http” is one of the most
popular “words” in the English language. That's because all URLs are split
on their non-word characters – so http://www.javablogs.com is split into
“http”, “www”, “javablogs” & “com”. Also, dates like 2-May-2003 or
25/12/2002 are split on the “-” and “/” characters, so “2002” and “2003” are
very common words.

My current thoughts are to try splitting on “\s” – i.e. whitespace.
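
To make the difference concrete, here is a small sketch comparing the two splits on a made-up sentence. The class name, the helper method and the sample text are mine, not the code that produced the 1.5 million figure. I've used “\s+” rather than a bare “\s” and dropped empty tokens, because a run of adjacent separators (the “://” in a URL, say) otherwise yields empty strings that would also get counted as “words”.

```java
import java.util.ArrayList;
import java.util.List;

class WordSplitDemo {

    /** Splits text on the given regex and drops the empty tokens that adjacent separators produce. */
    static List<String> tokens(String text, String regex) {
        List<String> result = new ArrayList<>();
        for (String token : text.split(regex)) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String post = "See http://www.javablogs.com on 2-May-2003";

        // Splitting on non-word characters breaks up the URL and the date.
        System.out.println(tokens(post, "\\W"));
        // prints [See, http, www, javablogs, com, on, 2, May, 2003]

        // Splitting on whitespace keeps them intact.
        System.out.println(tokens(post, "\\s+"));
        // prints [See, http://www.javablogs.com, on, 2-May-2003]
    }
}
```

The catch with the whitespace split is that punctuation stays attached to ordinary words (“work.” and “work” become different tokens), so some trimming would probably still be needed.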

Egothor – A Java Based Search Engine

Egothor is a Java based search engine. I've never tried it, but I presume it
is similar to Lucene, which I like very much.

At work we have a Java-based Wiki (which I wrote), with the backend using text files and Lucene. Using a search
engine instead of a database is an interesting approach, and I chose it because I wanted to learn about Lucene (I
don't normally make architectural decisions based on what I want to learn, but I started this on my own time, then showed
it to some people at work, who started submitting patches to tie it into various things, and it just grew…).
I wouldn't recommend using a search engine for financial or other typically relational type data,
but for some classes of applications it works pretty well.

It also allowed me to integrate a lot of the data we have in various locations into one knowledge-base. By using
Apache POI, I index all the MS Word documents available on our
network (I can't remember how many, but the index is 40 meg), and make them available via the Wiki for searching.

The next step is better classification and linking of information. Classifier4J
might help me with the classification, and I can extract some extra linkage data out of assorted task, time and bug databases
we have at work, but I also need to figure out how to customize the Lucene scoring algorithm.

EclipseUML

I tried EclipseUML today. It was bloody impressive, too. I'm not a big fan of the whole Rational Unified Process or of modelling as a whole, but I do think that UML is a good tool for easy communication of the relationship between classes.

EclipseUML is so well integrated into the Eclipse IDE itself that I found myself reverse engineering packages to understand the classes in them rather than reading the code. Since the diagram browser and the source navigator are integrated it seems logical to use the UML diagrams as a code-navigation tool – it's quicker to find a sub-class in a diagram than by searching for classes which inherit from something, for instance. I've never really seen a tool that does the integration thing so well – MS Visual Modeller doesn't really come close, and Rose/Visio/ArgoUML etc make modelling such a chore I hate to do it.

Eclipse on Mono

I've been following ikvm for a while now, partly out
of interest (a Java VM running on .NET? How cool is that?), but also because I deal with some
tricky Java->COM communication issues in my day job, and anything that goes on in that area I want
to know about.

Anyway, it's recently been announced that Eclipse now runs on the Mono .NET VM.

I love the concept of VMs running on VMs… I'll have to download IKVM, and try and get
Joeq running under it.

Eclipse, CVS & Sourceforge

Well, Eclipse CVS support rocks. It took me about 10 minutes to do my first CVS check-in
via Eclipse. From a previous experience trying to get CVS over SSH working under Windows, I wasn't looking forward to it.

For the record, use the connection type “extssh”, and fill everything else in as you'd expect and it just works.

Another interesting thing about CVS at SourceForge is that it can be accessed via
ports 80 & 443. That could be useful to know – at work in the past I've had to use CVSGrab to get the latest builds for various projects.