All posts by Nick Lothian

An indisputable benefit of Apache Geronimo

Apache Geronimo is (as you'd expect/hope) attempting to reuse as much code
as possible from assorted Jakarta projects.

One benefit of this is that it could put some pressure on some of the
smaller Jakarta projects (especially Jakarta Commons) to actually do some
releases instead of requiring people to use DEV or CVS versions of the code.

Unfortunately, at the moment Geronimo is using Maven for builds which tends to
make it pretty easy to use non-release versions of various jars.

(I'll let Hani deal with anyone who thinks that using CVS versions of
software is a good idea, but here's a hint: Xerces, Batik and FOP jar hell.)

A Java bug worthy of your vote

At work we've been stung (fortunately on an internal system only) by a JDK
bug which we weren't aware of. Did you know that under JDK1.4.1 once 2036
files are open any subsequent opens will delete the file that was supposed
to be opened? Obviously this is NOT WHAT WAS SUPPOSED TO HAPPEN!! Note that
the file that gets deleted could be a class, a jar or something like web.xml
– it gets deleted! Tracking down THAT bug was NOT FUN. We found it in a
web-app. It was behaving weirdly, so I attempted to restart it, which meant
the web.xml was re-read, and deleted!

See
http://developer.java.sun.com/developer/bugParade/bugs/4779905.html
for the gory details and a test case.

Even if you are careful with making sure you close your files, you can never
be sure about 3rd party components.
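For your own code, at least, the habit looks something like the sketch below. It uses try-with-resources, which only arrived in Java 7 (on JDK 1.4 the equivalent is a finally block), and the class and method names are made up for illustration. It does not work around the JDK bug itself; it only keeps your own code from leaking file handles.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not from any library: reads the first line of a file
// and always releases the file handle, even if reading throws.
class FirstLineReader {

    static String firstLine(Path file) throws IOException {
        // try-with-resources closes the reader when the block exits,
        // whether normally or via an exception.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line = reader.readLine();
            return line == null ? "" : line;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("handle-demo", ".txt");
        try {
            Files.write(tmp, "hello\nworld\n".getBytes(StandardCharsets.UTF_8));
            // Each call closes its own handle, so repeated opens do not pile up.
            for (int i = 0; i < 5000; i++) {
                firstLine(tmp);
            }
            System.out.println(firstLine(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

None of this helps with handles leaked inside a third-party jar, which is exactly why the bug is so nasty.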

Fortunately, experimentation has shown that it seems to be fixed in JDK1.4.2 –
although the bug isn't marked as closed. Nice to see Sun keeps track of what
bugs they fix…

More on JSR168

As mentioned previously, I'm currently
implementing JSR168. I've now got an implementation that does most of what we need at work (Portlets work, can switch modes,
can switch window states etc).

I'm pretty happy with the spec – it appears to be pretty well thought out. There are a couple of small issues (how
init parameters and portlet preferences relate, for instance, and I noticed one inconsistency between the Javadoc and the
Spec which I can't remember right now), but mostly it has been an enjoyable experience.

I've noticed that Hani has done an
implementation, and
Rickard is doing one too.

I'm wondering what people are going to do when the reference implementation becomes available.
At the moment the jetspeed2 CVS is un-buildable because it requires the Pluto jars from IBM,
which aren't released yet. The Pluto jars contain the reference implementation for a JSR168 portlet container.
When they become available will people switch to them? I probably intend
to, depending on how easily we can embed them into our infrastructure. I won't know for sure until I see the code, though,
and there's no real solid date on when that will happen. In the meantime, I'm working on a JSP/Taglib based Portlet aggregator,
with no way of knowing how it will need to tie into the reference container.

(BTW, is there a mailing list or something that other container implementors discuss aggregator design on? I'd love
to pick the brains of those who have been-there-done-that…)

I think I have just invented something new

Most people who know me think I'm fairly creative (or crazy, depending on how you look at it). Once I heard someone
say that only 10% of the ideas that most people have are any good. My theory has always been
that this means there are two methods of maximising the number of good ideas I could possibly have, and I'm going to make sure
I use both of them. Bear that in mind when you read the rest of this…..

I have always been aware that most of my ideas are very derivative – I just repackage something someone else invented.
Five minutes ago, though, I think I came up with an idea for a program that no one else has ever had, let alone implemented.

I call it the Fiki. Imagine a Wiki, except that instead of creating links LikeSo, the software analyzes each
phrase and automagically creates links to pages that talk about that phrase. How cool would that be?

I think that is the best idea I've ever had – even better than my idea for a swimming cap and goggles all-in-one which
I had before Speedo came out with it at the Atlanta Olympics. I can't understand why no-one used it – I guess it did
look kind of odd…. ;-)

Anyhow – what do people think? Is this a good idea, or do I need to return to reality and become a chef or something?

Implementing JSR168

At my day job, I (somewhat to my shock/joy) appear to be implementing a JSR168 (i.e., Java portlet) container. I probably
won't do the whole spec – we need it so we can deploy portlets on it until there are better options, but I do aim to get
it as complete as possible.

Believe it or not, I'm pretty sure I'm going to be able to do the whole thing inside a servlet. Has anyone else done
this? Any tips or pointers?

If only Pluto was in a useful state…

On JavaBlogs

There has
been a bit of recent discussion about the fact that
as JavaBlogs grows it is changing, with a few problems with what some people see as
low quality posts.

Gerard has outlined the four main methods of making
a community scale, but I would like to suggest a fifth. I believe that automated text categorisation can increase the
size a community can scale to without requiring non-software intervention.

I've done some
experimentation with using text analysis algorithms
for simple match/non-match categorisation. I believe something as simple as Bayesian classification for blog posts can go some way
to improving the quality of links on the “Hot List”.

Today's Java.Blogs posts

Ultimately, I think that some of the more advanced text categorisation algorithms
might be even more useful. For instance, Google News manages to categorise its stories fairly well, and I believe they do most
of that automatically. NewsInEssence categorises news into “clusters” automatically.
A quick look on citeseer shows plenty of
algorithms around, and I'm pretty sure the author of Classifier4J
might be interested in implementing at least one.
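To give a feel for how little is needed for the simple match/non-match case, here is a minimal sketch of a two-class naive Bayes classifier with add-one smoothing. This is not Classifier4J's API or its exact algorithm; the class name, method names and tokenising rule are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative two-class naive Bayes text classifier (match / non-match).
// Hypothetical code: none of these names come from Classifier4J.
class TinyBayes {
    private final Map<String, Integer> matchCounts = new HashMap<>();
    private final Map<String, Integer> otherCounts = new HashMap<>();
    private int matchWords, otherWords; // total words seen per class
    private int matchDocs, otherDocs;   // training documents seen per class

    // Crude tokeniser: lower-case, split on anything that is not a letter or digit.
    private static String[] tokens(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    void train(String text, boolean isMatch) {
        for (String word : tokens(text)) {
            if (word.isEmpty()) continue;
            if (isMatch) { matchCounts.merge(word, 1, Integer::sum); matchWords++; }
            else { otherCounts.merge(word, 1, Integer::sum); otherWords++; }
        }
        if (isMatch) matchDocs++; else otherDocs++;
    }

    // Returns P(match | text), computed in log space with add-one smoothing.
    double matchProbability(String text) {
        Set<String> vocabulary = new HashSet<>(matchCounts.keySet());
        vocabulary.addAll(otherCounts.keySet());
        int v = vocabulary.size();
        if (v == 0) return 0.5; // untrained: no evidence either way

        double logMatch = Math.log((matchDocs + 1.0) / (matchDocs + otherDocs + 2.0));
        double logOther = Math.log((otherDocs + 1.0) / (matchDocs + otherDocs + 2.0));
        for (String word : tokens(text)) {
            if (word.isEmpty()) continue;
            logMatch += Math.log((matchCounts.getOrDefault(word, 0) + 1.0) / (matchWords + v));
            logOther += Math.log((otherCounts.getOrDefault(word, 0) + 1.0) / (otherWords + v));
        }
        return 1.0 / (1.0 + Math.exp(logOther - logMatch));
    }

    public static void main(String[] args) {
        TinyBayes bayes = new TinyBayes();
        bayes.train("java servlet portlet code", true);
        bayes.train("holiday photos beach cat", false);
        System.out.println(bayes.matchProbability("new portlet code"));
        System.out.println(bayes.matchProbability("beach holiday"));
    }
}
```

A "Hot List" filter built on something like this would train on posts readers marked as interesting and only promote posts scoring above some cut-off.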

Text Summary Webapp.

laughingmeme pointed at my post on Classifier4J's
text summary API today, and did a nice comparison with the OS X and Open Text summarizers.
Unfortunately, the author couldn't run Classifier4J, so I've made a web-app available to test.

It's ugly, it's nasty, but it mostly works. Try playing with the number of sentences parameter, because if you stick
with 1 sentence you tend to get the first sentence most of the time. Enjoy, and let me know your comments.

Example (from the java.util.Collection javadocs):

The root interface in the collection hierarchy. A collection represents a group of objects, known as its elements. Some collections allow duplicate elements and others do not. Some are ordered and others unordered. The SDK does not provide any direct implementations of this interface: it provides implementations of more specific subinterfaces like Set and List. This interface is typically used to pass collections around and manipulate them where maximum generality is desired.

Bags or multisets (unordered collections that may contain duplicate elements) should implement this interface directly.

All general-purpose Collection implementation classes (which typically implement Collection indirectly through one of its subinterfaces) should provide two “standard” constructors: a void (no arguments) constructor, which creates an empty collection, and a constructor with a single argument of type Collection, which creates a new collection with the same elements as its argument. In effect, the latter constructor allows the user to copy any collection, producing an equivalent collection of the desired implementation type. There is no way to enforce this convention (as interfaces cannot contain constructors) but all of the general-purpose Collection implementations in the SDK comply.

A three-sentence summary gives:

The root interface in the collection hierarchy. A collection represents a group of objects, known as its elements. All general-purpose Collection implementation classes (which typically implement Collection indirectly through one of its subinterfaces) should provide two “standard” constructors: a void (no arguments) constructor, which creates an empty collection, and a constructor with a single argument of type Collection, which creates a new collection with the same elements as its argument.

which I think is rather good.

Text Summaries in Java

Ted Leung's post about the text summarisation in MacOS X got me
back working on the text summarisation in Classifier4J.

I committed an early cut of the code tonight – it works pretty well, but needs a lot of optimisation.

It allows you to specify how many sentences you want the summary to be. Here's a summary of Ted's post in two sentences:

John Robb linked to DEVONthink which is a free form information manager for MacOS X. One thing that I noticed while reading the pages is that Mac OS X has a text summarization service built in.

Here it is with three:

John Robb linked to DEVONthink which is a free form information manager for MacOS X. It looks like you just dump all your information in there and turn it's recognizers loose and it sorts it all out for you. One thing that I noticed while reading the pages is that Mac OS X has a text summarization service built in.

and this is four:

John Robb linked to DEVONthink which is a free form information manager for MacOS X. It looks like you just dump all your information in there and turn it's recognizers loose and it sorts it all out for you. One thing that I noticed while reading the pages is that Mac OS X has a text summarization service built in. This is a great thing to have as a system service.

Apparently, the MacOS X service comes up with:

It looks like you just dump all your information in there and turn it's recognizers loose and it sorts it all out for you.
One thing that I noticed while reading the pages is that Mac OS X has a text summarization service built in. I've been looking for something like that for a long time.
…It turns out that the Open Text Summarization library being used in AbiWord is now up on SourceForge.

That might be a bit better than the Classifier4J output, but not too much. Mentioning the Open Text Summarization library is
useful, but I think Classifier4J's choice of “This is a great thing to have as a system service.” instead of
“I've been looking for something like that for a long time.” is better. I also think the Classifier4J summary
makes better sense than the OS X one, because the first sentence provides better context – your mileage may vary,
though.

The code for this is available from the Classifier4J CVS archive in the net.sf.classifier4J.summariser (note the spelling!)
package. If it doesn't appear to be there, that's just the STOOPID sourceforge CVS backup thing – they run the Anon CVS access off the backup server, so it takes a day for
it to get copied over.
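If you can't get at the CVS version, the general idea of picking a fixed number of sentences can be sketched as below. This is a generic word-frequency extractive summariser, not the Classifier4J code: the class name, the scoring rule and the "skip words shorter than four letters" stop-word hack are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative frequency-based extractive summariser (hypothetical, not Classifier4J).
class TinySummariser {

    // Lower-cased words of four or more characters; short words are dropped
    // as a crude stand-in for a stop-word list.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (w.length() >= 4) out.add(w);
        }
        return out;
    }

    static String summarise(String text, int sentenceCount) {
        // Naive sentence split: whitespace that follows '.', '!' or '?'.
        String[] sentences = text.trim().split("(?<=[.!?])\\s+");

        // Word frequencies over the whole input.
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words(text)) freq.merge(w, 1, Integer::sum);

        // Score each sentence by the summed frequency of its distinct words.
        int[] score = new int[sentences.length];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) {
            for (String w : new HashSet<>(words(sentences[i]))) {
                score[i] += freq.getOrDefault(w, 0);
            }
            order.add(i);
        }

        // Highest score first; the sort is stable, so ties keep the earlier sentence.
        order.sort((a, b) -> Integer.compare(score[b], score[a]));
        int keep = Math.max(0, Math.min(sentenceCount, order.size()));
        List<Integer> chosen = new ArrayList<>(order.subList(0, keep));
        Collections.sort(chosen); // put the chosen sentences back in original order

        StringBuilder summary = new StringBuilder();
        for (int index : chosen) {
            if (summary.length() > 0) summary.append(' ');
            summary.append(sentences[index]);
        }
        return summary.toString();
    }

    public static void main(String[] args) {
        String text = "Cats sleep. Java portlets run inside a portlet container. "
                + "The container manages portlets. Dogs bark.";
        System.out.println(summarise(text, 2));
    }
}
```

Putting the chosen sentences back in their original order matters more than you might expect: a summary that shuffles the sentences reads much worse even when the same sentences are picked.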

More Bayesian Blog classification

The more I use the Classifier4J-enabled version of NNTP//RSS,
the more convinced I am that this is a useful innovation.

Now that I've ironed out a few bugs in the initial version, I'm making it available for people to play with. Note this isn't
an official NNTP//RSS release or patch: use at your own risk, and all the bugs are mine and not Jason's.
In particular the user interface for classification (tick a check-box if you like an article and press “Classify”) is kind
of crude. However, I am using it as my primary aggregator.

Instructions:

  1. Download and install NNTP//RSS version 0.3 from sourceforge.
  2. Download my patch.
  3. Unzip my patch over the top of the NNTP//RSS installation.
  4. Point your newsreader at NNTP//RSS. Train the classifier by going through a few blogs and classifying the articles.
  5. Marvel at how well something so simple works.

There is a slight performance hit every time you read a new blog in your newsreader for the first time – this is Classifier4J working.
Items which are considered matches have [ClassifierMatch] appended to the subject line. An additional header “Match-Probability” is also
provided which shows how well an item “matches”.
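Roughly speaking, the tagging step amounts to something like the sketch below. This is not the patch's actual code: the class name, the 0.9 cut-off, the space before the tag and the two-decimal header format are all assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical sketch of the tagging step; not taken from the NNTP//RSS patch.
class MatchTagger {
    static final String TAG = "[ClassifierMatch]";

    // Assumed cut-off: the real patch may use a different value.
    static final double THRESHOLD = 0.9;

    // Appends the tag to the subject line when the item counts as a match.
    static String tagSubject(String subject, double matchProbability) {
        return matchProbability >= THRESHOLD ? subject + " " + TAG : subject;
    }

    // Builds the extra header line; the number format is an assumption.
    static String probabilityHeader(double matchProbability) {
        return String.format(Locale.ROOT, "Match-Probability: %.2f", matchProbability);
    }

    public static void main(String[] args) {
        System.out.println(tagSubject("More on JSR168", 0.97));
        System.out.println(probabilityHeader(0.97));
    }
}
```

Tagging the subject line means any newsreader can filter or sort on matches without knowing anything about the classifier.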