Archive for July, 2003

Implementing JSR168

At my day job, I (somewhat to my shock/joy) appear to be implementing a JSR168 (i.e., Java portlet) container. I probably
won't do the whole spec - we need it so we can deploy portlets on it until there are better options - but I do aim to get
it as complete as possible.

Believe it or not, I'm pretty sure I'm going to be able to do the whole thing inside a servlet. Has anyone else done
this? Any tips or pointers?

If only Pluto was in a useful state…

Comments

On JavaBlogs

There has been a bit of recent discussion about the fact that JavaBlogs is changing as it grows,
with a few problems with what some people see as low quality posts.

Gerard has outlined the four main methods of making a community scale, but I would like to suggest a fifth.
I believe that automated text categorisation can increase the size a community can scale to without
requiring non-software intervention.

I've done some experimentation with using text analysis algorithms for simple match/non-match categorisation.
I believe something as simple as Bayesian classification for blog posts can go some way
to improving the quality of links on the “Hot List”.
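
To make the match/non-match idea concrete, here is a minimal sketch of a naive Bayesian text classifier in Java. It is an illustration only and not Classifier4J's code: the class name `TinyBayes`, the method names, the equal priors and the add-one smoothing are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy match/non-match Bayesian classifier (hypothetical, for illustration). */
final class TinyBayes {
    private final Map<String, Integer> matchCounts = new HashMap<>();
    private final Map<String, Integer> nonMatchCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int matchTotal = 0;
    private int nonMatchTotal = 0;

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    private static String[] tokens(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    /** Records the words of a post the reader liked. */
    void teachMatch(String text) {
        for (String w : tokens(text)) {
            if (w.isEmpty()) continue;
            matchCounts.merge(w, 1, Integer::sum);
            vocabulary.add(w);
            matchTotal++;
        }
    }

    /** Records the words of a post the reader did not like. */
    void teachNonMatch(String text) {
        for (String w : tokens(text)) {
            if (w.isEmpty()) continue;
            nonMatchCounts.merge(w, 1, Integer::sum);
            vocabulary.add(w);
            nonMatchTotal++;
        }
    }

    /** Estimates P(match | text), assuming equal priors and add-one smoothing. */
    double probability(String text) {
        if (vocabulary.isEmpty()) return 0.5; // nothing learned yet
        int v = vocabulary.size();
        double logMatch = 0.0;
        double logNonMatch = 0.0;
        for (String w : tokens(text)) {
            if (w.isEmpty()) continue;
            logMatch += Math.log((matchCounts.getOrDefault(w, 0) + 1.0) / (matchTotal + v));
            logNonMatch += Math.log((nonMatchCounts.getOrDefault(w, 0) + 1.0) / (nonMatchTotal + v));
        }
        // Convert the two log-likelihoods back into a probability between 0 and 1.
        return 1.0 / (1.0 + Math.exp(logNonMatch - logMatch));
    }

    public static void main(String[] args) {
        TinyBayes classifier = new TinyBayes();
        classifier.teachMatch("java servlet portlet container");
        classifier.teachNonMatch("cheap holiday photos cat");
        System.out.println(classifier.probability("new portlet container for java"));
    }
}
```

A "Hot List" filter built on something like this would presumably only show posts whose probability is above some cutoff; choosing that cutoff is a tuning decision the sketch leaves open.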

Today's Java.Blogs posts

Ultimately, I think that some of the more advanced text categorisation algorithms
might be even more useful. For instance, Google News manages to categorise its stories fairly well, and I believe they do most
of that automatically. NewsInEssence categorises news into “clusters” automatically.
A quick look on citeseer shows plenty of algorithms around, and I'm pretty sure the author of Classifier4J
might be interested in implementing at least one.

Comments

Text Summary Webapp.

laughingmeme pointed at my post on Classifier4J's
text summary API today, and did a nice comparison with the OS X and Open Text summarizers.
Unfortunately, the author couldn't run Classifier4J, so I've made a web-app available to test.

It's ugly, it's nasty, but it mostly works. Try playing with the number of sentences parameter, because if you stick
with 1 sentence you tend to get the first sentence most of the time. Enjoy, and let me know your comments.

Example (from the java.util.Collection javadocs):

The root interface in the collection hierarchy. A collection represents a group of objects, known as its elements. Some collections allow duplicate elements and others do not. Some are ordered and others unordered. The SDK does not provide any direct implementations of this interface: it provides implementations of more specific subinterfaces like Set and List. This interface is typically used to pass collections around and manipulate them where maximum generality is desired.

Bags or multisets (unordered collections that may contain duplicate elements) should implement this interface directly.

All general-purpose Collection implementation classes (which typically implement Collection indirectly through one of its subinterfaces) should provide two “standard” constructors: a void (no arguments) constructor, which creates an empty collection, and a constructor with a single argument of type Collection, which creates a new collection with the same elements as its argument. In effect, the latter constructor allows the user to copy any collection, producing an equivalent collection of the desired implementation type. There is no way to enforce this convention (as interfaces cannot contain constructors) but all of the general-purpose Collection implementations in the SDK comply.

A three sentence summary gives:

The root interface in the collection hierarchy. A collection represents a group of objects, known as its elements. All general-purpose Collection implementation classes (which typically implement Collection indirectly through one of its subinterfaces) should provide two “standard” constructors: a void (no arguments) constructor, which creates an empty collection, and a constructor with a single argument of type Collection, which creates a new collection with the same elements as its argument.

which I think is rather good.
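
For readers curious how a summariser with a "number of sentences" parameter might work, here is a minimal frequency-based sketch in Java. It is not the Classifier4J implementation: the class name `SketchSummariser`, the sentence-splitting regex and the rule of ignoring words of three letters or fewer are all assumptions made for illustration. The idea is to rank words by how often they occur, take the first sentence that contains each top-ranked word, and emit the chosen sentences in their original order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical extractive summariser, for illustration only. */
final class SketchSummariser {

    /** Returns up to maxSentences sentences from text, in their original order. */
    static String summarise(String text, int maxSentences) {
        // Naive sentence split: break after '.', '!' or '?' followed by whitespace.
        String[] sentences = text.trim().split("(?<=[.!?])\\s+");

        // Count word frequencies; LinkedHashMap keeps first-seen order for ties.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words(text)) {
            counts.merge(w, 1, Integer::sum);
        }

        // Most frequent words first (List.sort is stable, so ties keep text order).
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> counts.get(b) - counts.get(a));

        // For each top word, pick the first sentence that mentions it.
        TreeSet<Integer> chosen = new TreeSet<>();
        for (String word : ranked) {
            if (chosen.size() >= maxSentences) break;
            for (int i = 0; i < sentences.length; i++) {
                if (words(sentences[i]).contains(word)) {
                    chosen.add(i);
                    break;
                }
            }
        }

        // TreeSet iterates in ascending index order, i.e. original sentence order.
        StringBuilder summary = new StringBuilder();
        for (int index : chosen) {
            if (summary.length() > 0) summary.append(' ');
            summary.append(sentences[index]);
        }
        return summary.toString();
    }

    /** Lower-cased words longer than three letters (a crude stand-in for stop words). */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (w.length() > 3) result.add(w);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Cats sleep a lot. Dogs bark at cats. Birds sing.";
        System.out.println(summarise(text, 2));
    }
}
```

Under this scheme the most frequent word often first appears in the opening sentence, which may be one reason a one-sentence summary so often returns the first sentence.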

Comments

Text Summaries in Java

Ted Leung's post about the text summarisation in MacOS X got me
back working on the text summarisation in Classifier4J.

I committed an early cut of the code tonight - it works pretty well, but needs a lot of optimisation.

It allows you to specify how many sentences you want the summary to be. Here's a summary of Ted's post in two sentences:

John Robb linked to DEVONthink which is a free form information manager for MacOS X. One thing that I noticed while reading the pages is that Mac OS X has a text summarization service built in.

Here it is with three:

John Robb linked to DEVONthink which is a free form information manager for MacOS X. It looks like you just dump all your information in there and turn it's recognizers loose and it sorts it all out for you. One thing that I noticed while reading the pages is that Mac OS X has a text summarization service built in.

and this is four:

John Robb linked to DEVONthink which is a free form information manager for MacOS X. It looks like you just dump all your information in there and turn it's recognizers loose and it sorts it all out for you. One thing that I noticed while reading the pages is that Mac OS X has a text summarization service built in. This is a great thing to have as a system service.

Apparently, the MacOS X service comes up with:

It looks like you just dump all your information in there and turn it's recognizers loose and it sorts it all out for you.
One thing that I noticed while reading the pages is that Mac OS X has a text summarization service built in. I've been looking for something like that for a long time.
…It turns out that the Open Text Summarization library being used in AbiWord is now up on SourceForge.

That might be a bit better than the Classifier4J output, but not too much. Mentioning the Open Text Summarization library is
useful, but I think Classifier4J's choice of “This is a great thing to have as a system service.” instead of
“I've been looking for something like that for a long time.” is better. I also think the Classifier4J summary
makes better sense than the OS X one, because the first sentence provides better context - your mileage may vary,
though.

The code for this is available from the Classifier4J CVS archive in the net.sf.classifier4J.summariser (note the spelling!)
package. If it doesn't appear to be there, that's just the STOOPID sourceforge CVS backup thing - they run the Anon CVS access off the backup server, so it takes a day for
it to get copied over.

Comments

More Bayesian Blog classification

The more I use the Classifier4J-enabled version of NNTP//RSS,
the more convinced I am that this is a useful innovation.

Now that I've ironed out a few bugs in the initial version, I'm making it available for people to play with. Note this isn't
an official NNTP//RSS release or patch; use it at your own risk, and all the bugs are mine and not Jason's.
In particular the user interface for classification (tick a check-box if you like an article and press “Classify”) is kind
of crude. However, I am using it as my primary aggregator.

Instructions:

  1. Download and install NNTP//RSS version 0.3 from sourceforge.
  2. Download my patch.
  3. Unzip my patch over the top of the NNTP//RSS installation.
  4. Point your newsreader at NNTP//RSS. Train the classifier by going through a few blogs and classifying the articles.
  5. Marvel at how well something so simple works.

There is a slight performance hit every time you read a new blog in your newsreader for the first time - this is Classifier4J working.
Items which are considered matches have [ClassifierMatch] appended to the subject line. An additional header “Match-Probability” is also
provided which shows how well an item “matches”.
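
As a rough illustration of the tagging described above, the following Java sketch appends the marker to a subject line and builds the extra header. It is not the code from the patch: the class name `MatchTagger`, the 0.9 cutoff and the two-decimal header format are assumptions chosen for the example.

```java
import java.util.Locale;

/** Hypothetical helper showing how a classified item might be tagged. */
final class MatchTagger {
    /** Assumed cutoff above which an item counts as a match. */
    static final double CUTOFF = 0.9;

    /** Appends the match marker to the subject when the probability reaches the cutoff. */
    static String tagSubject(String subject, double matchProbability) {
        return matchProbability >= CUTOFF ? subject + " [ClassifierMatch]" : subject;
    }

    /** Formats the additional header; the two-decimal format is an assumption. */
    static String probabilityHeader(double matchProbability) {
        return String.format(Locale.ROOT, "Match-Probability: %.2f", matchProbability);
    }

    public static void main(String[] args) {
        System.out.println(tagSubject("Implementing JSR168", 0.97));
        System.out.println(probabilityHeader(0.97));
    }
}
```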

Comments

Classifier4J version 0.3 is now available.

Classifier4J is a Java library that provides an API for automatic
classification of text, including Bayesian classification. Version 0.3 is the first version recommended for general use.

Some of the many improvements include:

  • The ability to train the BayesianClassifier via an ITrainable interface, rather than requiring updates
    to the datasource.
  • Performance and design improvements to the JDBCWordDataSource.
  • Stop Word support.
  • Internal refactoring, particularly with respect to the WordProbability object (thanks to Pete Leschev).

Classifier4J is available from http://classifier4j.sourceforge.net/

Comments

Classifier4J, NNTP//RSS and Bayesian Blog Classification.

I now have Classifier4J and
nntp//rss working together to do Bayesian classification of RSS feeds.
There are a few things still to work out (performance and usability to name two), but I'm pretty pleased with it, since it
was something I whipped up in a couple of hours. AFAIK it is the first Bayesian/RSS thing that has got far enough to have a screenshot…

(Updated to fix link to image)

Comments

More dotNet vs Java

http://weblogs.asp.net/jprismon/posts/9824.aspx raised a couple of valid points that need more than the superficial
comments I posted yesterday.

In particular:

Microsoft has completly committed to .NET. Longhorn's new features are all managed code.

I've done a bit of research about this, and I'm not convinced it is true. While all the new features in
Longhorn (e.g. the file system, Active Directory enhancements etc) will undoubtedly expose managed code interfaces,
I doubt they will themselves be written in .NET. I know the new version of IIS has some of the code moved into the Windows kernel
(I think the correct terminology is “it runs in ring 0”?), and code that performance-optimised is unlikely to be managed.
Note that you wouldn't do it in Java, either, so this isn't a particular weakness of .NET.

Microsoft's most profitable Business Aplications are being ported as we speak. BizTalk, Office,
and the OS all have managed serviced components now, and the next version of SQL will have
extremly rich CLR support.

This is true, and is a big deal. Increasingly I suspect we'll see Office's .NET interfaces used from other applications,
kind of like people used to automate Excel & Word from COM. This time it will be easier, and Office will be designed for
doing the kind of batch-processing & workflow which people want. In the Java world OpenOffice exposes some Java interfaces.
I can't comment on how good they are. However, many databases (Oracle, DB2, Sybase ASE & ASA for example) have
extremely rich Java support. This is pretty mature and looks good in comparison to SQL Server's .NET support.

Interoperability rocks in .NET. Not just platform (mono is doing a great job) but also interop based on the WS-I stack

I don't really understand what is being said here. Interoperability with what? I do a lot of fairly hairy integration work
in my day job and I can speak from experience when I say that 7 Bit ASCII works really well in both .NET & Java, but
anything else seems to have edge cases that have issues (mostly in the crappy proprietary libraries we need to use). SOAP
over HTTP is usually okay from both .NET & Java. Overall, I don't see this as a particular win either way. From the
integration point of view J2EE has the JCA spec which is quite nice - unfortunately you need to rely on your vendor to
supply a JCA compliant connector, though.

Java is at best a niche platform. When was the last time you saw any non server/specialized software written in Java?
Of the top ten software software packages (Windows, Office, SAP, PeopleSoft, Oracle, SQL, Quicken,
Quickbooks, TaxCut, Microsoft Money) how many of them are actually written in java? 0/10. Microsoft
owns 90% of the CPU market. Microsoft has decided to slip .NET until Longhorn, but it is out there in the
hands of extremly productive developers.

This is a fair point (although SAP, PeopleSoft & Oracle all have significant Java components). How many are written
in .NET, though? (0/10) I'll concede that Office & SQL Server will have significant .NET components in the next release,
but that will really only match what is in Oracle, SAP & PeopleSoft right now.

Reflection, Inspection, Attributes and Events. Simpler in .NET, more powerful in .NET.

Yes to Reflection, Inspection & Attributes. I'd also add the dynamic code generator thing .NET has (whatever that is
called), and delegates. JDK1.5 will close this gap somewhat, though, and most of these features can be emulated in Java
right now. I don't know why you'd say .NET events are better.

ASP.net is a solid step up from ASP. Seperate of presentation and business logic is much more solid,
the rendering pipeline is more powerfull, and the security features rock.

Yes, ASP.net is a lot better than ASP. The Java servlet spec compares very well with it, though,
and there are a lot more third party Java tools than for .NET.

Sun fails the Dogfood test. Number of critical applications in Solaris that are or are being ported to Java?
None, ask Sun why that is (not scalable, not fast). How much of Windows is being ported?
The whole Shabang (see Longhorn). I will be happy to re-examine Java seriously for ongoing work when
Sun's rm6 utilities (including the command lines) are written in Java.

True. MS is always pretty good at dogfooding their stuff (except for Visual Source Safe!! What's up with that!!!).
However, I think it is an exaggeration to say MS is writing all of Longhorn in .NET.

Not only that, Sun is now lifting features from .NET, clearly there is some new and cool features here to get
the ever slow sun to actually change their precious language.

I don't think either platform can afford to get into the “you copied this from us” game (cough.. C#… cough…).

Compact Framework. Share code between WinCE devices and your platform. Tie them together via Webservices
with a single click of the mouse.

Java has .NET beaten here. Java is on millions of phones and PDAs right now, and has thousands of applications in use.

Rich clients. Have the interoperability and accessability of the web without stateless programming
enviornment and pretty graphics.

Java has Webstart. However, I'd agree that .NET is a better rich client platform.

Integration. Don't want to rewrite all of your companies security? Use Domains and Roles.
Don't want to implement your own message Queue? Already There. How about Transactions, JIT ACtivation,
automagic threading? Done.

I really don't understand this one. Java is very, very strong in all these areas, with thousands of deployed
applications.

Overall, I'd say on these points it's not a clear win to either platform. The important point is that both
platforms are strong in some areas, and to say that isn't true is just FUD. .NET is a very, very good platform
and you'd be silly to write it off.

Comments

Re: Another ignorant discussion on .Net is 'better' than Java

I read http://freeroller.net/comments/Sosume?anchor=another_ignorant_discussion_on_net today, about
http://weblogs.asp.net/jprismon/posts/9824.aspx and I felt the need to comment (nothing like a good .NET vs Java
argument, is there?)

Exactly which comments are ignorant? I'd say I agree with most of his comments.
I might argue the toss on:

  • (4) - I only have superficial .NET knowledge, so I can't argue too hard
  • (14) - I'd say MIDP & J2ME are stronger in the market than the compact framework.
  • (16) - I don't quite follow the argument here. Security: java has a pretty good case here. Message Queue: use JMS.
    Transactions: JTA. JIT Activation: javax.activation.* Threading: java.lang.Thread

but the rest sounds reasonably accurate to me.

Comments

Use 1.4 RegExp

Mr strayneuron is
complaining about things that require JDK 1.4, in particular Amazon's Web Services API
which uses RegExp.

If it needs a RegExp, why not use JDK1.4? It's been out for over 18 months, is
stable, is faster than 1.3 and has better features. IMHO it is better to use something
in the core API than introduce an external package. I've used ORO fairly extensively,
but I prefer to use 1.4 RegExp when I can, even at work (where I am much more conservative
with choosing mature APIs). It really is a pretty good implementation, and compares
well with anything else out there.

Part of the reason the code he is complaining about is only one line is because
they use 1.4 RegExp. If they had used ORO it would take at least 2 statements, plus an import.
One could argue that regular expressions should have been in the language all along,
which I would have to agree with.
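
To show what "one line" looks like with the JDK 1.4 regular expression support, here is a small sketch. The patterns and the names `RegexDemo`, `looksLikeIsbn` and `firstYear` are made up for the example and are not taken from Amazon's API; only `String.matches` and `java.util.regex` themselves are the real JDK 1.4 additions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical examples of the regular expression API added in JDK 1.4. */
final class RegexDemo {

    /** One-statement check using String.matches, which must match the whole string. */
    static boolean looksLikeIsbn(String candidate) {
        return candidate.matches("\\d{9}[\\dX]");
    }

    /** A precompiled pattern is worth keeping when the same expression is reused. */
    private static final Pattern YEAR = Pattern.compile("\\b(19|20)\\d{2}\\b");

    /** Returns the first four-digit year found in the text, or null if there is none. */
    static String firstYear(String text) {
        Matcher matcher = YEAR.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIsbn("0596000162"));
        System.out.println(firstYear("Archive for July, 2003"));
    }
}
```

No third-party jar or extra import is needed for the `String.matches` form, which is the convenience being argued for above.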

Comments