First Real Test of Classifier4J

I'm sitting here watching Classifier4J run through a sample of the saved blogs from Javablogs.

Given the extremely limited set of data I trained it on, I think it's doing reasonably well. For instance, this post
from Ted Neward is (accurately) classified as being about Java (rated 0.99), while this (for instance) is not (rated 0.01).

It's not perfect – it missed this, for instance (rated 0.01),
but all in all I'm pretty happy with how accurately it performed.

I need to work on performance, though – I'm keeping all my word ratings in a MySQL database, and doing a DB lookup for every word is killing it.

(later…) Maybe I'll write a word rating datasource backed by a weak hash map that will cache the most common words.
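
A rough sketch of that caching idea, under a few assumptions: the class name `CachingWordRatings` and the `slowLookup` function (standing in for the MySQL query) are made up for illustration and are not part of Classifier4J. It also swaps the weak hash map for a small LRU cache built on `LinkedHashMap`, since a `WeakHashMap` drops entries when their keys are garbage collected rather than keeping the most commonly used words.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Hypothetical caching wrapper around a slow word rating lookup. */
public class CachingWordRatings {
    private final ToDoubleFunction<String> slowLookup; // e.g. one DB query per word
    private final Map<String, Double> cache;

    public CachingWordRatings(ToDoubleFunction<String> slowLookup, final int maxEntries) {
        this.slowLookup = slowLookup;
        // accessOrder=true moves a word to the end each time it is read,
        // so the eldest entry is always the least recently used one.
        this.cache = new LinkedHashMap<String, Double>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Double> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached rating, falling back to the slow lookup on a miss. */
    public synchronized double getRating(String word) {
        Double cached = cache.get(word);
        if (cached != null) {
            return cached;
        }
        double rating = slowLookup.applyAsDouble(word);
        cache.put(word, rating);
        return rating;
    }
}
```

With something like this in front of the database, frequent words would only cost a query the first time they are seen (or after they have been evicted).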

Release of Classifier4J

I'm very close to the initial release of Classifier4J.
I've decided that it is more important to get the code out there so people can have a play
than to supply lots of additional demos, tools, documentation or functionality.
It's somewhat against my nature to release something I'm not 100% happy with, but
Release Early, Release Often.

I think that since the API is so simple (for the most basic use you need
to use one method on one interface) I can get
away with minimal documentation, but I would like to release more demos
and/or tools. For 0.1 I'll probably release a
training tool that loads “match” words from text files in one directory and
“non-matches” from another, and does little else.
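
Something along these lines is what I have in mind for the training tool. This is only a sketch: the `Trainable` interface and its `teachMatch`/`teachNonMatch` methods are placeholders invented for this example (the released Classifier4J interface may look different), and it assumes the training files are UTF-8 `.txt` files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class DirectoryTrainer {

    /** Hypothetical training callback; not the real Classifier4J interface. */
    public interface Trainable {
        void teachMatch(String text);
        void teachNonMatch(String text);
    }

    /** Feeds every .txt file in both directories to the target; returns the file count. */
    public static int train(Trainable target, Path matchDir, Path nonMatchDir) throws IOException {
        return feed(matchDir, target, true) + feed(nonMatchDir, target, false);
    }

    private static int feed(Path dir, Trainable target, boolean isMatch) throws IOException {
        int count = 0;
        // The glob skips anything that is not a .txt file.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path file : files) {
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                if (isMatch) {
                    target.teachMatch(text);
                } else {
                    target.teachNonMatch(text);
                }
                count++;
            }
        }
        return count;
    }
}
```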

Simple Semantics Resolution

Danny Ayers has come up with a very cool RSS Module proposal:
Simple Semantics Resolution. This
allows RSS 2.0 to support RDF (and all the cool semantic processing that
enables) by simply including a reference to an XSL stylesheet in the RSS. Of
course, then we'd need some end user tools that actually use some of the semantic
features in RDF, but that might be a chicken-and-egg kind of problem.
I'd love to hear what aggregators are using things like Dublin Core metadata for.
As far as I can see <dc:subject> might be used by
some, but that appears to be the extent of the usage.
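
To make the stylesheet idea a bit more concrete, here is a minimal sketch of the transformation step using the standard `javax.xml.transform` API. The stylesheet is a toy I made up for illustration (it turns each RSS item into an `rdf:Description` with a Dublin Core title) and is not the one from the proposal; the sketch also assumes the aggregator has already found and fetched the stylesheet the feed refers to.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RssToRdf {
    // Illustrative stylesheet only: one rdf:Description per RSS <item>.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'"
      + " xmlns:dc='http://purl.org/dc/elements/1.1/'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<rdf:RDF><xsl:apply-templates select='rss/channel/item'/></rdf:RDF>"
      + "</xsl:template>"
      + "<xsl:template match='item'>"
      + "<rdf:Description rdf:about='{link}'>"
      + "<dc:title><xsl:value-of select='title'/></dc:title>"
      + "</rdf:Description>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to an RSS document and returns the RDF/XML text. */
    public static String transform(String rssXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(rssXml)), new StreamResult(out));
        return out.toString();
    }
}
```

An aggregator that understands RDF could then work with the output and ignore the RSS structure entirely.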

Another Module I'm quite interested in is Easy News Topics from
Paolo Valdemarin and Matt Mower.
I need to do some more reading on their concept of a cloud, but I think what it
would allow is automatic blog classification by sites (say… JavaBlogs
using Classifier4J). Then, if aggregators supported it, it could allow
filtering based on topics, as defined by a particular cloud.