Java Vector Space Search and Latent Semantic Indexing

Ted Leung pointed at Latent Semantic Indexing today, which got me reading some papers.
The patent situation is unfortunate, because it is a pretty nice technique. Then, thanks
to Technorati, I found this, which led to Building a Vector Space Search Engine in Perl.

Now this isn't quite latent semantic indexing, but it uses some of the same techniques.
I'm not sure what the patent situation is here. It seems fairly trivial, but who knows? Either
way, the technique is really good for those times when you want to categorise text into
a number of potential categories, mainly because it isn't too resource-intensive.
I haven't measured it, but compared to Bayesian classification it looks like the algorithm
should be much quicker than even the best Bayesian implementation.

I figured that anything they can do in Perl, I can do in Java, so I'd like to present
my very ugly Java version.
This isn't nice code: it doesn't do stemming (note to self: look at using Lucene's
stemming code) and it uses doubles instead of BigDecimals (and/or BitSets),
but it appears to work. I haven't done vector math for a long time, so I might have screwed
that up somewhere, too.
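
For anyone who just wants the gist of the approach, here is a minimal sketch of the general vector space idea. It is not the linked code: the class name VectorSpaceSketch and the methods termVector, cosine, magnitude and classify are made up for illustration. It assumes plain term counts as vector weights, no stemming and no stop-word removal, and it uses doubles throughout.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: hypothetical names, raw term counts, no stemming.
class VectorSpaceSketch {

    // Turn text into a sparse term-count vector (word -> number of occurrences).
    static Map<String, Double> termVector(String text) {
        Map<String, Double> vector = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                vector.merge(token, 1.0, Double::sum);
            }
        }
        return vector;
    }

    // Euclidean length of a sparse vector.
    static double magnitude(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine of the angle between two vectors; 0.0 if either vector is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double norms = magnitude(a) * magnitude(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    // Pick the category whose sample text is closest to the input.
    // Returns null when no category shares any term with the input.
    static String classify(String text, Map<String, String> categorySamples) {
        Map<String, Double> query = termVector(text);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, String> e : categorySamples.entrySet()) {
            double score = cosine(query, termVector(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> samples = new HashMap<>();
        samples.put("pets", "cat dog pet animal");
        samples.put("code", "java perl code compiler");
        System.out.println(classify("my cat chased the dog", samples));
    }
}
```

Each category is just one vector built from its sample text, so classifying a document costs one cosine calculation per category. Stemming would slot in inside termVector, before the counts are updated.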

However, it's something I'm going to look at more in the future. I'll probably
build a classifier for Classifier4J based
on it, and compare it to my Bayesian classifier.

One thought on “Java Vector Space Search and Latent Semantic Indexing”

  1. hey can u please put your main.java file again..
    just needed to take a look at your code… it will be very useful for me.. thanks..
