The figure of 1.5 million “words”
I claimed JavaBlogs has needs to be clarified
slightly. They are “words” as defined by running String.split("\\W")
on all the archived posts. The \W regular expression is defined as “a
non-word character” – any character that is not in [a-zA-Z_0-9]. For normal
English sentences from a book that is probably a reasonable definition –
however, when used on blogs, where there are a large number of URLs, it doesn't
quite work. For instance, we suddenly find that “http” is one of the most
popular “words” in the English language. That's because all URLs are split
on their non-word characters – so http://www.javablogs.com is split into
“http”, “www”, “javablogs” & “com”. Also, dates like 2-May-2003 or
25/12/2002 are split on the “-” and “/” characters, so “2002” and “2003” are
very common words.
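To make the effect concrete, here is a minimal sketch of that tokenisation. The class and method names (NonWordSplitDemo, tokens) are made up for illustration, and dropping empty strings is an assumption on my part: split("\\W") splits on every single non-word character, so “://” leaves empty strings between its characters that have to be discarded somewhere.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NonWordSplitDemo {

    // Splits on every single non-word character, as String.split("\\W") does,
    // then drops the empty strings that adjacent separators leave behind.
    // (Dropping empties is an assumption, not something the original code is known to do.)
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<String>();
        for (String piece : text.split("\\W")) {
            if (!piece.isEmpty()) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The raw split keeps the empty strings between ':', '/' and '/'
        System.out.println(Arrays.toString("http://www.javablogs.com".split("\\W")));
        // prints [http, , , www, javablogs, com]

        System.out.println(tokens("http://www.javablogs.com")); // prints [http, www, javablogs, com]
        System.out.println(tokens("2-May-2003"));               // prints [2, May, 2003]
        System.out.println(tokens("25/12/2002"));               // prints [25, 12, 2002]
    }
}
```

Every URL therefore contributes one “http” to the count, and every date contributes its year as a separate “word”.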
My current thinking is to try splitting on “\s” instead, i.e. on whitespace.
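A rough sketch of what whitespace splitting could look like follows. The names (WhitespaceSplitDemo, splitOnWhitespace) are hypothetical, and I have used "\\s+" together with trim() rather than a bare "\\s", which is my own variation: a bare "\\s" produces empty strings wherever two whitespace characters sit next to each other.

```java
import java.util.Arrays;

public class WhitespaceSplitDemo {

    // Splits on runs of whitespace. trim() is applied first so that leading
    // whitespace does not produce an empty first token.
    static String[] splitOnWhitespace(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // URLs and dates now survive as single tokens
        System.out.println(Arrays.toString(
                splitOnWhitespace("See http://www.javablogs.com  on 2-May-2003")));
        // prints [See, http://www.javablogs.com, on, 2-May-2003]

        // The trade-off: punctuation stays attached to the neighbouring word
        System.out.println(Arrays.toString(splitOnWhitespace("Hello, world.")));
        // prints [Hello,, world.]
    }
}
```

URLs and dates stay in one piece this way, but “Hello,” and “Hello” would be counted as different words, so some stripping of leading and trailing punctuation would probably still be needed.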