All posts by Nick Lothian

Agro the Aggregator: now with added Goodness (Google Maps and Findory)

Agro the Aggregator now includes information from Findory, and it will display relevant satellite pictures
for a limited number of cities thanks to Google's new Maps API.

The Findory support works in a similar way to the previous Yahoo news support (extract keywords and pass them
to the Findory API), but the Google API is somewhat different.

Currently it's quite limited: it scans through a list of known city names, and then allows the user
to display the satellite pictures from the first one found (even if other locations are also mentioned in the item).
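
For the curious, the scan really is as naive as it sounds. Here is a minimal sketch of the idea, assuming nothing about Agro's real code (the class and method names are mine, and the array stands in for the full city list below):

    public class CityScanner {

        // Illustrative subset of the full city list shown below
        private static final String[] CITIES = {
            "Aberdeen", "Adelaide", "Albuquerque", "London", "Seattle"
        };

        /**
         * Returns the first known city mentioned in the item text, or null
         * if none is found. Later mentions are ignored, which is exactly
         * the limitation described above.
         */
        public static String firstCityIn(String itemText) {
            String lower = itemText.toLowerCase();
            for (int i = 0; i < CITIES.length; i++) {
                if (lower.indexOf(CITIES[i].toLowerCase()) != -1) {
                    return CITIES[i];
                }
            }
            return null;
        }
    }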

For instance, this post will be aggregated via the JavaBlogs
and PlanetApache aggregator views, and will
(hopefully) let you see a photo of Aberdeen, since it is the first city matched, even though all of the following cities are supported:

  • Aberdeen
  • Adelaide
  • Albuquerque
  • Algiers
  • Amarillo
  • Amsterdam
  • Anchorage
  • Ankara
  • Asuncion
  • Athens
  • Atlanta
  • Auckland
  • Austin
  • Baghdad
  • Gaza
  • London
  • Madrid
  • Moscow
  • Seattle
  • Washington

Generally speaking, looking at the BBC Feed should find you something that mentions one of these cities.

It's still Firefox only, though!

Another nice project would be to reverse the UI on this: display a map of the world, with location-sensitive headlines shown at the
correct locations. I'm sure fame (if not fortune) awaits the first to implement that one…

Search Engine Indexing Speed

Tristan Louis has written a couple of articles on the number of hits for various bloggers in three search engines: Google, MSN and Technorati. See http://www.tnl.net/blog/entry/Secrets_of_the_A-list_bloggers:_Technorati_vs._Google and http://www.tnl.net/blog/entry/Technorati_Yahoo_and_Google_Too.

A number of people have pointed out that there are problems with his methodology and the aim of the experiment itself. Tim Bray says it well: “Almost all the modern engines do a pretty damn good job of getting you something appropriate and useful in the first handful of results. Who cares about the next million?”, but if you want all the details of what is wrong with this study, see Danny Sullivan's post.

Anyway, I'm interested in search engine comparisons, but right now I'm more interested in how fast things get into the index than in how many million results something returns, so over the last couple of days I conducted a small experiment.

Firstly, I posted a blog post entitled “Agro the Aggregator”, and then, about 12 hours later, I used my Argos search engine library to poll six search engines every half hour with the query “Agro the Aggregator” for 19 hours. I then counted the results by iterating over all of them (ie, the links were counted manually, without relying on the “result count” returned by the search engines, which can be inaccurate).
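
In outline, the polling loop was no more complicated than the sketch below. Note that the SearchEngine interface is a hypothetical stand-in for Argos' real API, and loadEngines() is deliberately left unwired:

    import java.util.Timer;
    import java.util.TimerTask;

    // Hypothetical stand-in for the Argos search engine abstraction
    interface SearchEngine {
        String getName();
        // One page of result links for the query (empty when exhausted)
        String[] getResults(String query, int page);
    }

    public class IndexSpeedPoller {

        // Count hits by walking every result page and counting the links,
        // rather than trusting the engine's reported "result count"
        static int countAllResults(SearchEngine engine, String query) {
            int count = 0;
            for (int page = 0; ; page++) {
                String[] links = engine.getResults(query, page);
                if (links.length == 0) {
                    break;
                }
                count += links.length;
            }
            return count;
        }

        public static void main(String[] args) {
            final String query = "\"Agro the Aggregator\"";
            final SearchEngine[] engines = loadEngines(); // six engines in the real run
            Timer timer = new Timer();
            timer.scheduleAtFixedRate(new TimerTask() {
                private int polls = 0;

                public void run() {
                    if (++polls > 38) { // 19 hours at half-hour intervals
                        cancel();
                        return;
                    }
                    for (int i = 0; i < engines.length; i++) {
                        System.out.println(engines[i].getName() + ": "
                                + countAllResults(engines[i], query));
                    }
                }
            }, 0, 30 * 60 * 1000L);
        }

        static SearchEngine[] loadEngines() {
            return new SearchEngine[0]; // wire the six real engines up here
        }
    }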

Unfortunately I started the experiment too late to catch which engine found a result first, but Blogdigger, Google and Yahoo all had results by the time I started searching.

However, the results do show the following:

  • Google finds the most results, although the number fluctuates. I could not replicate the drop back to 2 hits using a manual search, so it is possible that this is an artifact of using the Google API. In the manual search, Google also correctly identifies a number of these 16 posts as duplicate content (ie, my blog post re-aggregated).
  • Blogdigger returned results the quickest of the three specialist blog search engines (Blogdigger, Feedster and Technorati), despite the fact that Technorati was pinged directly when the post was published. I suspect this may have something to do with Blogdigger's use of the FeedMesh to find new posts quickly.

[Chart: hits for “Agro the Aggregator” in each search engine over time]

Agro the Aggregator

Agro the Aggregator is my experimental proof-of-concept
syndication client. Basically, I'm not happy with the web-based aggregators currently available, and Agro lets me
experiment with various ideas I have for improving them. (Note that this was 3 or 4 days of after-work hacking,
so don't expect the world…)

In particular, I'm interested in using textual analysis to help the user find the information
that interests them as quickly as possible.

I'm also interested in using some of the modern web-as-a-platform web services to proactively
gather and present related information to the user.

Currently, Agro uses Classifier4J to allow the user to filter sites by topic (currently
limited to “Java programming” and “US Politics”). It also uses Classifier4J to extract keywords
for each item, and then some AJAX techniques (using DWR) to retrieve related news from Yahoo.
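
The topic-filtering side looks roughly like the sketch below. This is only an illustration: the search word and item text are invented, and I'm writing the Classifier4J calls from memory, so treat the exact method names as approximate:

    import net.sf.classifier4j.SimpleClassifier;

    public class TopicFilterSketch {

        public static void main(String[] args) throws Exception {
            // SimpleClassifier scores text by the presence of a search word;
            // isMatch() applies the classifier's default cutoff
            SimpleClassifier classifier = new SimpleClassifier();
            classifier.setSearchWord("java");

            String item = "New Java syndication library released today";
            if (classifier.isMatch(item)) {
                System.out.println("File under the Java programming topic");
            }
        }
    }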

Agro the Aggregator is ugly in all browsers (I hate CSS), but works best in Firefox (actually I haven't
bothered to look at it in IE).

Some of the software used to build Agro the Aggregator includes:

  • Classifier4J (topic filtering and keyword extraction)
  • DWR (the AJAX plumbing)

Agro currently aggregates the following sites (click the link to see the aggregated view):

  • JavaBlogs
  • PlanetApache
  • The BBC news feed

Google's Sawzall

Google has a new paper out: Interpreting the Data: Parallel Analysis with Sawzall. It discusses a custom interpreted “little language” called Sawzall, which Google uses for much of its data processing on top of their Map/Reduce infrastructure.

As seems typical for Google, one of the most impressive things is the numbers:

One measure of Sawzall’s utility is how much data processing it does. We monitored its use during
the month of March 2005. During that time, on one dedicated Workqueue cluster with 1500 Xeon
CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While
running those jobs, 18,636 failures occurred (application failure, network outage, system crash,
etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2×10¹⁵ bytes
of data (2.8PB) and wrote 9.9×10¹² bytes (9.3TB) (demonstrating that the term “data reduction”
has some resonance). The average job therefore processed about 100GB. The jobs collectively
consumed almost exactly one machine-century.

You know you have a serious amount of data when 9.3 TBs is your reduced dataset!

Another interesting thing was its error handling:

Sawzall therefore provides a mode, set by a run-time flag, that changes the default behavior of
undefined values. Normally, using an undefined value (other than in an initialization or def() test)
will terminate the program with an error report. When the run-time flag is set, however, Sawzall
simply elides all statements that depend on the undefined value. For the corrupted record, it's as
though the elements of the calculation that depended on the bad value were temporarily removed
from the program. Any time such an elision occurs, the run-time will record the circumstances in a
special pre-defined collection table that serves as a log of the errors. When the run completes
the user may check whether the error rate was low enough to accept the results that remain.
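
Sawzall does this at the language level, but the shape of the idea carries over to any per-record pipeline. Here is a rough Java analogue, entirely my own sketch rather than anything from the paper:

    import java.util.HashMap;
    import java.util.Map;

    public class ElidingPipeline {

        // A stand-in for the "pre-defined collection table" that logs elisions
        private final Map errorTable = new HashMap();

        private long processed = 0;
        private long elided = 0;
        private double total = 0; // a stand-in output table

        // Process one record, skipping (not failing) on bad values
        void process(String record) {
            processed++;
            try {
                double value = Double.parseDouble(record);
                total += value; // "emit" to the output table
            } catch (NumberFormatException undefined) {
                // Elide this record's contribution and log the circumstances
                elided++;
                Long n = (Long) errorTable.get(record);
                errorTable.put(record, new Long(n == null ? 1 : n.longValue() + 1));
            }
        }

        // After the run, check whether the error rate was low enough
        // to accept the results that remain
        boolean acceptable(double maxErrorRate) {
            return processed > 0 && ((double) elided / processed) <= maxErrorRate;
        }
    }

The point is the same as the paper's: a corrupt record costs a logged elision rather than the whole job, and the error table tells you afterwards whether the damage was acceptable.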

and (I actually laughed out loud at this):

This is an unusual way to treat errors, but it is very convenient in practice. The idea is related
to some independent work by Rinard et al. [14] in which the gcc C compiler was modified to
generate code that protected against errors. In that compiler, if a program indexes off the end
of the array, the generated code will make up values and the program can continue obliviously.
This peculiar behavior has the empirical property of making programs like web servers much more
robust against failure, even in the face of malicious attacks.

Anyway, as I've previously suggested, when processing this quantity of data it makes a lot of sense to move the code as close to the data as possible rather than transmitting the data across the network. Google has used this technique with Map/Reduce, GFS and now Sawzall to make it a reality.

On Java String Concatenation

Most Java programmers know to use a StringBuffer (or a JDK 1.5 StringBuilder) when concatenating Strings in Java,
but it occurred to me that the compiler might be able to optimize some of these calls.

For instance, look at the following code:

    StringBuffer buf = new StringBuffer();
    buf.append("one" + "two");    


There is an obvious optimization there: combine “one” and “two” into a single String before appending them to the
StringBuffer.

Sure enough:

   L0 (0)
    NEW StringBuffer
    DUP
    INVOKESPECIAL StringBuffer.<init>() : void
    ASTORE 0: buf
   L1 (5)
    ALOAD 0: buf
    LDC "onetwo"
    INVOKEVIRTUAL StringBuffer.append(String) : StringBuffer
    POP

Note the LDC "onetwo" line – the compiler has combined the two strings in the bytecode itself.

However, the following code:

    String one = "str1";
    StringBuffer buf = new StringBuffer();
    buf.append(one + "two");    


gives:

   L0 (0)
    LDC "str1"
    ASTORE 0: one
   L1 (3)
    NEW StringBuffer
    DUP
    INVOKESPECIAL StringBuffer.<init>() : void
    ASTORE 1: buf
   L2 (8)
    ALOAD 1: buf
    NEW StringBuffer
    DUP
    ALOAD 0: one
    INVOKESTATIC String.valueOf(Object) : String
    INVOKESPECIAL StringBuffer.<init>(String) : void
    LDC "two"
    INVOKEVIRTUAL StringBuffer.append(String) : StringBuffer
    INVOKEVIRTUAL StringBuffer.toString() : String
    INVOKEVIRTUAL StringBuffer.append(String) : StringBuffer
    POP


Hmm – that isn't good at all. The code one + "two" causes the compiler to create a new StringBuffer, concatenate
the two Strings, call toString() on the temporary StringBuffer and append the result to the original StringBuffer. It looks like that
code should instead be written as:

    String one = "str1";
    StringBuffer buf = new StringBuffer();
    buf.append(one);
    buf.append("two");   


which gives:

   L0 (0)
    LDC "str1"
    ASTORE 0: one
   L1 (3)
    NEW StringBuffer
    DUP
    INVOKESPECIAL StringBuffer.<init>() : void
    ASTORE 1: buf
   L2 (8)
    ALOAD 1: buf
    ALOAD 0: one
    INVOKEVIRTUAL StringBuffer.append(String) : StringBuffer
    POP
   L3 (13)
    ALOAD 1: buf
    LDC "two"
    INVOKEVIRTUAL StringBuffer.append(String) : StringBuffer
    POP


Much better!

There is one final (pun intended) change that I found interesting:

    final String one = "str1";
    StringBuffer buf = new StringBuffer();
    buf.append(one + "two");


gives:

   L0 (0)
    LDC "str1"
    ASTORE 0: one
   L1 (3)
    NEW StringBuffer
    DUP
    INVOKESPECIAL StringBuffer.<init>() : void
    ASTORE 1: buf
   L2 (8)
    ALOAD 1: buf
    LDC "str1two"
    INVOKEVIRTUAL StringBuffer.append(String) : StringBuffer
    POP


Nice!

Obviously most of this is pretty logical and well known by most Java programmers. I found it interesting to see exactly what
optimizations the compiler (as opposed to the JVM) is doing, though. (Note that these experiments were done using the
Eclipse 3.0 JDK 1.4 compiler. Other compilers' optimizations may vary.)

Cute way to convert an int to a String

I picked up a nice little tip for easy conversion between an int and a String. Previously I'd always done something like


int one = 1;
String str = String.valueOf(one);

The alternative is


int one = 1;
String str = one + "";

From the bytecode point of view it isn't quite as efficient:


   L0 (0)
    ICONST_1
    ISTORE 1: one
   L1 (3)
    ILOAD 1: one
    INVOKESTATIC String.valueOf(int) : String
    ASTORE 2: str
   L2 (7)
    RETURN
   L3 (9)

vs


   L0 (0)
    ICONST_1
    ISTORE 1: one
   L1 (3)
    NEW StringBuffer
    DUP
    ILOAD 1: one
    INVOKESTATIC String.valueOf(int) : String
    INVOKESPECIAL StringBuffer.<init>(String) : void
    INVOKEVIRTUAL StringBuffer.toString() : String
    ASTORE 2: str
   L2 (11)
    RETURN
   L3 (13)

All the same, I'd never thought of using autoconversion to convert to a String like that before. (Note that this doesn't rely on JDK 1.5 autoboxing.)

Woken Furies mini-review

I’ve just finished Richard Morgan’s “Woken Furies”. This is the third book in the Takeshi Kovacs series (the previous two books were “Altered Carbon” and “Broken Angels”). In this book we see Kovacs back on his home planet of Harlan’s World, dealing with assorted bits of intelligent military hardware, religious fanatics, gangsters, the immortal First Families, a couple of Envoys and the ghost of Quellcrist Falconer.

I think this is perhaps the least shockingly violent of Morgan’s four books (“Market Forces” being the one book outside the Kovacs series). However, it still manages to make something like the “dark” Revenge of the Sith look like a children’s Christmas carol. After all, this is a Richard Morgan book, so the body count is high and the violence is extreme. If Kovacs appears a little less morally ambivalent than in previous books, it is only because the justification for some of his more extreme behaviour is explained in more detail than before.

As we have come to expect from his previous books, the story is very fast paced. My impression was that it seemed shorter than “Altered Carbon” or “Broken Angels”, although this doesn’t appear to be true in reality – Amazon says it is 484 pages while “Broken Angels” was 400 pages. Perhaps it is that “Woken Furies” doesn’t create quite as deep a universe as the previous books did. For instance, while “Altered Carbon” added depth to the storyline by drawing on prior events and “Broken Angels” explored the Martian civilization, “Woken Furies” has little back-story that isn’t directly related to the plot.

Morgan hasn’t become any less imaginative, though. The use of diseases as a substitute for recreational drugs is a device I have never come across before, and the evolving abandoned military machines were also unique.

Overall, I found “Woken Furies” an enjoyable read, but not quite to the same amazing level as “Altered Carbon”. I’d give it 4 stars.