It seems that blog search engines have trouble detecting duplicate feeds (and entries). It might be worth investigating Bloom filters as a way to detect those duplicates. If I get time I'll try to do a demo.
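A Bloom filter gives a compact, probabilistic "have I seen this entry before?" test, at the cost of occasional false positives. A minimal sketch of the idea (the hash scheme and sizes here are illustrative, not tuned for real feed volumes):

```java
import java.util.BitSet;

// Toy Bloom filter for detecting already-seen feed entries.
// A real implementation would derive the bit-array size and number of
// hash functions from the expected entry count and target false
// positive rate; the values used here are illustrative only.
public class EntryBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    public EntryBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive one of k hash values from the entry's identifying text
    // (e.g. its permalink, or a digest of title + content).
    private int hash(String key, int seed) {
        int h = seed;
        for (int i = 0; i < key.length(); i++) {
            h = h * 31 + key.charAt(i);
        }
        return (h & 0x7fffffff) % size;
    }

    public void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(hash(key, i));
        }
    }

    // true means "probably seen before" (false positives possible);
    // false means "definitely not seen".
    public boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(hash(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

An aggregator would add each entry's identifier as it is indexed and skip entries the filter already claims to contain, accepting a small false-positive rate in exchange for constant memory.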
Search Engine Indexing Speed
Tristan Louis has written a couple of articles on the number of hits for various bloggers in three search engines: Google, MSN and Technorati. See http://www.tnl.net/blog/entry/Secrets_of_the_A-list_bloggers:_Technorati_vs._Google and http://www.tnl.net/blog/entry/Technorati_Yahoo_and_Google_Too.
A number of people have pointed out that there are problems with his methodology and the aim of the experiment itself. Tim Bray says it well: “Almost all the modern engines do a pretty damn good job of getting you something appropriate and useful in the first handful of results. Who cares about the next million?”, but if you want all the details of what is wrong with this study, see Danny Sullivan's post.
Anyway, I'm interested in search engine comparisons, but right now I'm more interested in how quickly things get into the index than in how many million results something returns, so over the last couple of days I conducted a small experiment.
Firstly, I posted a blog post entitled “Agro the Aggregator”, and then, about 12 hours later, I used my Argos search engine library to poll six search engines every half hour with the query “Agro the Aggregator” for 19 hours. I then counted the results by iterating over them all (i.e., the links were counted manually rather than relying on the “result count” returned by the search engines, which can be inaccurate).
Unfortunately I started the experiment too late to catch which engine found a result first, but Blogdigger, Google and Yahoo all had results by the time I started searching.
However, the results do show the following:
- Google finds the most results, although they fluctuate. I could not replicate the drop back to 2 hits using a manual search, so it is possible that this is an artifact of using the Google API. In the manual search, Google also correctly identifies a number of these 16 posts as duplicate content (i.e., my blog post re-aggregated).
- Blogdigger returned results the quickest of the three specialist blog search engines (Blogdigger, Feedster and Technorati), despite the fact that Technorati was pinged directly when the post was published. I suspect this may have something to do with Blogdigger's use of the FeedMesh to find new posts quickly.
Agro the Aggregator
Agro the Aggregator is my experimental proof-of-concept syndication client. Basically, I'm not happy with the web-based aggregators currently available, and Agro lets me experiment with various ideas I have for improving them. (Note that this was 3 or 4 days of after-work hacking, so don't expect the world…)
In particular, I'm interested in using textual analysis to help the user find the information
that interests them as quickly as possible.
I'm also interested in using some of the modern web-as-a-platform web services to proactively
gather and present related information to the user.
Currently, Agro uses Classifier4J to let the user filter sites by topic (currently
limited to “Java programming” and “US Politics”). It also uses Classifier4J to extract keywords
for each item, and then some AJAX techniques (using DWR) to retrieve related news from Yahoo.
Agro the Aggregator is ugly in all browsers (I hate CSS), but works best in Firefox (actually I haven't
bothered to look at it in IE).
Some of the software used to build Agro the Aggregator includes:
Agro currently aggregates the following sites (click the link to see the aggregated view):
Google Maps now supports Australia
Google Maps now supports Australia and includes Satellite pictures of most cities (including Adelaide).
Google's Sawzall
Google has a new paper out: Interpreting the Data: Parallel Analysis with Sawzall. It discusses a custom interpreted “little language” called Sawzall, which Google uses for much of its data processing on top of their Map/Reduce infrastructure.
As seems typical for Google, one of the most impressive things is the numbers:
One measure of Sawzall’s utility is how much data processing it does. We monitored its use during
the month of March 2005. During that time, on one dedicated Workqueue cluster with 1500 Xeon
CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While
running those jobs, 18,636 failures occurred (application failure, network outage, system crash,
etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2×10^15 bytes
of data (2.8PB) and wrote 9.9×10^12 bytes (9.3TB) (demonstrating that the term “data reduction”
has some resonance). The average job therefore processed about 100GB. The jobs collectively
consumed almost exactly one machine-century.
You know you have a serious amount of data when 9.3 TBs is your reduced dataset!
Another interesting thing was its error handling:
Sawzall therefore provides a mode, set by a run-time flag, that changes the default behavior of
undefined values. Normally, using an undefined value (other than in an initialization or def() test)
will terminate the program with an error report. When the run-time flag is set, however, Sawzall
simply elides all statements that depend on the undefined value. For the corrupted record, it's as
though the elements of the calculation that depended on the bad value were temporarily removed
from the program. Any time such an elision occurs, the run-time will record the circumstances in a
special pre-defined collection table that serves as a log of the errors. When the run completes
the user may check whether the error rate was low enough to accept the results that remain.
and (I actually laughed out loud at this):
This is an unusual way to treat errors, but it is very convenient in practice. The idea is related
to some independent work by Rinard et al. [14] in which the gcc C compiler was modified to
generate code that protected against errors. In that compiler, if a program indexes off the end
of the array, the generated code will make up values and the program can continue obliviously.
This peculiar behavior has the empirical property of making programs like web servers much more
robust against failure, even in the face of malicious…
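Sawzall's elide-and-log mode can be approximated in other languages by skipping records that fail, logging the failures, and checking the error rate once the run completes. A hedged Java sketch of that pattern (the record format and the error-rate check are invented for illustration, not taken from the paper):

```java
import java.util.ArrayList;
import java.util.List;

// Rough Java analogue of Sawzall's "elide and log" mode: rather than
// aborting on the first corrupted record, skip it, record the failure
// in an error log, and let the caller decide afterwards whether the
// error rate was low enough to trust the surviving results.
public class LenientProcessor {
    public static class RunResult {
        public final List<Integer> values = new ArrayList<>();
        public final List<String> errorLog = new ArrayList<>();

        public double errorRate(int totalRecords) {
            return totalRecords == 0 ? 0.0
                    : (double) errorLog.size() / totalRecords;
        }
    }

    // Parse each record as an integer; corrupt records are elided
    // and logged instead of terminating the run.
    public static RunResult run(List<String> records) {
        RunResult result = new RunResult();
        for (String record : records) {
            try {
                result.values.add(Integer.parseInt(record));
            } catch (NumberFormatException e) {
                result.errorLog.add("bad record: " + record);
            }
        }
        return result;
    }
}
```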
Anyway, as I've previously suggested, when processing this quantity of data it makes a lot of sense to move the code as close to the data as possible rather than transmit the data across the network. Google has used this technique with Map/Reduce, GFS and now Sawzall to make it a reality.
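The shape of that pipeline can be sketched in miniature: a map phase that runs per record (and, in a real cluster, runs on the machine holding that record's slice of the data) and a reduce phase that aggregates the emitted values. This toy in-process word count illustrates only the programming model, not the distribution:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-process sketch of the map/reduce shape Sawzall sits on
// top of: "map" over each record, emitting (word, 1) pairs, and
// "reduce" by summing the values per key.
public class WordCount {
    public static Map<String, Integer> mapReduce(List<String> records) {
        Map<String, Integer> counts = new HashMap<>();
        for (String record : records) {            // map phase: per record
            for (String word : record.split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // reduce: sum by key
                }
            }
        }
        return counts;
    }
}
```

In the real systems the map tasks are scheduled next to the data (GFS chunks) and the reduce output is written back to the distributed filesystem, which is what makes petabyte-scale runs practical.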
On Java String Concatenation
Most Java programmers know to use a StringBuffer (or a JDK 1.5 StringBuilder) when concatenating Strings in Java,
but it occurred to me that the compiler might be able to optimize some of these calls.
For instance, look at the following code:
StringBuffer buf = new StringBuffer();
buf.append("one" + "two");
There is an obvious optimization there: combine “one” and “two” into a single String before appending it to the
StringBuffer.
Sure enough:
L0 (0)
  NEW StringBuffer
  DUP
  INVOKESPECIAL StringBuffer.<init>() : void
  ASTORE 0: buf
L1 (5)
  ALOAD 0: buf
  LDC "onetwo"
  INVOKEVIRTUAL StringBuffer.append(String) : StringBuffer
  POP
Note the LDC "onetwo" line – the compiler has combined the two strings in the bytecode itself.
However, the following code:
String one = "str1";
StringBuffer buf = new StringBuffer();
buf.append(one + "two");
gives:
L0 (0)
  LDC "str1"
  ASTORE 0: one
L1 (3)
  NEW StringBuffer
  DUP
  INVOKESPECIAL StringBuffer.<init>() : void
  ASTORE 1: buf
L2 (8)
  ALOAD 1: buf
  NEW StringBuffer
  DUP
  ALOAD 0: one
  INVOKESTATIC String.valueOf(Object) : String
  INVOKESPECIAL StringBuffer.<init>(String) : void
  LDC "two"
  INVOKEVIRTUAL StringBuffer.append(String) : StringBuffer
  INVOKEVIRTUAL StringBuffer.toString() : String
  INVOKEVIRTUAL StringBuffer.append(String) : StringBuffer
  POP
Hmm – that isn't good at all. The code one + "two" causes the compiler to create a new StringBuffer, concatenate
the two Strings, call toString on the temporary StringBuffer, and append the result to the original StringBuffer. Looks like it
should be written:
String one = "str1";
StringBuffer buf = new StringBuffer();
buf.append(one);
buf.append("two");
which gives:
L0 (0)
  LDC "str1"
  ASTORE 0: one
L1 (3)
  NEW StringBuffer
  DUP
  INVOKESPECIAL StringBuffer.<init>() : void
  ASTORE 1: buf
L2 (8)
  ALOAD 1: buf
  ALOAD 0: one
  INVOKEVIRTUAL StringBuffer.append(String) : StringBuffer
  POP
L3 (13)
  ALOAD 1: buf
  LDC "two"
  INVOKEVIRTUAL StringBuffer.append(String) : StringBuffer
  POP
Much better!
There is one final (pun intended) change that I found interesting:
final String one = "str1";
StringBuffer buf = new StringBuffer();
buf.append(one + "two");
gives:
L0 (0)
  LDC "str1"
  ASTORE 0: one
L1 (3)
  NEW StringBuffer
  DUP
  INVOKESPECIAL StringBuffer.<init>() : void
  ASTORE 1: buf
L2 (8)
  ALOAD 1: buf
  LDC "str1two"
  INVOKEVIRTUAL StringBuffer.append(String) : StringBuffer
  POP
Nice!
Obviously most of this is pretty logical and well known to most Java programmers. I found it interesting to see exactly what
optimizations the compiler (as opposed to the JVM) is doing, though. (Note that these experiments were done using the
Eclipse 3.0 JDK 1.4 compiler. Other compilers' optimizations may vary.)
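That folding is also observable at runtime through string interning: a concatenation of compile-time constants compiles to a single interned literal, while a concatenation involving a non-constant builds a fresh String object. A small check (the behaviour follows from the Java Language Specification's constant-expression rules, so it shouldn't depend on which compiler is used):

```java
// Demonstrates javac's compile-time constant folding via interning:
// a concatenation of compile-time constants becomes the interned
// literal, while a runtime concatenation yields a distinct object.
public class ConstantFolding {
    // 'one' is a constant variable, so one + "two" is a constant
    // expression and folds to the interned literal "str1two".
    public static boolean foldedSameObject() {
        final String one = "str1";
        String folded = one + "two";
        return folded == "str1two";   // same interned object
    }

    // Without final, the concatenation happens at runtime and
    // produces a new (non-interned) String.
    public static boolean runtimeSameObject() {
        String one = "str1";
        String concat = one + "two";
        return concat == "str1two";   // distinct object
    }
}
```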
Cute way to convert an int to a String
I picked up a nice little tip for easy conversion between an int and a String. Previously I'd always done something like
int one = 1;
String str = String.valueOf(one);
The alternative is
int one = 1;
String str = one + "";
From the bytecode point of view it isn't quite as efficient:
L0 (0)
  ICONST_1
  ISTORE 1: one
L1 (3)
  ILOAD 1: one
  INVOKESTATIC String.valueOf(int) : String
  ASTORE 2: str
L2 (7)
  RETURN
L3 (9)
vs
L0 (0)
  ICONST_1
  ISTORE 1: one
L1 (3)
  NEW StringBuffer
  DUP
  ILOAD 1: one
  INVOKESTATIC String.valueOf(int) : String
  INVOKESPECIAL StringBuffer.<init>(String) : void
  INVOKEVIRTUAL StringBuffer.toString() : String
  ASTORE 2: str
L2 (11)
  RETURN
L3 (13)
All the same, I'd never thought of using auto-conversion to convert to a String like that before. (Note this doesn't rely on JDK 1.5 autoboxing.)
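Both forms produce equal String values, of course; the difference is only the temporary StringBuffer the concatenation compiles to. A small sketch wrapping the two conversions side by side:

```java
// The two int-to-String conversions compared above: both yield the
// same value, but viaValueOf avoids the temporary StringBuffer that
// the concatenation form compiles to.
public class IntToString {
    public static String viaValueOf(int n) {
        return String.valueOf(n);
    }

    public static String viaConcat(int n) {
        return n + "";
    }
}
```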
Software Optimization
Over on my Apache Harmony blog I've just pointed at an excellent presentation on Software Optimization and Virtual Machines.
One thing that I think a wide audience might be interested in is this excerpt, from a report by Sevitzky, Mitchell and Srinivasan:
J2EE benchmark creates 10 business objects (w/6 fields) from a SOAP message of bytes. 10,953 calls, 1,492 objects created
Modern JVMs are amazing creations!
No more comments
Due to a flood of comment spam I've had to turn off comments.
The embarrassment that is modern tech journalism
Paul Murphy has written a blog entry entitled Microsoft to buy Red Hat? Say it ain’t so. Ignoring the blatant speculation in that post, there are so many factual errors and obvious mistakes in the analysis that it appears he knows absolutely nothing about what he is talking about.
Consider this gem:
The biggest threat Red Hat faces right now is that IBM could settle with SCO and then release its own Linux along with workstations and servers based on the Cell processor.
Consider the way the SCO case is currently positioned. It looks like IBM will win and probably be awarded damages, and SCO will probably be delisted and (I suspect) wound up as a company, since their entire business model rests on winning that case. Why exactly would IBM consider settling the case now if they didn't earlier?
Then there is this:
With SuSe essentially out of the picture, Linspire in a world by itself, and Debian not getting the press it deserves, such a move by IBM would leave Red Hat with nowhere to go except a suicidal head-to-head competition with Microsoft in the x86 marketplace. Given that Cell outperforms x86 by an order of magnitude and doesn’t have the security weaknesses built into the x86, this would leave them fighting to hold an ever decreasing share of a shrinking market.
Geeze – Intel & AMD had better give up now! They have no hope against the magic of the Cell processor! Of course, there is the small problem of the Cell requiring entirely new programming techniques to get the best out of it – but I'm guessing Murphy didn't understand that.
Finally there is this bit of logic:
Getting acquired therefore makes sense as Red Hat’s Plan B – but Microsoft’s Plan B has traditionally been Plan A delayed a few years and I can see no reasonable business scenario under which the acquisition makes sense for them.
If I understand that bit of logic correctly, I think he's giving himself an excuse to use when Microsoft doesn't buy Red Hat. I think that might be the smartest bit of work he did in that piece…