Google has a new paper out: Interpreting the Data: Parallel Analysis with Sawzall. It discusses a custom interpreted “little language” called Sawzall which Google uses for much of its data processing on top of its Map/Reduce infrastructure.
As seems typical for Google, one of the most impressive things is the numbers:
One measure of Sawzall’s utility is how much data processing it does. We monitored its use during
the month of March 2005. During that time, on one dedicated Workqueue cluster with 1500 Xeon
CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While
running those jobs, 18,636 failures occurred (application failure, network outage, system crash,
etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2×10¹⁵ bytes
of data (2.8PB) and wrote 9.9×10¹² bytes (9.3TB) (demonstrating that the term “data reduction”
has some resonance). The average job therefore processed about 100GB. The jobs collectively
consumed almost exactly one machine-century.
You know you have a serious amount of data when 9.3TB is your reduced dataset!
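The back-of-envelope arithmetic on those figures is fun: 3.2×10¹⁵ bytes over 32,580 jobs is indeed roughly 98GB each, the reduction from 2.8PB in to 9.3TB out is a factor of about 300, and a machine-century (roughly 876,000 machine-hours) works out to about 27 machine-hours per job, or a little over seven minutes of wall-clock time across each job's 220 machines.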
Another interesting thing was its error handling:
Sawzall therefore provides a mode, set by a run-time flag, that changes the default behavior of
undefined values. Normally, using an undefined value (other than in an initialization or def() test)
will terminate the program with an error report. When the run-time flag is set, however, Sawzall
simply elides all statements that depend on the undefined value. For the corrupted record, it's as
though the elements of the calculation that depended on the bad value were temporarily removed
from the program. Any time such an elision occurs, the run-time will record the circumstances in a
special pre-defined collection table that serves as a log of the errors. When the run completes
the user may check whether the error rate was low enough to accept the results that remain.
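To make those semantics concrete, here is a minimal sketch of the idea in Go rather than Sawzall (skipUndefined, getField, and errorLog are hypothetical stand-ins for the run-time flag, an undefined-value access, and the pre-defined error collection table; this is an illustration, not Google's implementation):

```go
package main

import (
	"errors"
	"fmt"
)

// Field lookup on a record can fail, yielding an "undefined" value,
// e.g. when the input record is corrupted.
var errUndefined = errors.New("undefined value")

type record map[string]float64

func getField(r record, name string) (float64, error) {
	v, ok := r[name]
	if !ok {
		return 0, errUndefined
	}
	return v, nil
}

func main() {
	skipUndefined := true // stand-in for Sawzall's run-time flag
	records := []record{{"latency": 12.5}, {}, {"latency": 8.0}}

	var total float64
	var errorLog []string // stand-in for the pre-defined error collection table

	for i, r := range records {
		v, err := getField(r, "latency")
		if err != nil {
			if !skipUndefined {
				panic(err) // default mode: terminate with an error report
			}
			// Flag set: elide the dependent statement and log the circumstances.
			errorLog = append(errorLog, fmt.Sprintf("record %d: %v", i, err))
			continue
		}
		total += v // the statement that depends on the value
	}

	fmt.Printf("total=%g, errors=%d of %d records\n", total, len(errorLog), len(records))
	// The user can now judge whether the error rate is low enough
	// to accept the results that remain.
}
```

Over billions of records a handful of corrupted inputs shouldn't kill a long-running job, so long as the error rate is surfaced for the user to judge afterwards.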
and (I actually laughed out loud at this):
This is an unusual way to treat errors, but it is very convenient in practice. The idea is related
to some independent work by Rinard et al. [14] in which the gcc C compiler was modified to
generate code that protected against errors. In that compiler, if a program indexes off the end
of the array, the generated code will make up values and the program can continue obliviously.
This peculiar behavior has the empirical property of making programs like web servers much more
robust against failure, even in the face of malicious attacks.
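The Rinard et al. technique (failure-oblivious computing) is easy to sketch. Staying in Go rather than their modified gcc, the idea is that an out-of-bounds read manufactures a value instead of faulting (obliviousRead is a hypothetical name; their compiler performed this transformation automatically for C programs):

```go
package main

import "fmt"

// obliviousRead returns a[i] for valid indexes, and a made-up value
// for out-of-bounds ones, so the program continues obliviously
// instead of crashing.
func obliviousRead(a []int, i int) int {
	if i < 0 || i >= len(a) {
		return 0 // fabricated value; a real implementation may vary the values returned
	}
	return a[i]
}

func main() {
	data := []int{10, 20, 30}
	fmt.Println(obliviousRead(data, 1))  // 20
	fmt.Println(obliviousRead(data, 99)) // 0, instead of a panic
}
```

The made-up values are usually wrong, of course, but for request-oriented programs like the web servers the quote mentions, the damage tends to stay confined to the one request that hit the bad index.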
Anyway, as I've previously suggested, when processing this quantity of data it makes a lot of sense to move the code as close to the data as possible rather than transmitting the data across the network. Google has used this technique in Map/Reduce, GFS, and now Sawzall to make it a reality.