Wednesday, December 29, 2010

Incrementing Hadoop counters in Sizzle

Inspired by this post describing how to increment Hadoop counters from within Pig, I thought it would be beneficial to explain how to do the same from a Sizzle program.

In Sizzle, counters are something of an anomaly among the stock aggregators: they are output to the screen at the end of the run and saved in the job history, but they are not otherwise written out to disk.

Being out-of-band makes them useful as debugging tools and indicators of problems; however, they can also be used to gather statistics on the input.

Here, we will do some of the latter and write a short Sizzle program that counts how many times each word occurs in an input file containing one word per line.

Firstly, declare an mrcounter (which stands for MapReduce Counter) table to be used to count the words:

counter: table mrcounter[string] of int;

This declares counter as an mrcounter table that is indexed by a string and accepts int values.

Next, read the input into a variable named word:

word: string = input;

Finally, emit a one to the counter table, indexed by the word just read:

emit counter[word] <- 1;


After compiling it, running this program on the following input:

foo
bar
bar
baz
baz
baz

yields something like the following output to the terminal at the conclusion of the run:

[...]
10/12/28 17:21:24 INFO mapred.JobClient: Sizzle Counters
10/12/28 17:21:24 INFO mapred.JobClient: bar=2
10/12/28 17:21:24 INFO mapred.JobClient: foo=1
10/12/28 17:21:24 INFO mapred.JobClient: baz=3
[...]

In this example there is only one index into the mrcounter table, so the default counter group heading "Sizzle Counters" is used. If more than one index into the mrcounter table is provided, the first is used as the counter group heading, and any subsequent ones are concatenated to form the counter name.
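
As a rough sketch of the two-index case, the program above might be changed as follows. This assumes that a second index is declared and used the way multi-dimensional tables normally are in Sawzall; the group name "Word Counts" is made up purely for illustration:

# Two string indices: the first is the counter group heading, the second the counter name.
counter: table mrcounter[string][string] of int;

word: string = input;

# "Word Counts" is a hypothetical group name chosen for this example.
emit counter["Word Counts"][word] <- 1;

If this behaves as described, the per-word counts should be listed under a "Word Counts" heading rather than under "Sizzle Counters".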

Happy new year!