Showing posts with label hadoop.

Wednesday, December 29, 2010

Incrementing Hadoop counters in Sizzle

Inspired by this post describing how to increment Hadoop counters from within Pig, I thought it would be beneficial to explain how to do the same from a Sizzle program.

In Sizzle, counters are somewhat of an anomaly: unlike the rest of the stock aggregators, their values are printed to the terminal at the end of the run and saved in the job history, but they are not otherwise written out to disk.

Being out-of-band makes them useful as debugging tools and indicators of problems; however, they can also be used to gather statistics on the input.

Here, we will do some of the latter and write a short Sizzle program to count the words in an input file.

First, declare an mrcounter (which stands for MapReduce counter) table to be used to count the words:

counter: table mrcounter[string] of int;

This declares that counter names an mrcounter table that will be indexed by string and will accept int values.

Next, read the input into a variable named word:

word: string = input;

Finally, emit a one to the counter table indexed by the word just read.

emit counter[word] <- 1;
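Putting those three lines together, the complete word-counting program is:

```
counter: table mrcounter[string] of int;

word: string = input;

emit counter[word] <- 1;
```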


After compiling it, running this program on the following input:

foo
bar
bar
baz
baz
baz

yields something like the following output to the terminal at the conclusion of the run:

[...]
10/12/28 17:21:24 INFO mapred.JobClient: Sizzle Counters
10/12/28 17:21:24 INFO mapred.JobClient: bar=2
10/12/28 17:21:24 INFO mapred.JobClient: foo=1
10/12/28 17:21:24 INFO mapred.JobClient: baz=3
[...]

In this example there is only one index into the mrcounter table, so the default counter group heading "Sizzle Counters" is used. If more than one index into the mrcounter table is provided, the first is used as the counter group heading, and any subsequent ones are concatenated together to form the counter name.
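As an illustration of the multi-index case, a sketch (the table and index names here are hypothetical, following the declaration style used above) might look like:

```
status: table mrcounter[group: string][name: string] of int;

emit status["HTTP Status"]["404"] <- 1;
```

With a declaration like this, "HTTP Status" would appear as the counter group heading in the job output, and "404" as the counter name within that group.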

Happy new year!

Tuesday, November 16, 2010

Running a Sizzle Program on Hadoop

Last week I discussed the basic anatomy of a Sizzle program by showing the syntax of a small program, and promised that this week I would describe how to run it on a Hadoop cluster.

The program is still here, comments included, but I've also placed it inline for easier cut and paste:

best: table top(3)[url: string] of referer: string weight count: int;
line: string = input;
fields: array of string = saw(line, ".*GET ", "[^\t ]+",
    " HTTP/1.[0-9]\"", "[0-9]+", "[0-9]+", "\"[^\t ]+\"");
emit best[fields[1]] <- fields[5] weight 1;
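To make the saw() call concrete, here is a hypothetical Apache log line and one plausible reading of the fields it would produce, assuming each pattern contributes one element to the array, in order:

```
# hypothetical input line:
# 127.0.0.1 - - [16/Nov/2010] "GET /index.html HTTP/1.1" 200 1043 "http://google.com/search?q=example"

# resulting fields, one per pattern:
# fields[0] = everything through 'GET ' (matched by ".*GET ")
# fields[1] = '/index.html'                           # the requested URL
# fields[2] = ' HTTP/1.1"'
# fields[3] = '200'                                   # the status code
# fields[4] = '1043'                                  # the response size
# fields[5] = '"http://google.com/search?q=example"'  # the referer
```

So the emit statement indexes the table by the requested URL (fields[1]) and counts occurrences of each referer (fields[5]).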


First, dump that code into a file called 'best.szl'. Sizzle is a compiled language, so next we run the Sizzle compiler on it with something similar to the following command line:

bash$ java -jar ~/Projects/Sizzle/dist/sizzle-compiler.jar -h ~/lib/hadoop-0.21.0 -i best.szl -o best.jar

Of course, you will need to change the command line to match your environment. The sizzle-compiler.jar specified by -jar is produced by building the Sizzle project, so point it at wherever you cloned Sizzle. The directory specified by -h is the location of the Hadoop distribution, so point it at wherever you downloaded and unpacked Hadoop. The last two parameters specify the Sizzle program to be used as input and the name of the jar file to be output, respectively.

The compilation process first type-checks the program to make sure that it will run correctly, then produces Java class files that are subsequently bundled into the jar file specified by the user.

The compiler will print some status messages to your screen while it works, to let you know what it is doing:

[...] compiling [...]/sizzle/best.java
[...] adding [...]/sizzle/best.class to best.jar
[...] adding [...]/sizzle/best$bestSizzleCombiner.class to best.jar
[...] adding [...]/sizzle/best$bestSizzleMapper.class to best.jar
[...] adding [...]/sizzle/best.java to best.jar
[...] adding [...]/sizzle/best$bestSizzleReducer.class to best.jar
[...] adding lib/sizzle-runtime.jar to best.jar

As you can see from the compiler's output, the resulting jar has the same contents as a typical Hadoop job jar: namely, a main class to run the job, plus the Mapper, Combiner, and Reducer that do the actual work. Since it is just a Hadoop job jar, you can now put it onto any Hadoop cluster and run it, assuming you have some Apache logs there to run it on.

To get those logs onto the cluster, just use the hadoop command-line program, like so:

bash$ hadoop fs -put web.log /users/anthonyu

Then run the job with Hadoop, too:

bash$ hadoop jar best.jar sizzle.best web.log output

This command line just asks Hadoop to run the class "sizzle.best"* from the jar "best.jar" on the input "web.log," and to place the output into "output." Hadoop will produce output that is far too copious to replicate here, but eventually the job will complete and print to the screen that it has. Then, you can retrieve the output with a command line like this:

bash$ hadoop fs -cat output/part-*

This will display the output in your terminal, something like this:

best[http://example.com/] = -, 3984
best[http://example.com/] = http://google.com/search?q=example, 1587
best[http://example.com/] = http://bing.com/search?q=example, 1233
best[http://login.example.com/] = http://example.com/, 3984
best[http://login.example.com/] = http://login.example.com, 1282
best[http://login.example.com/] = http://login.example.com/lostpassword, 470

Go ahead and try this on your own and let me know how it works, either here in the comments or via Twitter @SizzleLanguage.

* Sawzall has no namespaces, so Sizzle just dumps everything into the "sizzle" package.