Tuesday, November 16, 2010

Running a Sizzle Program on Hadoop

Last week I discussed the basic anatomy of a Sizzle program by showing the syntax of a small program, and promised that this week I would describe how to run it on a Hadoop cluster.

The program is still here, comments included, but I've also placed it inline for easier cut and paste:

best: table top(3)[url: string] of referer: string weight count: int;
line: string = input;
fields: array of string = saw(line, ".*GET ", "[^\t ]+",
    " HTTP/1.[0-9]\"", "[0-9]+", "[0-9]+", "\"[^\t ]+\"");
emit best[fields[1]] <- fields[5] weight 1;


First, dump that code into a file called 'best.szl'. Sizzle is a compiled language, so next we run the Sizzle compiler on it with something similar to the following command line:

bash$ java -jar ~/Projects/Sizzle/dist/sizzle-compiler.jar -h ~/lib/hadoop-0.21.0 -i best.szl -o best.jar

Of course, you will need to change the command line to match your environment.  The sizzle-compiler.jar specified by -jar is produced by building the Sizzle project, so point it to where you cloned Sizzle to.  The Hadoop directory specified by -h is the location of the Hadoop distribution, so just point it to where you downloaded and unpacked Hadoop.  The last two parameters specify the Sizzle program to be used as input and the name of the jar file to be output, respectively.

The compilation process first type checks the program to make sure that it will run correctly, then  produces Java class files that are subsequently bundled into the jar file specified by the user.

The compiler will print some status messages to your screen while it works, to let you know what it is doing:

[...] compiling [...]/sizzle/best.java
[...] adding [...]/sizzle/best.class to best.jar
[...] adding [...]/sizzle/best$bestSizzleCombiner.class to best.jar
[...] adding [...]/sizzle/best$bestSizzleMapper.class to best.jar
[...] adding [...]/sizzle/best.java to best.jar
[...] adding [...]/sizzle/best$bestSizzleReducer.class to best.jar
[...] adding lib/sizzle-runtime.jar to best.jar

As you can see from the output of the compiler, it has the same contents as the typical Hadoop job jar: namely, a main class to run the job, and its complement of a Mapper, a Combiner and a Reducer that do the actual work.  Since it is just a Hadoop job jar, you can now put this onto any Hadoop cluster and run it, assuming you have some Apache logs there to run it on.

In order to get those logs on the cluster, just use the hadoop command line program, like thus:

bash$ hadoop fs -put web.log /users/anthonyu

Then run the job with Hadoop, too:

bash$ hadoop jar best.jar sizzle.best web.log output

This command line just asks Hadoop to run the class "sizzle.best"* from the jar "best.jar" on the input "web.log," and to place the output into "output." Hadoop will produce output that is far too copious to replicate here, but eventually the job will complete and print to the screen that it has.  Then, you can retrieve the output with a commandline like this:

bash$ hadoop fs -cat output/part-*

Which will display the output to your terminal, something like this:

best[http://example.com/] = -, 3984
best[http://example.com/] = http://google.com/search?q=example, 1587
best[http://example.com/] = http://bing.com/search?q=example, 1233
best[http://login.example.com/] = http://example.com/, 3984, 0
best[http://login.example.com/] = http://login.example.com, 1282
best[http://login.example.com/] = http://login.example.com/lostpassword, 470

Go ahead and try this on your own and let me know how it works, either here in the comments or via Twitter @SizzleLanguage.

* Sawzall has no namespaces, so Sizzle just dumps everything into the "sizzle" package.

No comments:

Post a Comment