Wednesday, December 29, 2010

Incrementing Hadoop counters in Sizzle

Inspired by this post describing how to increment Hadoop counters from within Pig, I thought it would be beneficial to explain how to do the same from a Sizzle program.

In Sizzle, counters are somewhat of an anomaly as, unlike the rest of the stock aggregators, counters are output to the screen at the end of the run and saved in the job history, but they are not otherwise written out to disk.

Being out-of-band makes them useful as debugging tools and indicators of problems; however, they can also be used to gather statistics on the input.

Here, we will do some of the latter and write a short Sizzle program to count the words in an input file.

Firstly, declare an mrcounter (which stands for MapReduce Counter) table to be used to count the words:

counter: table mrcounter[string] of int;

This declares that counter names a mrcounter table that will be indexed by string and will accept an int value.

Next, read the input into a variable, named word:

word: string = input;

Finally, emit a one to the counter table indexed by the word just read.

emit counter[word] <- 1;


After compiling it, running this program on the following input:

foo
bar
bar
baz
baz
baz

yields something like the following output to the terminal at the conclusion of the run:

[...]
10/12/28 17:21:24 INFO mapred.JobClient: Sizzle Counters
10/12/28 17:21:24 INFO mapred.JobClient: bar=2
10/12/28 17:21:24 INFO mapred.JobClient: foo=1
10/12/28 17:21:24 INFO mapred.JobClient: baz=3
[...]

In this example there is only one index into the mrcounter table, so the default counter group header "Sizzle Counters" is used. If more than one index into the mrcounter table is provided, the first is used as the counter group heading, and any subsequent ones are concatenated together to be the counter name.

Happy new year!

Tuesday, November 16, 2010

Hello, World!

Every programming language tutorial starts off with a "hello world" example, so Sizzle should be no different. While Sizzle typically sends its output to large files or a database, printing output to the screen is necessary for any programming language, particularly those that have no debugger.

In Sizzle, it's as easy as:

emit stdout <- "Hello World!";


In a similar vein, printing out error text goes like this:

emit stderr <- "Something bad happened!";


These work because, behind the scenes, Sizzle predefines special aggregators called 'stdout' and 'stderr' that just echo anything emitted to them out to the terminal:

stdout: table collection of s: string file("/dev/stdout") format("%s\n", s);
stderr: table collection of s: string file("/dev/stderr") format("%s\n", s);


As a side note, when running under Windows, Sizzle uses synthetic equivalents to '/dev/stdout' and '/dev/stderr', so this works under every popular operating system.

Running a Sizzle Program on Hadoop

Last week I discussed the basic anatomy of a Sizzle program by showing the syntax of a small program, and promised that this week I would describe how to run it on a Hadoop cluster.

The program is still here, comments included, but I've also placed it inline for easier cut and paste:

best: table top(3)[url: string] of referer: string weight count: int;
line: string = input;
fields: array of string = saw(line, ".*GET ", "[^\t ]+",
    " HTTP/1.[0-9]\"", "[0-9]+", "[0-9]+", "\"[^\t ]+\"");
emit best[fields[1]] <- fields[5] weight 1;


First, dump that code into a file called 'best.szl'. Sizzle is a compiled language, so next we run the Sizzle compiler on it with something similar to the following command line:

bash$ java -jar ~/Projects/Sizzle/dist/sizzle-compiler.jar -h ~/lib/hadoop-0.21.0 -i best.szl -o best.jar

Of course, you will need to change the command line to match your environment.  The sizzle-compiler.jar specified by -jar is produced by building the Sizzle project, so point it to where you cloned Sizzle to.  The Hadoop directory specified by -h is the location of the Hadoop distribution, so just point it to where you downloaded and unpacked Hadoop.  The last two parameters specify the Sizzle program to be used as input and the name of the jar file to be output, respectively.

The compilation process first type checks the program to make sure that it will run correctly, then  produces Java class files that are subsequently bundled into the jar file specified by the user.

The compiler will print some status messages to your screen while it works, to let you know what it is doing:

[...] compiling [...]/sizzle/best.java
[...] adding [...]/sizzle/best.class to best.jar
[...] adding [...]/sizzle/best$bestSizzleCombiner.class to best.jar
[...] adding [...]/sizzle/best$bestSizzleMapper.class to best.jar
[...] adding [...]/sizzle/best.java to best.jar
[...] adding [...]/sizzle/best$bestSizzleReducer.class to best.jar
[...] adding lib/sizzle-runtime.jar to best.jar

As you can see from the output of the compiler, it has the same contents as the typical Hadoop job jar: namely, a main class to run the job, and its complement of a Mapper, a Combiner and a Reducer that do the actual work.  Since it is just a Hadoop job jar, you can now put this onto any Hadoop cluster and run it, assuming you have some Apache logs there to run it on.

In order to get those logs on the cluster, just use the hadoop command line program, like thus:

bash$ hadoop fs -put web.log /users/anthonyu

Then run the job with Hadoop, too:

bash$ hadoop jar best.jar sizzle.best web.log output

This command line just asks Hadoop to run the class "sizzle.best"* from the jar "best.jar" on the input "web.log," and to place the output into "output." Hadoop will produce output that is far too copious to replicate here, but eventually the job will complete and print to the screen that it has.  Then, you can retrieve the output with a commandline like this:

bash$ hadoop fs -cat output/part-*

Which will display the output to your terminal, something like this:

best[http://example.com/] = -, 3984
best[http://example.com/] = http://google.com/search?q=example, 1587
best[http://example.com/] = http://bing.com/search?q=example, 1233
best[http://login.example.com/] = http://example.com/, 3984, 0
best[http://login.example.com/] = http://login.example.com, 1282
best[http://login.example.com/] = http://login.example.com/lostpassword, 470

Go ahead and try this on your own and let me know how it works, either here in the comments or via Twitter @SizzleLanguage.

* Sawzall has no namespaces, so Sizzle just dumps everything into the "sizzle" package.

Thursday, November 11, 2010

The Anatomy of a Sizzle Program

Now is probably a good time to discuss the syntax of Sizzle programs. While you saw a little bit of syntax in the previous post, in this one will describe a more useful program.

This program is called "best_referers.szl" and is available in the source and binary distributions under the top-level directory "examples." It takes as input the log files from a web server and outputs the top three referers for each page that had been requested from the web server.

This first line has a lot going on. First, declare a 'top' aggregator, which filters out all but the top n values emitted to it.  It takes the name "best" for later reference, and the "3" in parentheses next to the name specifies that we are only interested in the top three results from all values emitted to it. Next, the words inside the square brackets declare that "best" will be indexed by a string, in this case the URL of the page. Finally, the last clause indicates that the values emitted will be strings weighted by an int, in this case the count.

best: table top(3)[string] of string weight int;


The aggregator is the main output point for Sizzle programs. In essence, they are just prefabricated reducers, doing some specific aggregation on the values emitted to them, then writing their output to disk.

The next line is simpler, it just declares a string named "line" to hold the input value.

line: string = input;


In Sizzle, a program works on a single value at a time. Just like a Hadoop Mapper class, the program doesn't have access to the previous or next value, or any global state. This constraint allows for many optimizations that would otherwise be impossible.


Next, use the 'saw' function to parse the line. 'saw' uses a variable number of regular expressions to chop a line into an array, sort of like any other languages' 'split' function on steroids. In this case, chop it and store it in an array of strings called "fields."

fields: array of string = saw(line, ".*GET ", "[^\t ]+", " HTTP/1.[0-9]\"", "[0-9]+", "[0-9]+", "\"[^\t ]+\"");


'saw' regexes arePerl Compatible Regular Expressions, so their syntax will be familiar to most programmers. In this case we are chopping Apache combined logs, so the 'saw' function will work in one call, whereas doing the same in Perl or Java 'split' would be substantially more complicared.


Finally, just emit the URL, stored in "fields" at index 1, and its referer, stored at index 5, to the "best" table with a count of one.

emit best[fields[1]] <- fields[5] weight 1;



Running this program on a web server's logs outputs a text file with summaries of the top three referers for each page, something like this:

best[http://example.com/] = -, 3984
best[http://example.com/] = http://google.com/search?q=example, 1587
best[http://example.com/] = http://bing.com/search?q=example, 1233
best[http://login.example.com/] = http://example.com/, 3984, 0
best[http://login.example.com/] = http://login.example.com, 1282
best[http://login.example.com/] = http://login.example.com/lostpassword, 470


Stay tuned; in the next post, I will discuss in detail how to run a Sizzle progam and retrieve its output.

Monday, November 8, 2010

What makes Sizzle different?

Since the announcement last Friday, I have gotten a few questions about Sizzle, mainly in regards to what separates Sizzle from other mapreduce languages like Pig Latin and Hive.

While all three are domain-specific languages designed to perform analysis of large data sets in a scalable manner, the main difference is that Sizzle is procedural, like Java or Perl, and comes with a wide array of sophisticated aggregation functions, while Pig and Hive are relational, like SQL, and come with those aggregators typically found in SQL databases.

Sizzle is based on Sawzall, which was developed at Google for scalable data analysis. In particular, it was designed to make writing mapreduce jobs that perform aggregation or statistics gathering quick and painless.

Here is a trivial example that demonstrates the expressiveness of Sizzle:


total: table sum of int;

x: int = input;

emit total <- x;


This program reads a number of integers from a file, and adds each integer to a total which is output when the program finishes.

While a Perl golfer could write the same program in fewer characters, that version would not be able to scale linearly on thousands of computers. On the other hand, writing a similar program in Java for Hadoop would require substantially more code.

We can make a slight change to the previous example to make the program perform a more advanced aggregate:


percentiles: table quantile(101) of int;

x: int = input;

emit percentiles <- x;


This program reads the same integers, but this time divides the set into percentiles and outputs the borders between them. This demonstrates the sweet spot that Sizzle occupies: quick and simple ways to write distributed analytics programs that are intuitive to those who come from a procedural programming background.

Thanks for reading, and come back later this week when I will explain a Sizzle program that does something more useful.

Friday, November 5, 2010

Announcing Sizzle, a Compiler and Runtime for the Sawzall Language

I am pleased to announce the v0.0 release of Sizzle, a compiler and runtime for the Sawzall language. Sizzle targets Hadoop directly, by compiling Sawzall programs into Hadoop job jars that can be run anywhere Hadoop is installed, without requiring a Sawzall interpreter to also be present.

Although the current release should be considered developer preview release, it is currently possible to run non-trivial Sawzall programs on a 0.21 Hadoop cluster or your local desktop.

More information here:

https://github.com/anthonyu/Sizzle/wiki

and all code is available from:

https://github.com/anthonyu/Sizzle

Please try it out, and let me know any problems you experience via
github issues, @SizzleLanguage, or comments to this blog.



Cheers,
Anthony