
Thursday, November 11, 2010

The Anatomy of a Sizzle Program

Now is probably a good time to discuss the syntax of Sizzle programs. While you saw a little bit of syntax in the previous post, in this one I will describe a more useful program.

This program is called "best_referers.szl" and is available in the source and binary distributions under the top-level directory "examples." It takes as input the log files from a web server and outputs the top three referers for each page requested from the server.

This first line has a lot going on. First, declare a 'top' aggregator, which filters out all but the top n values emitted to it. It takes the name "best" for later reference, and the "3" in parentheses next to the name specifies that we are only interested in the top three results among the values emitted for each index. Next, the type inside the square brackets declares that "best" will be indexed by a string, in this case the URL of the page. Finally, the last clause indicates that the values emitted will be strings weighted by an int, in this case the count.

best: table top(3)[string] of string weight int;


Aggregators are the main output point for Sizzle programs. In essence, they are just prefabricated reducers: each performs some specific aggregation on the values emitted to it, then writes its output to disk.
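
For instance, here is a quick sketch of two more table declarations, based on Sawzall's documented table types (which Sizzle aims to support; neither of these appears in best_referers.szl):

hits: table sum[string] of int;        # sums the ints emitted for each index
pages: table collection of string;     # simply collects every value emitted to it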

The next line is simpler: it just declares a string named "line" to hold the input value.

line: string = input;


In Sizzle, a program works on a single value at a time. Just like a Hadoop Mapper class, the program doesn't have access to the previous or next value, or any global state. This constraint allows for many optimizations that would otherwise be impossible.
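
As a minimal illustration (a sketch of my own, not part of best_referers.szl), the entire program below runs once for every input record, so the only way to accumulate anything across records is to emit to an aggregator:

record_count: table sum of int;

# No loop over the input and no mutable global state: the runtime runs
# this program once per record, and the sum table does the accumulation.
emit record_count <- 1;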


Next, use the 'saw' function to parse the line. 'saw' takes a variable number of regular expressions and uses them to chop a line into an array, sort of like other languages' 'split' function on steroids. In this case, the line is chopped up and stored in an array of strings called "fields."

fields: array of string = saw(line, ".*GET ", "[^\t ]+", " HTTP/1.[0-9]\"", "[0-9]+", "[0-9]+", "\"[^\t ]+\"");


'saw' regexes are Perl Compatible Regular Expressions, so their syntax will be familiar to most programmers. In this case we are chopping Apache combined logs, and 'saw' can do the job in a single call, whereas doing the same with Perl's or Java's 'split' would be substantially more complicated.
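
Purely for illustration, here is roughly what that chopping looks like on a made-up combined-log line (this example is mine, not from the Sizzle distribution):

# Hypothetical input line:
#   1.2.3.4 - - [11/Nov/2010:00:00:00 +0000] "GET /index.html HTTP/1.1" 200 1234 "http://google.com/search?q=example" "Mozilla/5.0"
#
# After the saw() call above, fields[1] should hold the requested path
# ("/index.html") and fields[5] the referer, which is exactly what the
# emit statement below relies on.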


Finally, just emit the URL, stored in "fields" at index 1, and its referer, stored at index 5, to the "best" table with a count of one.

emit best[fields[1]] <- fields[5] weight 1;



Running this program on a web server's logs outputs a text file with summaries of the top three referers for each page, something like this:

best[http://example.com/] = -, 3984
best[http://example.com/] = http://google.com/search?q=example, 1587
best[http://example.com/] = http://bing.com/search?q=example, 1233
best[http://login.example.com/] = http://example.com/, 3984, 0
best[http://login.example.com/] = http://login.example.com, 1282
best[http://login.example.com/] = http://login.example.com/lostpassword, 470


Stay tuned; in the next post, I will discuss in detail how to run a Sizzle program and retrieve its output.

Monday, November 8, 2010

What makes Sizzle different?

Since the announcement last Friday, I have gotten a few questions about Sizzle, mainly regarding what separates it from other MapReduce languages like Pig Latin and Hive.

All three are domain-specific languages designed to analyze large data sets in a scalable manner. The main difference is that Sizzle is procedural, like Java or Perl, and comes with a wide array of sophisticated aggregation functions, while Pig and Hive are relational, like SQL, and come with the aggregators typically found in SQL databases.

Sizzle is based on Sawzall, which was developed at Google for scalable data analysis. In particular, it was designed to make writing MapReduce jobs that perform aggregation or statistics gathering quick and painless.

Here is a trivial example that demonstrates the expressiveness of Sizzle:


total: table sum of int;

x: int = input;

emit total <- x;


This program reads a series of integers from a file and adds each one to a running total, which is output when the program finishes.

While a Perl golfer could write the same program in fewer characters, that version would not be able to scale linearly on thousands of computers. On the other hand, writing a similar program in Java for Hadoop would require substantially more code.

We can make a slight change to the previous example to make the program perform a more advanced aggregate:


percentiles: table quantile(101) of int;

x: int = input;

emit percentiles <- x;


This program reads the same integers, but this time divides the set into percentiles and outputs the borders between them. This demonstrates the sweet spot that Sizzle occupies: quick and simple ways to write distributed analytics programs that are intuitive to those who come from a procedural programming background.
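
To give a feel for that procedural side, here is one last sketch of my own (not from the Sawzall paper or the Sizzle examples); it combines an ordinary conditional with an emit to count how many of the input integers exceed an arbitrary threshold of 100:

over_threshold: table sum of int;

x: int = input;

# Plain procedural control flow runs for each record before anything is
# emitted; only values that pass the test reach the aggregator.
if (x > 100)
    emit over_threshold <- 1;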

Thanks for reading, and come back later this week when I will explain a Sizzle program that does something more useful.