Thursday, November 11, 2010

The Anatomy of a Sizzle Program

Now is probably a good time to discuss the syntax of Sizzle programs. While you saw a little bit of syntax in the previous post, in this one I will describe a more useful program.

This program is called "best_referers.szl" and is available in the source and binary distributions under the top-level directory "examples." It takes as input the log files from a web server and outputs the top three referers for each page that was requested from that server.

The first line has a lot going on. First, declare a 'top' aggregator, which filters out all but the top n values emitted to it. It takes the name "best" for later reference, and the "3" in parentheses next to 'top' specifies that we are only interested in the top three results from all values emitted to it. Next, the type inside the square brackets declares that "best" will be indexed by a string, in this case the URL of the page. Finally, the last clause indicates that the values emitted will be strings weighted by an int, in this case the count.

best: table top(3)[string] of string weight int;


Aggregators are the main output point for Sizzle programs. In essence, they are just prefabricated reducers: each one performs a specific aggregation on the values emitted to it, then writes its output to disk.
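
To make the "prefabricated reducer" idea concrete, here is a minimal Python sketch of what a table declared as top(3)[string] of string weight int does with the values it receives. This is only an illustration, not how Sizzle implements it: the class name TopTable and its methods are made up for this example, the counting is exact (a real aggregator may use an approximation), and the results stay in memory instead of going to disk.

```python
from collections import Counter, defaultdict


class TopTable:
    """Toy stand-in for `table top(n)[string] of string weight int`."""

    def __init__(self, n):
        self.n = n
        # index (e.g. a URL) -> Counter mapping each value to its summed weight
        self.weights = defaultdict(Counter)

    def emit(self, index, value, weight):
        # Plays the role of `emit best[index] <- value weight w;`
        self.weights[index][value] += weight

    def results(self):
        # Keep only the n heaviest values under each index.
        return {index: counter.most_common(self.n)
                for index, counter in self.weights.items()}


best = TopTable(3)
for referer in ["a", "b", "a", "c", "d", "a", "b", "c", "c", "a"]:
    best.emit("/index.html", referer, 1)
best.emit("/about.html", "a", 1)

top = best.results()
print(top["/index.html"])  # the three heaviest referers for this page
```

Referer "d" is emitted only once, so it is the one that gets filtered out for "/index.html".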

The next line is simpler: it just declares a string named "line" to hold the input value.

line: string = input;


In Sizzle, a program works on a single value at a time. Just like a Hadoop Mapper class, the program doesn't have access to the previous or next value, or any global state. This constraint allows for many optimizations that would otherwise be impossible.
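
One way to see why this constraint helps is the following Python sketch. Because the per-record function sees only its own input, the records can be split across workers in any way and the partial results merged afterwards. The functions process and run and the "page referer" record format are hypothetical, invented for this example.

```python
from collections import Counter


def process(record):
    """Hypothetical per-record program: sees one value, returns its emits."""
    page, referer = record.split(" ")
    return [((page, referer), 1)]


def run(records):
    # Apply the program to each record in isolation and sum the emitted weights.
    totals = Counter()
    for record in records:
        for key, weight in process(record):
            totals[key] += weight
    return totals


records = ["/a x", "/a y", "/b x", "/a x", "/b z", "/a x"]

# One worker handling everything in order...
sequential = run(records)

# ...versus two workers handling arbitrary shards, merged afterwards.
sharded = run(records[::2]) + run(records[1::2])

print(sequential == sharded)
```

Since no record depends on any other, both runs produce the same totals, which is what lets the work be spread over many machines.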


Next, use the 'saw' function to parse the line. 'saw' uses a variable number of regular expressions to chop a line into an array, sort of like other languages' 'split' functions on steroids. In this case, chop the line and store the pieces in an array of strings called "fields."

fields: array of string = saw(line, ".*GET ", "[^\t ]+", " HTTP/1.[0-9]\"", "[0-9]+", "[0-9]+", "\"[^\t ]+\"");


'saw' regexes are Perl Compatible Regular Expressions, so their syntax will be familiar to most programmers. In this case we are chopping Apache combined logs, so the 'saw' function does the job in one call, whereas doing the same with Perl's or Java's 'split' would be substantially more complicated.
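
For readers who think better in a general-purpose language, here is a rough Python approximation of the chopping described above: each regular expression is searched for starting where the previous match ended, and the matched text becomes the next array element. The helper name saw_like and the sample log line are invented for this sketch, and the real 'saw' may differ in its details.

```python
import re


def saw_like(text, *patterns):
    """Approximate 'saw': apply each regex in turn, resuming after the last match."""
    fields = []
    pos = 0
    for pattern in patterns:
        match = re.compile(pattern).search(text, pos)
        if match is None:
            break  # stop at the first pattern that fails to match
        fields.append(match.group(0))
        pos = match.end()
    return fields


# Made-up Apache combined log line, for illustration only.
line = ('127.0.0.1 - - [11/Nov/2010:10:00:00 -0700] '
        '"GET /index.html HTTP/1.1" 200 2326 '
        '"http://example.com/start" "Mozilla/5.0"')

fields = saw_like(line,
                  r".*GET ", r"[^\t ]+", r' HTTP/1.[0-9]"',
                  r"[0-9]+", r"[0-9]+", r'"[^\t ]+"')

print(fields[1])  # the requested URL
print(fields[5])  # the referer, still wrapped in its double quotes
```

Note that the last pattern includes the surrounding double quotes, so in this sketch the referer keeps them.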


Finally, just emit the URL, stored in "fields" at index 1, and its referer, stored at index 5, to the "best" table with a count of one.

emit best[fields[1]] <- fields[5] weight 1;



Running this program on a web server's logs outputs a text file with summaries of the top three referers for each page, something like this:

best[http://example.com/] = -, 3984
best[http://example.com/] = http://google.com/search?q=example, 1587
best[http://example.com/] = http://bing.com/search?q=example, 1233
best[http://login.example.com/] = http://example.com/, 3984
best[http://login.example.com/] = http://login.example.com, 1282
best[http://login.example.com/] = http://login.example.com/lostpassword, 470


Stay tuned; in the next post, I will discuss in detail how to run a Sizzle program and retrieve its output.
