Monday, November 8, 2010

What makes Sizzle different?

Since the announcement last Friday, I have received a few questions about Sizzle, mainly about what separates it from other mapreduce languages like Pig Latin and Hive.

All three are domain-specific languages designed to analyze large data sets in a scalable manner. The main difference is that Sizzle is procedural, like Java or Perl, and comes with a wide array of sophisticated aggregation functions, while Pig and Hive are relational, like SQL, and come with the aggregators typically found in SQL databases.

Sizzle is based on Sawzall, which was developed at Google for scalable data analysis. In particular, it was designed to make writing mapreduce jobs that perform aggregation or statistics gathering quick and painless.

Here is a trivial example that demonstrates the expressiveness of Sizzle:

total: table sum of int;

x: int = input;

emit total <- x;

This program reads integers from a file and adds each one to a total, which is output when the program finishes.
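For readers more familiar with general-purpose languages, here is a rough Python sketch of what the program above computes. It is only an illustration, not how Sizzle is implemented: the names sum_shard and run_sum_job are made up for this post, and the "shards" are plain in-memory lists standing in for pieces of the input file. Each shard is summed independently and the partial sums are then combined, which is what lets a sum aggregator spread across many machines.

```python
from typing import Iterable, List


def sum_shard(records: Iterable[str]) -> int:
    """Hypothetical per-shard step: parse each record as an int and add it up."""
    partial = 0
    for record in records:
        x = int(record)   # plays the role of `x: int = input;`
        partial += x      # plays the role of `emit total <- x;`
    return partial


def run_sum_job(shards: List[List[str]]) -> int:
    """Hypothetical driver: combine the partial sums from every shard."""
    return sum(sum_shard(shard) for shard in shards)


if __name__ == "__main__":
    # Three made-up shards of the input file.
    print(run_sum_job([["1", "2"], ["3", "4"], ["5"]]))
```

Because addition is associative, the grouping of records into shards does not affect the final total.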

While a Perl golfer could write the same program in fewer characters, that version would not be able to scale linearly on thousands of computers. On the other hand, writing a similar program in Java for Hadoop would require substantially more code.

We can make a slight change to the previous example to make the program perform a more advanced aggregate:

percentiles: table quantile(101) of int;

x: int = input;

emit percentiles <- x;

This program reads the same integers, but this time divides them into percentiles and outputs the boundaries between them. This demonstrates the sweet spot that Sizzle occupies: a quick and simple way to write distributed analytics programs that feels intuitive to anyone with a procedural programming background.
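To make the idea of "borders between percentiles" concrete, here is a small Python sketch that computes such borders exactly, in memory. It rests on a few assumptions: that quantile(101) means 101 evenly spaced values running from the minimum to the maximum, and that picking the element at a rounded-down rank is an acceptable way to choose each border. The function name quantile_borders is made up for this post, and a real distributed aggregator would most likely approximate the result rather than sort everything on one machine.

```python
from typing import Iterable, List


def quantile_borders(values: Iterable[int], n: int) -> List[int]:
    """Return n evenly spaced order statistics, from the minimum to the maximum.

    Hypothetical helper: an exact, single-machine stand-in for a quantile table.
    """
    if n < 2:
        raise ValueError("need at least two borders (minimum and maximum)")
    ordered = sorted(values)
    if not ordered:
        return []
    last = len(ordered) - 1
    # Border i sits i/(n-1) of the way through the sorted data (rank rounded down).
    return [ordered[(i * last) // (n - 1)] for i in range(n)]


if __name__ == "__main__":
    borders = quantile_borders(range(1000), 101)
    print(borders[0], borders[50], borders[100])  # minimum, median border, maximum
```

With n = 101 the result has 101 entries: the first is the smallest input, the last is the largest, and the 99 in between separate the 100 percentile ranges.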

Thanks for reading, and come back later this week when I will explain a Sizzle program that does something more useful.
