Apache Pig is one of my favorite programming languages. Pig is a “data flow” language — kind of a hybrid between SQL and a procedural language. It mostly resembles bash one-liners that pipe data in and out.
Pig is rarely used anymore, because it has been replaced by different tools and different approaches (stream processing, for example).
For ETL batch workloads, it was mainly replaced by Spark, which is superior in almost every way. For those who like SQL-like processing there’s Spark SQL, and for those who like to code there’s Scala Spark and PySpark.
For BI workloads, query engines such as Hive and Presto have a more standard SQL-like syntax, which is more convenient for analytics.
So why do I miss Pig? Because it’s simple, because it can be used for both ETL and BI, and because it just works at any scale. Yes, there are better tools, but Pig is strong enough and simple enough for most needs. It even supports User Defined Functions if you really want them, but if you find yourself using these nowadays then Spark is probably better for you.
As a good example of just how fun Pig is, I want to show you what the classic word count looks like.
In Pig, the code looks like this:
lines = LOAD 'file.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
In Spark, the code looks like this:
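// Assumes spark-shell, where sc and spark.implicits._ are already in scope.
val linesDF = sc.textFile("file.txt").toDF("line")
// Note: DataFrame.explode is deprecated since Spark 2.0.
val wordsDF = linesDF
  .explode("line", "word")((line: String) => line.split(" "))
val wordCountDF = wordsDF.groupBy("word").count()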
There are several differences which stand out pretty quickly:
- Word counting forces us to use a non-trivial string operation such as split. This requires special functions or writing a UDF. In Pig we can use some of the built-in UDFs provided to us, such as TOKENIZE. In Spark, the code-first approach tries to make you comfortable writing your own functions inside the data flow construct, so you won’t need a UDF. Yes, this means you have much more power at your fingertips, but the reality is that you don’t really need that much power for standard ETL.
- In Pig Latin, every line is one single operation, and will usually boil down to a simple operation such as GROUP BY. The syntax itself is much simpler and more verbose.
- In Pig Latin, we use FLATTEN instead of relying on how explode works. The flattening action is more explicit: we take a single row, convert it to a list and then flatten it back into a single column, one row per element. It becomes very clear then that a FOREACH line with a FLATTEN statement returns more rows than it started out with. In Spark, using the DataFrame interface, we need to pass a function that receives a row and returns several rows, similar to flatMap in the RDD API, which is a bit less clear from just looking at the code.
- The GROUP BY statement in Pig Latin is simpler in the sense that it doesn’t return a weird “group by result” like in Spark. It always returns a new dataset, with the first column being the group key and the second column being a list of all the grouped values. In Spark, you can only use aggregate functions, such as count, on group by results. I find this a bit annoying.
- The data flow itself is very similar: in both cases, you construct a data pipeline that starts out by taking rows of text, converting them to a single column of words, then grouping and counting.
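To make the shapes in that list concrete, here is a rough sketch in plain Python (neither Pig nor Spark) of what each relation in the Pig script holds. The helpers tokenize, flatten and group_by are hypothetical stand-ins for TOKENIZE, FLATTEN and GROUP, and splitting on whitespace is a simplification of what TOKENIZE really does.

```python
from collections import defaultdict


def tokenize(line):
    """Stand-in for TOKENIZE: one line in, one bag (here a list) of words out."""
    return line.split()


def flatten(bags):
    """Stand-in for FLATTEN: emit one output row per element of each bag."""
    return [word for bag in bags for word in bag]


def group_by(rows):
    """Stand-in for GROUP: return (group key, bag of matching rows) pairs."""
    bags = defaultdict(list)
    for row in rows:
        bags[row].append(row)
    return list(bags.items())


lines = ["a rose is a rose", "is it"]               # two rows of text
words = flatten(tokenize(line) for line in lines)   # two rows become seven
grouped = group_by(words)                           # rows of (group, bag)
wordcount = {group: len(bag) for group, bag in grouped}  # COUNT over each bag
```

In this sketch grouped is an ordinary list of (key, bag) pairs that you can inspect or iterate over like any other dataset, which is roughly the property I like about Pig's GROUP. Spark's groupBy instead hands back an intermediate grouped object that only accepts aggregations.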
So yes, I find Pig simpler and more fun than Spark. Obviously, once you get used to Spark it really doesn’t matter, which is why I don’t really use Pig anymore. But I miss it.
I want to try it out, now what?
There ya go (on Mac):
brew install pig
pig -x local