Apache Pig is one of my favorite programming languages. Pig is a “data flow” language — kind of a hybrid between SQL and a procedural language. It mostly resembles bash one-liners that pipe data in and out.

Pig is rarely used anymore, because it was replaced by different tools and different approaches (stream processing, for example).

For ETL batch workloads, it was mainly replaced by Spark, which is superior in almost every way. For those who like SQL-like processing there’s Spark SQL, and for those who like to code there’s Scala Spark and PySpark.

For BI workloads, query engines such as Hive and Presto have a more standard SQL-like syntax, which is more convenient for analytics.

So why do I miss Pig? Because it’s simple, because it can be used for both ETL and BI, and because it just works at any scale. Yes, there are better tools, but Pig is strong enough and simple enough for most needs. It even supports User Defined Functions if you really want them, but if you find yourself using these nowadays then Spark is probably better for you.

As a good example of just how fun Pig is, I want to show you what the classic word count looks like.

In Pig, the code looks like this:

lines = LOAD 'file.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
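
Since Pig reminds me of bash one-liners, here is a rough sketch of the same data flow as a shell pipeline, one stage per Pig statement. This is only an approximation: the wordcount function name and sample.txt are made up for illustration, and it splits on spaces and tabs only, whereas Pig's TOKENIZE also splits on a few other delimiters.

```shell
# wordcount FILE: print each distinct word in FILE with its count.
wordcount() {
  cat "$1" |              # LOAD: read the lines
    tr -s ' \t' '\n' |    # FLATTEN(TOKENIZE(line)): one word per line
    sort |                # GROUP BY word: bring equal words together
    uniq -c               # COUNT: collapse each run into "count word"
}

# Usage, with a small sample file created just for this sketch:
printf 'the quick fox\nthe lazy dog\n' > sample.txt
wordcount sample.txt
```

The shape is the same as the Pig script: each step takes the previous relation and hands a new one to the next step. The difference is that Pig gives every intermediate result a name and runs the same script on a cluster.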

In Spark, the code looks like this:

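// In spark-shell, sc and spark.implicits._ (needed for toDF) are already in scope.
val linesDF = sc.textFile("file.txt").toDF("line")
// DataFrame.explode is deprecated since Spark 2.0 but keeps the example short.
val wordsDF = linesDF
  .explode("line", "word")((line: String) => line.split(" "))
val wordCountDF = wordsDF.groupBy("word").count()
wordCountDF.show() // print the result, like DUMP in the Pig version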

There are several differences which stand out pretty quickly:

So yes, I find Pig simpler and more fun than Spark. Obviously, once you get used to Spark the difference really doesn’t matter, which is why I don’t really use Pig anymore. But I miss it.

I want to try it out, now what?

There ya go (on Mac):

brew install pig
pig -x local

An entrepreneur, and a web expert.
