I miss Pig

lines = LOAD 'file.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
val linesDF = sc.textFile("file.txt").toDF("line") // toDF requires import spark.implicits._
val wordsDF = linesDF
  .explode("line", "word")((line: String) => line.split(" "))
val wordCountDF = wordsDF.groupBy("word").count()
wordCountDF.show()
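A note for readers on newer Spark versions: DataFrame.explode was deprecated in Spark 2.0 in favor of the split and explode column functions. A sketch of the same pipeline against a modern SparkSession (the local master setting and app name are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

// In spark-shell a SparkSession is already provided as `spark`
val spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()
import spark.implicits._

val linesDF = spark.read.textFile("file.txt").toDF("line")
// split turns each line into an array of words; explode emits one row per element
val wordsDF = linesDF.select(explode(split($"line", " ")).as("word"))
val wordCountDF = wordsDF.groupBy("word").count()
wordCountDF.show()
```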
  1. Word counting forces us to use a non-trivial string operation such as split. This requires special functions or writing a UDF; in Pig we can rely on built-in functions such as TOKENIZE. Spark's code-first approach, on the other hand, tries to make you comfortable writing your functions inline within the data flow construct, so you won't need a UDF at all. Yes, this means you have much more power at the palm of your hands, but the reality is that you don't really need that much power for standard ETL.
  2. In Pig Latin, every line is a single operation, and usually boils down to simple constructs such as FOREACH and GROUP BY. The syntax itself is simpler, if more verbose.
  3. In Pig Latin, we use FLATTEN instead of relying on how explode works. The flattening action is more explicit: we take a single row, convert it to a list, and then flatten it back into a single column. It becomes very clear that a FOREACH row with a FLATTEN statement returns more rows than it started out with. In Spark, using the DataFrame interface, we need to use a function that receives some rows and returns more rows, similar to flatMap in the RDD API, which is a bit less clear from just looking at the code.
  4. The GROUP BY statement in Pig Latin is simpler in the sense that it doesn't return an opaque "group by result" like in Spark. It always returns a new dataset whose first column is the group key and whose second column is a bag of all the grouped values. In Spark, you can only apply aggregate functions, such as count, to a group-by result. I find this a bit annoying.
  5. The data flow itself is very similar: in both cases, you construct a data pipeline that starts with rows of text, converts them into a single column of words, then groups and counts.
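To make point 4 concrete, here is what Pig reports for the grouped relation's schema; the inner bag keeps every grouped tuple, not just an aggregate:

```
grouped = GROUP words BY word;
DESCRIBE grouped;
-- grouped: {group: chararray, words: {(word: chararray)}}
```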

I want to try it out, now what?

brew install pig
pig -x local
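pig -x local drops you into the interactive Grunt shell, where you can type the statements above one at a time. To run the word count non-interactively instead, save the script to a file (wordcount.pig is just an illustrative name) and pass it on the command line:

```shell
# Create some sample input, then execute the script in local mode
echo "hello world hello" > file.txt
pig -x local wordcount.pig
```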
