The Definitive Guide to Bash Analytics

As much as Spark and Python are great at data analysis, most of the time, as a data engineer, you just want to start working on your CSV file in the command line.

Well, as it turns out, there’s so much you can do just with Bash. And it’s SUPER fast and easy.

The data set we’ll be using is some random sales database I found online here:

http://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/

If you start by looking at the file, you get something like this:

cat sales.csv | less -S

As it turns out, there’s this cool tool called column which makes your CSV file look like this:

cat sales.csv |  column -t -s "," | less -S

Let the Data Science begin

The first thing you’d want to look at is aggregations. The most basic aggregation is a count. In this specific file, we can aggregate by item type, region, country and more. Let’s say we wanted to see how many sales items there are by region. First, we look at the first column just to make sure we chose the right one:

cat sales.csv |  cut -f 1 -d "," | head

And then you’ll need to run a sort piped to a uniq -c command. Sort always must be executed before uniq for uniq to work properly, since uniq only looks at the previous line to generate the unique output. The -c command would then count each unique instead of just returning a unique list.

cat sales.csv |  cut -f 1 -d "," | sort | uniq -c

A nicer output would be to sort the list again, but this time using numerical sort with the -n flag, which knows that the first column is a number so it would get sorted properly:

cat sales.csv |  cut -f 1 -d "," | sort | uniq -c | sort -n

To remove the annoying header from the statistics there, use tail -n+2. The tail command reads all lines from the end and up until n-2, which means the entire file except for the last row from the bottom, a.k.a the first row:

cat sales.csv |  cut -f 1 -d "," | tail -n+2 | sort | uniq -c | sort -n

Conditional Filtering

Another super useful snippet is to filter only a set of rows, given a certain condition. For example, only output columns 3 and 4 if column 5 is equal to “Asia”.

We’ll do this by using Awk, a very powerful command line scripting language.

Let’s say you want only the column for how many units were sold (column 9), but only for sales in Asia.

cat sales.csv | awk -F ',' '{ if ($1 == "Asia") print $9 }'

You can also print all of the row using $0:

cat sales.csv | awk -F ',' '{ if ($1 == "Asia") print $0 }'

Counting the rows is easy:

cat sales.csv | awk -F ',' '{ if ($1 == "Asia") print $0 }' | wc -l

However, summing them up requires a bit more Awk magic:

cat sales.csv | awk -F ',' '{ if ($1 == "Asia") print $0 }' | awk '{s+=$1} END {print s}'

Obviously Awk is not something you want to write complex script with due to readability. But for quick and dirty data science, it does the job well.

Joining

Surprisingly, Unix also has a simple join tool called join which can be used to join two CSVs fairly easily.

Now let’s say you want to have a CSV file with the sales data, but joined with the population of the country.

Remember that CSV is a complex format and that bash can’t handle it properly — so if you have quotes you’ll have to remove them. And if you have commas inside the quotes, then you’ll need a more complex parser (like Python or a dedicated tool) to parse them and turn them into something that bash can handle, like TSVs without commas.

curl -O https://raw.githubusercontent.com/cristiroma/countries/master/data/csv/countries.csv
cat countries.csv | tr -d '"' > countries_clean.csv

Join is a very simple command, but like uniq it also expects the inputs to be sorted, which makes it much less readable:

join -t "," -1 2 -2 1 <(sort -k 2 -t "," sales.csv) <(sort -k 1 -t "," countries_clean.csv) 

As you might have guessed, the -1 argument is the column number of the first file to join on, and the -2 argument is the column number of the second file to join on.

Here is the joined output of the sales file with the countries file:

Histograms

Sometimes the command line is not enough for what we want to do, so we need to add a little extra Python.

Here’s a very crude but working useful histogram one-liner to show a histogram for the “unit price” in the CSV:

cat sales.csv | cut -f 10 -d "," | tail -n+2 | python -c "import sys, collections; values = [float(x) for x in sys.stdin]; min_v = min(values); max_v = max(values); norm = [int(10*(x-min_v)/(max_v-min_v)) for x in values]; print '\n'.join(['%12.4f - %12.4f: (%8d) %s' % (x[0]*((max_v-min_v)/10)+min_v, (x[0]+1)*((max_v-min_v)/10)+min_v, x[1], '*' * (100 * x[1] / len(values))) for x in sorted(collections.Counter(norm).items())])"

The possibilities are obviously limitless, but the rationale is that if you want to do simple processing and analysis on files, then you have a lot of small but useful utilities like cut, wc, awk, tr, sort, uniq, and more.

Spark, for example, is sometimes used even when small data is involved — but these tools can be used with files as large as 10GB or more to do basic operations, and that’s even without parallelism. You can use tools such as parallel and xargs to do parallel processing as well, and get the power of the likes of Spark with bash processing. A lot of times, it would even be faster — both in writing time and in execution time.

An entrepreneur, and a web expert.