Monday 1 May 2017

Why I'm learning Pig

I made fun of the Apache Pig project the first time I came across it, but I take it back. I now fully see its value and I'm learning how to use it. As there is a lot of ignorant discussion online and offline claiming that Pig and Hive are equivalent tools, differing only in syntax between the declarative, SQL-like HiveQL and the procedural, scripting-style Pig Latin, let me explain how I became convinced of the need to use both.

I came to Hadoop by gaining access to a system set up for us by a cloud provider, with a lot (but not all) of the data I'm interested in already sitting in HDFS and Hive tables. In that situation, it's taking me a while to figure out what every part of the Hadoop ecosystem does and how it could be useful to me. Hive was the one thing that seemed immediately useful and worth learning: it held a lot of data I was interested in, it more or less follows an accessible standard language (SQL), and it offers quite powerful statistical functions. An initial presentation on it from the provider claimed it could be used for hypothesis testing, predictive analytics and so on, and while that seems a bit misleading in retrospect, Hive can provide all the statistics needed by any specialist tool that does the actual testing or prediction. So far so good. I did play with Spark a few times to figure out what it is and how it works, but the barrier to entry there seemed definitely higher: you have to worry about cluster configuration, memory and so on when you launch jobs, and you have to write a lot of low-level code (RDDs, closures etc.)

One of the knowledge exchange sessions with the provider was on using Hadoop for ad hoc analysis. Their suggested process was: copy the data to HDFS, load it into a newly created Hive table, load the data from the Hive table into a Spark dataframe, do some operations, convert to an RDD, do more operations. It seemed awfully complicated. When a real need for such analysis came up, I realised I would have to define a 44-column table schema when I only wanted to average one column grouped by the contents of another, and I gave up on using Hadoop for the task at all. It bothered me that I didn't know how to do something this simple on Hadoop, though, so I kept reading through books and searching online until Pig emerged as the obvious solution. The syntax for what I wanted to do was ridiculously easy:
-- load the raw CSV (no schema needed) and filter out the header row
file_data = LOAD 'hdfs://cluster/user/username/file.csv' USING PigStorage(',');
raw_data = FILTER file_data BY $0 != 'field0_name';
-- keep the two columns of interest, then group and average
data_fields = FOREACH raw_data GENERATE $11 AS file_dimension, (int)$43 AS file_measure;
data_group = GROUP data_fields BY file_dimension;
avg_file_measure = FOREACH data_group GENERATE group, AVG(data_fields.file_measure) AS file_measure_avg;
DUMP avg_file_measure; -- nothing actually runs until a DUMP or STORE is requested
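As a small aside under the same assumptions (the output path is made up for illustration), the averaged result can also be written back to HDFS rather than printed:
STORE avg_file_measure INTO 'hdfs://cluster/user/username/avg_by_dimension' USING PigStorage(','); -- illustrative output directory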
This example embodies several aspects of Pig's philosophy. Pigs eat everything: Pig doesn't necessarily require a full schema, and isn't particularly difficult about the field delimiter or the presence or absence of a CSV header (which I filter out in the example above). Pig can go even further, working with semi-structured and unstructured, non-normalised data that would be entirely unsuitable for Hive without serious processing. Pigs are domestic animals, and rather friendly to the user. One of the early presentations on Pig stated that it "fits the sweet spot between the declarative style of SQL, and the low-level, procedural style of MapReduce". I would dare say that this statement could be updated for the Hadoop 2 world with Spark in place of MapReduce, so it is unsurprising that Pig is still heavily used for ETL and other work on Hadoop, and that Pig on Spark is in the works (hopefully delivering on the Pigs fly promise). A final point, that Pigs live anywhere, should comfort anyone worried about learning such a niche language: it is also supported elsewhere, e.g. on Amazon EMR.
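
To illustrate the 'eats everything' point with a minimal sketch (the paths, field names and types here are made up for illustration): the same LOAD statement can take an explicit schema and a different delimiter, and the built-in TextLoader reads completely unstructured text one line at a time.
-- hypothetical tab-delimited file, loaded with an explicit schema
tsv_data = LOAD 'hdfs://cluster/user/username/file.tsv' USING PigStorage('\t') AS (dimension:chararray, measure:int);
-- hypothetical directory of plain text logs, read line by line
log_lines = LOAD 'hdfs://cluster/user/username/logs/' USING TextLoader() AS (line:chararray);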

So in retrospect: an organisation can adopt Hadoop and throw all its data into a 'data lake' in HDFS. Any competent programmer in that organisation can then use an array of programming approaches (Pig, raw MapReduce, Spark) to analyse this data, some faster to program, others more powerful but requiring more programming effort. This is the fabled 'end of the data warehouse', but it is only possible if the users of the data can do their own programming. If, on the other hand, the organisation wants to give non-programmer analysts access to the data, connect standard BI tools to it and so on, then it adopts Hive, but has to do a lot of the same work that a traditional data warehouse requires: ETL, normalisation etc. The main advantage of Hive over a traditional DWH is being able to cope with Big Data volumes that would grind an RDBMS to a halt. In most cases a happy medium is probably reached where key data sits in Hive tables, while a lot of other 'niche' data stays in non-structured or non-normalised formats in the data lake. I have not addressed where NoSQL databases fit into this picture; I promise to come back to the subject when I have a similar NoSQL epiphany.
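
As a closing sketch of that happy medium (everything here is hypothetical: it assumes Pig is launched with the -useHCatalog option, a Hive table default.key_table that has a key_id column, and a raw CSV file left in the lake), a single Pig script can read the curated Hive table through HCatalog and join it with untreated data straight from HDFS:
-- hypothetical curated Hive table, read through HCatalog (requires launching Pig with -useHCatalog)
hive_rows = LOAD 'default.key_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
-- hypothetical 'niche' data that never made it into a Hive schema
lake_rows = LOAD 'hdfs://cluster/user/username/niche_data.csv' USING PigStorage(',') AS (key_id:chararray, niche_measure:double);
-- combine the two in a single statement
joined = JOIN hive_rows BY key_id, lake_rows BY key_id;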
