Sunday, 22 October 2017

Work around for problems with Split function when connecting Tableau to PostgreSQL

I'm a big fan of the split function when doing calculated fields. Partly because far too often, the fields from the sources I work with are concatenations of other fields, or even because I've done a union of several CSV sources and some crucial bit of information is hidden in the table name/path. But recently I tried to use this with PostgreSQL and I got the following error
  • ERROR: function split_part(text[], unknown, integer) does not exist; Error while executing the query
Now I could investigate this further on the PostgreSQL side of things but I just want to get my thing to work and move on. So instead here's the workaround I came up with:
mid([concat_str],start_of_split, find([concat_str],'/',start_of_split)-start_of_split)
where start_of_split is a fixed number of characters (maybe you want to do another find here) and '/' defines the delimiter to use when splitting.

Sunday, 20 August 2017

Tableau and negative zero

Another saga from the big data frontier: at work we have a source of GPS data, and I've been working with a colleague to aggregate it to a degree grid, not unlike the example linked. The data is in hive, and bizarrely it has two columns for each dimension, latitude magnitude as a positive float, and latitude sense as a string (N or S) etc. for longitude. To make our life simple we round the magnitude in the custom SQL that connects to hive, and we make the data signed again with a simple calculation in the Tableau source:
[round Latitude magnitude]*
(case [latitude sense]when 'S' then -1
when 'N' then 1
The rounding of the magnitude is fine as we also keep the sense in the group by dimensions in the custom SQL. The only special case here is when the rounded magnitude is zero, where had we done the sign assignment before the rounding, we'd have one bucket for zero instead of one for 0-0.5 and one for -0.5-0. But surely that shouldn't be an issue once we do the calculation above in tableau?

It turns out that it is an issue. I'm not sure what's happening in the data engine (two's complement going crazy because of negative zero?) but the two zeros are treated differently, recreated with the data of the post linked above.
 Sure enough, looking at the two zeros there's one for -0 and one for +0. So we refine our calculation to avoid multiplying zero by -1

if [round Longitude magnitude]=0 then 0 else
[round Longitude magnitude]*
(case [longitude sense]
when 'W' then -1
when 'E' then 1

Sunday, 11 June 2017

Pollster pollster, what do you know?

It's been quite a while since I put something on Tableau public. Once I saw this I couldn't resist. I can't quite get the trend line right, but let that be an exercise for the reader :)

Saturday, 27 May 2017

A green electoral trend

The bar chart below, from Green party leaflets, is very interesting. The argument here is that the share of their vote is growing and voting for them is not a lost vote. I have a lot of sympathy for the argument for proportional representation. The 2015 British national election was a joke, with UKIP being a third party in votes but not in seats, and SNP benefiting massively from first past the post to get a monopoly of Scottish seats that is not quite reflected in the vote counts. 
However, this chart poses a few questions for me. Sure, the green vote grew from 2010 to 2015, but was it a one off that cannot be projected into further growth for 2017? 
After all 2015 was the fall of the LibDems, and those voters who felt betrayed by Clegg but hadn't forgiven Labour either could vote Green. The situation of course is very different now, with Labour breaking away entirely from the Blairite past while the LibDems are trying to capitalise on the 48% Remain vote (by repeating the Guardian quote 'Corbyn betrayed us' among other things).
An equally interesting feature of this bar chart is the plateau from 2005 to 2010, which supports the argument that 2015 was an outlier and not the manifestation of an underlying trend. Last decade had very different priorities to the current one; the two big parties had largely taken on-board environmental concerns, the Tories even had a green tree as a logo (the same one that was repainted with the union jack recently), and even though the economic crisis was obvious at the end of the decade, the political priorities took a while to change. So maybe 2010 is the outlier, and the trend is there.
We'll only know once the 2017 results are out!

Sunday, 14 May 2017

The Tory deficit

So, another month, another election campaign. Cambridge is a two horse race again, this time there is no dispute on which parties are the two horses, as in the little snippet I scanned from the latest LibDem leaflet. But what about that black arrow over the blue bar? That's what I call 'the Tory deficit'. Not the financial deficit but the one of votes. Or is it?
Let's think about this. The black arrow would bring the Cons up to about 34%. Then the total would be 34+35+36=105% and that's without counting the small parties. I'm also not convinced the bars start from zero, but it's quite hard to compare the length of the arrow against the length of the bars. 
In terms of electoral rhetoric, 'Jeremy Corybn's (sic) candidate' is an interesting choice. Julian Huppert very much lost in 2015 because he was 'Nick Clegg's candidate'. Of course the Liberals are pushing the argument a bit too much by saying (in a previous leaflet that went to recycling without stopping by the scanner) 'Don't vote Tory, you might get Corbyn! 

Monday, 8 May 2017

It's the economy, stupid!

So I was listening to my BBC local station yesterday. Have you noticed how rather mediocre radio stations make a bit of an extra effort in their weekend programming? Normally this involves some specialist music shows, but BBC Cambridgeshire also has the Naked Scientists . One of the themes of the evening was language, and one of the featured scientists (hopefully not naked) was the economist Keith Chen. The fact he is not a linguistics professor is a crucial thing to note, as well as the fact that he teaches in the school of management and not in the economics department. But I digress.

Keith Chen's main point was that people speaking a language that has an explicit future tense (such as English or Greek) don't save as much money, don't take as much care of their health etc. compared to speakers of languages that don't have a future tense (aparently German is such a language). For the nuanced argument you can read the relevant paper which I have only skimmed through but hey, this is a blog, we don't take ourselves too seriously.

One of his main sources of data is the world values survey. The first thing I notice on visiting their site is the beautiful magic quadrant visualisation known as the Inglehart–Welzel Cultural Map, or occasionally the Welzel-Inglehart Cultural Map. This immediately screams 'Samuel Huntington Clash of civilisations' to me, but I haven't read that book either so I won't get carried away. Just notice how countries are bundled together in mysterious ways: Azerbaijan occasionally becomes orthodox, Israel and Greece catholic, the English speakers are of course exceptionally neither protestant nor catholic even though they could be either or neither, and the colouring does or doesn't always follow the religion, or the cluster, which contorts around accordingly.

So this wonderful source of data proves that future tense equipped languages like the ones mentioned above have speakers that don't plan for the future, and vice versa. The examples quoted included of course the UK and Greece as the worst savers in Europe. This tempted me to use the website facilities to get the table embedded below: 

TOTALCountry Code
Save money42.8%13.9%57.0%
Just get by38.4%66.3%24.7%
Spent some savings and borrowed money9.1%12.2%7.6%
Spent savings and borrowed money6.6%5.8%7.1%
DE,SE:Inapplicable ; RU:Inappropriate response; BH: Missing; HT: Dropped out survey0.2%-0.3%
No answer1.7%-2.6%
Don´t know1.1%1.9%0.8%
To me this data says one thing: People in Germany were well off at the time of the survey, and people in Cyprus were much less well off. When you have money to spare, you save, when you don't you get by, and that has little to do with your language and the way it expresses future events. It has a lot more to do with employment going up or down, banks doing well or being about to collapse, and the euro being too strong or too weak in relation to the country's economic health. In fact Chen went as far as citing Belgium, as an example of where everything else being the same, language is the only factor differentiating people. Perhaps he should check out some call record analysis proving that Belgium is really two parallel societies that meet in Brussels!

I was planning to finish on a note about the sad state of linguistic research but it would be wrong, actually the fact he is in the management school explains the unique blend of positivist prejudice displayed here.

Saturday, 6 May 2017

Histograms and data driven aggregation

Unavoidably, once you start taking your work seriously as 'data science' you have to do hypothesis testing. And to do hypothesis testing you need to know the distribution of your data. And the most intuitive way to see the distribution of your data is to plot a histogram.

So in that context, we have a go at plotting a histogram of a field in our data. The advice of our 'big data' provider is - you guessed it - pull data from Hive into a Spark data frame, do some operations, convert to RDD, do some more operations. I'm too lazy for all that so digging around I found that Hive has a histogram function. You might not like the idea as it returns an array of structures that contain the bin centres and the respective frequencies, and it uses some funky binary delimiters, different for the struct fields, the array elements and of course the fields returned by the query. This is complicated enough to merit its own post which I promise to do in the future, but in my book still preferred: No need to do 20 (or 50?) lines of configuration and functional programming where a SQL one-liner would do.

Anyway, having done that I was looking at another field for which we also needed a histogram, and realised that it is really a discete measurement, it was a number somewhere between 40 and 80 that only came with .00, .25, .50 and .75 in the decimal places. Maybe an unsigned 8 bit quantity at the point of measurement/analog to digital conversion? Anyway, that means that to do a histogram you can avoid the binning all together, the data is pretty much 'pre-binned'. Instead it becomes more like the first classic example of any Map Reduce related tutorial: a word count.  How many times does e.g. 50.25 appear in the data, and likewise for all values.

Knowing your data can always save time and effort when trying to analyse it. A key reason to like Tableau is the way it allows you to learn what your dataset looks like and explore it from all possible sides. I have to confess though, the final dataset was simple enough for the histogram to be done as a pivot-chart in Excel!