In a previous post I talked a little about the type of analytical work I used to do in my previous life as a consultant at Cluster (then DiamondCluster, then Mercer, now OliverWyman). Most of this work was done using SAS, a statistical package I personally dislike.
The reasons for using SAS were mostly historical. It was used by one of our first clients in this area, and then a certain myth developed around the tool. It has been proven to work, so... it has to be used. Consultants are a little bit risk-averse.
The whole concept is built around disk-based datasets: you read them, you sort them, you transpose them... and you always move from one dataset to another. Actually it is not so different from a relational database, and for all my dislike, SAS provides data manipulation capabilities that are at the very least similar to, and quite often better than (if harder to use), those of any database you may find.
That explains why people use SAS and are quite happy with it. And they have sound reasons for that. SAS provides an unparalleled ability to perform all types of manipulations and statistical operations on very large datasets. In the right environment, that is, with the right people, like some of the ones I met in the US who were extremely knowledgeable about statistics and had spent a lot of time programming in this environment, you have a winning combination.
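To make the comparison with a relational database concrete, here is a minimal sketch of the sort and transpose steps written as SQL queries. It is not SAS code: it uses Python with the standard-library sqlite3 module as a stand-in, and the table and column names (sales, region, quarter, amount) are invented purely for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, made-up 'sales' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("North", "Q1", 10.0),
            ("North", "Q2", 12.0),
            ("South", "Q1", 7.0),
            ("South", "Q2", 9.0),
        ],
    )
    return conn


def sorted_rows(conn):
    """The 'sort' step: order the dataset by amount, largest first."""
    return conn.execute(
        "SELECT region, quarter, amount FROM sales ORDER BY amount DESC"
    ).fetchall()


def transposed_rows(conn):
    """The 'transpose' step: one row per region, one column per quarter."""
    return conn.execute(
        """
        SELECT region,
               SUM(CASE WHEN quarter = 'Q1' THEN amount END) AS q1,
               SUM(CASE WHEN quarter = 'Q2' THEN amount END) AS q2
        FROM sales
        GROUP BY region
        ORDER BY region
        """
    ).fetchall()
```

Each function takes a dataset in and gives a dataset back, which is roughly the same dataset-to-dataset rhythm described above.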
But I still dislike the tool for three main reasons:
-I find it error-prone (the syntax is horrible), and I would not like to be judged on all the SAS code I had to produce.
-It is extremely programmer unfriendly (did I mention that the syntax is horrible?). It is always really hard to debug code written by somebody else, especially by people not trained in some basic programming practices, but SAS makes it even harder.
-SAS is not for casual programmers. If you are hiring people to offer them a career in consultancy, they will burn out if you force them into this. And finding good SAS programmers for contracting is not easy. To me this is a good enough reason to look for alternatives.
So if I am to replace SAS, I need to find something that gives me similar flexibility and a more productive environment. I have to say that so far I have not found it in one package, so I am in the process of building it from what is available.
My current setup involves three different elements:
-A database, currently MySQL, but I will probably move to Postgres.
-A statistical package, currently R. It is as good as SAS, and I like the fact that I can focus on the statistics and forget the data part (done in the database).
-An external programming environment to act as glue. I am currently using AppleScript and Perl (actually the Perl part has been done by one of my colleagues, Sergio). If I feel geeky enough I may even go to Objective-C on my Mac.
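The division of labour between the three elements can be sketched roughly as follows. This is not my actual setup, only an illustration under loud assumptions: Python stands in for the Perl and AppleScript glue, the standard-library sqlite3 module stands in for MySQL, the statistics module stands in for R, and the measurements table is made up.

```python
import sqlite3
import statistics


def load_measurements(conn):
    """Data part: let the database do the filtering and ordering."""
    rows = conn.execute(
        "SELECT value FROM measurements WHERE value IS NOT NULL ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]


def summarize(values):
    """Statistics part: works on clean values, knows nothing about storage."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def run_pipeline():
    """Glue part: set up the (hypothetical) data, then chain the two steps."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE measurements (id INTEGER PRIMARY KEY, value REAL)"
        )
        conn.executemany(
            "INSERT INTO measurements (value) VALUES (?)",
            [(2.0,), (4.0,), (None,), (6.0,)],  # one missing value on purpose
        )
        return summarize(load_measurements(conn))
    finally:
        conn.close()
```

The point of the split is that the missing value is dropped in SQL, so the statistical step never has to deal with it.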
My bet is that it will be easier to find people who know how to deal with SQL databases (a commodity nowadays) and a language like Perl. R is also quite popular. I also like the fact that all of these are publicly available software, which gives me a broad pool of talent to tap (and access to the sources if required). I also love the fact that all of this is running on my Mac, a much nicer environment to work in. I know people think Apple machines are mostly for "creative" guys who do media production. Actually they can be used as a scientific workstation quite easily. This is much better than doing the processing in MS Access, as I had to do a few times.
Of course my laptop is not the best environment to run a database server of several gigabytes, but this is easily scalable.
I will take a look at how the different elements are doing in a later post.