R vs SAS (Comparison and Opinion)

Background

PC or Mac, Windows or Linux, Intel or AMD: we geeks simply love comparing things. This particular comparison, although not known in popular culture, is an oft-repeated argument in the analytics industry.

SAS needs no introduction; those who need one can check out the Wikipedia article or the LearnAnalytics SAS training section.

R, or rather the R statistical package, is very simply put the open source equivalent of SAS. For what it’s worth, R can do pretty much everything SAS can in terms of statistical analysis, and there are some pretty cool things R can do which SAS can’t. Say you want to build a predictive model using logistic regression: R can do it. ARIMA models, yes; decision trees, yes; association rule mining, yes; and so on.

Anything you envisage using SAS/STAT for in statistical analysis and data mining, R can do too.

What makes R Special?

So what if R can do everything SAS can? There are others, like SPSS and Statistica, which can also do pretty much what SAS can.

Yes, but are the other packages free? Therein lies the crux of the whole argument: R is free. It is an open source project that was started in New Zealand and is now considered one of the best statistical analysis tools in the world.

What’s the argument? Isn’t R always better?

It’s not that simple. Linux can do everything Windows can and more, but Windows still dominates. One of the biggest reasons for Windows’ continued dominance is momentum and an easier user experience. In spite of all the advantages Linux offers (better security, far fewer viruses, a comparable user experience, especially in the Ubuntu variants), the common man still prefers Windows. That is not to say Linux doesn’t have its die-hard following and a vibrant support community.

The same goes for R. I have used both SAS and R extensively, and I discuss the pros and cons of both packages below.

Statistical Capability

SAS/STAT and other SAS packages pack a powerful punch and cover almost the whole gamut of statistical analysis and techniques. However, since R is open source and people can submit their own packages/libraries, the latest cutting-edge techniques are invariably released in R first. To date, R has almost 15,000 packages in the CRAN (Comprehensive R Archive Network, the network of servers that distributes R and its contributed packages) repository.

Some of the latest techniques, such as GLMNET, random forests (RF) and AdaBoost, are available for use in R but not in SAS. Many experimental packages are also available in R. In fact, in most Kaggle competitions (which deserve a blog post of their own), the winners (who are amongst the world’s best data miners) have almost invariably used R to build their models.

In this aspect R is the hands-down winner. However, a word does need to be put in for SAS: since SAS is paid software with support, any innovation or new statistical technique has to be vetted and accepted before it ships. SAS is used in many mission-critical assignments where merely experimental techniques cannot be allowed to creep in. While this is necessary for the environment SAS works in, it also means that SAS will keep playing catch-up with R in terms of the latest innovations. On the other hand, since anybody can upload a package to R, user beware!

Therefore, in terms of pure statistical capabilities, I rate R higher.

Data Handling

Data handling is the bugbear of R. The single largest drawback of R is the way it allocates and handles memory: it tries to load the whole dataset into RAM. This can cause severe problems when working with a combination of large datasets and small computers (which it always is: your data is always huge and your computer is always puny!).

SAS excels at handling large datasets. In fact, server editions of SAS can chew through terabytes of data without any issues, whereas R is very likely to throw out-of-memory errors or become unresponsive and die.

That is not to say that R cannot handle big data; it can. But say I have a laptop with 2 GB of RAM and a dataset running into millions of records: the same exercise which SAS can do in 30 seconds might take R up to a few minutes, or R might even die.
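A rough back-of-the-envelope calculation shows why that laptop struggles. The sketch below is in Python purely for convenience, and the row and column counts are made-up illustration values, not measurements. It assumes every column is numeric and stored as an 8-byte double (which is how R stores numeric vectors), and it ignores the temporary copies that get made while the data is being worked on.

```python
# Rough estimate of the RAM needed to hold a fully numeric table in memory.
# Assumption: each value is an 8-byte double (how R stores numeric vectors).
# The row/column counts used below are hypothetical, chosen only for illustration.

BYTES_PER_DOUBLE = 8


def table_size_gib(n_rows: int, n_cols: int) -> float:
    """Return the raw size of an n_rows x n_cols numeric table in GiB."""
    return n_rows * n_cols * BYTES_PER_DOUBLE / 2**30


def fits_in_ram(n_rows: int, n_cols: int, ram_gib: float, copies: int = 1) -> bool:
    """Check whether `copies` full copies of the table fit in `ram_gib` of RAM."""
    return table_size_gib(n_rows, n_cols) * copies <= ram_gib


if __name__ == "__main__":
    rows, cols, laptop_ram = 10_000_000, 50, 2.0  # hypothetical dataset, 2 GB laptop
    print(f"table size: {table_size_gib(rows, cols):.2f} GiB")
    print("fits in RAM:", fits_in_ram(rows, cols, laptop_ram))
```

With these made-up numbers the raw table alone comes to roughly 3.7 GiB, well beyond the 2 GB laptop before the operating system or any working copies are counted.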

However, computing power is cheap and getting cheaper by the day. Given enough RAM and computing power, R can also crunch through large datasets efficiently, especially on 64-bit machines.

But for now, in terms of data handling, I rate SAS higher.

Ease of Use

One of the biggest reasons Linux has never been a runaway success compared to Windows is that it was so damn difficult to use, install or troubleshoot. Now take that problem and multiply it by 10, and you get the idea of R. There is no easy way to put it: R is not for the faint of heart. It is damn difficult to learn compared to SAS.

The SAS programming syntax can be considered a high-level language which is intuitive and easy to learn; additionally, it was designed as a DML (Data Manipulation Language). R programming, on the other hand, is a monster.

For example, say you have to do a simple data manipulation task such as sorting a few tables and joining them together. It would be a piece of cake to do this in SQL (any SQL package, or even PROC SQL) or with SAS data steps. Now consider doing this in C++ (makes your blood run cold, doesn’t it?).
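To make the SQL side of that comparison concrete, here is a minimal sketch of a sort-and-join. The SQL is run through Python’s built-in sqlite3 module only so that the example is self-contained; the `customers` and `orders` tables and their columns are invented for illustration, and the SELECT statement should carry over more or less as-is to other SQL dialects.

```python
import sqlite3

# Hypothetical tables, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO orders VALUES (10, 2, 250.0), (11, 1, 99.5), (12, 2, 40.0);
""")

# Join the two tables and sort the result: the whole task is one statement.
JOIN_AND_SORT = """
    SELECT c.name, o.order_id, o.amount
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY c.name, o.amount DESC
"""

rows = conn.execute(JOIN_AND_SORT).fetchall()
for row in rows:
    print(row)
```

The join and the sort are both declared in a single query, and the database decides how to carry them out; that is the kind of brevity the comparison above is pointing at.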

If SAS programming is high level and more akin to SQL, then R feels like a low-level language closer to C++. Even simple tasks can mean writing lengthy pieces of obfuscated code.

Learning R is definitely more challenging than learning SAS, but since R is a true programming language, it gives the programmer more flexibility and power than SAS. Mere mortals like the rest of us, though, would prefer to use the SAS programming language.

Support for R is another issue: obscure error messages can literally suck the lifeblood out of somebody who is fairly new to R. There are support groups and forums on the internet, but if you are using a new package and it throws an error, you are on your own.

All in all, for true programmers R is closer to the heart, but for the rest of us who just want to get our work done, SAS is the winner by a mile in terms of ease of use.

 

Recommendation

I have used both R and SAS, and there is no straightforward answer to this. For example, since R is free, it should be cheaper to use, shouldn’t it? Well, the answer is: not always.

The TCO (Total Cost of Ownership) of using R might actually be higher than that of SAS. For example, an analytics company decides to use R exclusively, figuring that since they don’t have to pay for SAS licenses, their cost of project delivery will go down: better profit margins, lower billing to the client, better competitiveness in the market. Win-win, right?

Except now they have to train their consultants on R, or hire outside talent. R programmers are in short supply (especially in India), which drives up the cost of resources for one. Now take into account the learning curve, the deployment cost and the code migrations of client legacy systems, not to mention the obscure tantrums that R can throw at you. And you can’t call anyone for support, since there is none; it’s free software. At least if SAS doesn’t work, you can hold them by their throats. (For the kind of licensing fee they demand, it’d better work!)

On the other hand, say you have a startup: a small team of really smart people. Investing in a SAS license may not make sense at this point. They can simply use what I call the RUM stack (R-Ubuntu-MySQL), a pun on the LAMP stack.

That is, use MySQL for heavy data manipulation and use R only for statistical analysis, on a machine running Ubuntu Linux. Everything for free! While this solution may work for a small company with high-calibre programmers, it is not scalable for a 25,000-person consulting organization which is run by processes and adherence, not individual brilliance.
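The division of labour in that stack can be sketched as follows: let the database do the heavy grouping, and hand only the small summarised result to the statistics tool. In this sketch sqlite3 stands in for MySQL and Python’s statistics module stands in for R, purely so that the example is self-contained; the `sales` table and its values are invented.

```python
import sqlite3
import statistics

# Stand-in for the MySQL side of the stack (hypothetical table and values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 30.0), ("south", 20.0), ("south", 40.0), ("east", 50.0)],
)

# Step 1 (database side): the heavy lifting, reducing one row per sale
# to one row per region.
per_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Step 2 (statistics side): work only on the small aggregated result.
totals = [total for _, total in per_region]
mean_total = statistics.mean(totals)
spread = statistics.stdev(totals)  # sample standard deviation
print(per_region, mean_total, spread)
```

The point of the design is that the statistics tool never has to hold the raw table in memory, only the handful of rows that come back from the query.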

My choice: if you are small and hungry, go for R. If you are a big organization where budget is not an issue, close your eyes and buy SAS licenses; everybody will be happy (but install R on your laptop nonetheless).