When it comes to analyzing data sets and finding insights for decision makers, we all start with Excel or Google Sheets. However, they both have serious limitations. Neither can handle “big data,” and you can’t rerun your formulas on new data without rebuilding either the source data or the formulas themselves.
Luckily for us, statisticians faced this problem long ago and developed two different Open Source software solutions to analyze large data sets.
Enter R and Python
R and Python are two of the most loved programming languages for analyzing large data sets and building replicable data models.
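To make “replicable” a little more concrete, here is a minimal sketch in Python. The function name `summarize`, the column names, and the numbers are all made up for illustration; the point is only that the same analysis code can be rerun on a new month of data without rebuilding anything.

```python
import csv
import io
import statistics


def summarize(csv_text, column):
    """Return count, mean, and median for one numeric column of CSV text."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # Skip blank cells so a missing value does not break the analysis.
    values = [float(row[column]) for row in rows if row[column] != ""]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


# Hypothetical survey data, invented for this example only.
january = "district,households\nNorth,120\nSouth,80\nEast,100\n"
february = "district,households\nNorth,130\nSouth,90\nEast,110\nWest,90\n"

# The same code runs unchanged on each new data set.
print(summarize(january, "households"))
print(summarize(february, "households"))
```

In a spreadsheet, the February data would mean copying formulas or extending ranges by hand. Here it is one more function call.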
R’s development began in 1992, and it became the preferred Open Source programming language of data scientists because it was developed explicitly by statisticians for data analysis and it could replace expensive proprietary systems like SAS. R code is often written in a procedural style, like BASIC or Pascal (remember those?!), breaking programming tasks down into a series of steps and subroutines, although the language also supports functional and object-oriented programming.
Python was conceived in 1989 and first released in 1991 as Open Source software with a focus on code readability and efficiency. Unlike R, it’s a general-purpose language built around object-oriented programming, which means it groups data and code into objects that can interact with and even modify one another, like Java and C++. Because Python is general-purpose, organizations use it for more than just data analysis: YouTube, Dropbox, Spotify, and even Google all use Python code.
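If “groups data and code into objects” sounds abstract, here is a small sketch of the idea. The `Survey` class and its methods are hypothetical, invented for this post rather than taken from any library. Each object carries its own data (the responses) alongside the code that works on it, and one object can modify another.

```python
class Survey:
    """Bundles survey responses (data) with the code that works on them."""

    def __init__(self, name):
        self.name = name
        self.responses = []

    def add_response(self, score):
        """Record one numeric response."""
        self.responses.append(score)

    def average(self):
        """Return the mean of all recorded responses."""
        return sum(self.responses) / len(self.responses)

    def merge(self, other):
        """Absorb another survey's responses, leaving the other one empty."""
        self.responses.extend(other.responses)
        other.responses = []


baseline = Survey("baseline")
baseline.add_response(4)
baseline.add_response(2)

followup = Survey("followup")
followup.add_response(3)

# One object interacts with, and modifies, another.
baseline.merge(followup)
print(baseline.average())
print(followup.responses)
```

A procedural version would keep the lists and the functions separate and pass the data around step by step. Neither approach is wrong; they are different ways of organizing the same work.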
Which Should You Use: R or Python?
After reading a Quartz article on R vs. Python in mainstream data analysis, I wondered which language is more popular in the international development industry. And from that, which language would be more beneficial for an emerging data scientist or software developer to learn, and for us all to use?
To answer these questions, I turned to four noted experts in data analysis. Here are their responses, lightly edited for clarity:
R and Python have different strengths. R is much easier for heavy stats work, and people from a science background tend to have an easier transition into it. Python is useful for much, much more than just data analysis, is easy to teach, and is becoming a more heavily used language because you can use your code straight up for big web apps without converting to another language.
I teach in Python because that’s where the jobs (and the future jobs) are. I also teach the basics of R for those people whose brains work that way.
We often don’t have much of a choice between R vs. Python because there are strong network effects in collaboration and supervision. For example, when I walked into my PhD supervisor’s office in 2006, it was all Stata in economics and political science in their network, so I learned Stata.
However, my choice is now Python for everything else, especially web apps and machine learning. Python is easy to code and it’s very ergonomic, while R’s syntax is the subject of many rants.
I would consider what’s used by your field or sector, your professional community, or your geographic community.
Beginners might under-appreciate how collaborative/social coding has become. If all the technical experts you know have built tools in R, it could be worth adopting R so you can tap those resources. In Washington, DC, data meetups are heavy on Python users, so if that’s how you want to learn/work, it’s a great choice. In Boston, I’d bet they’re R users.
Both languages have remained relevant because they have large/devoted online communities that contribute to the bodies of resources for using them, so choosing a language is also choosing which group will become your collaborators.
Generally, I agree on all of the points made so far, particularly around picking the right tool for the right use case. If you’re an analyst who already knows R well, you may often find that’s the most practical for analytical tasks. If you’re a developer who uses Python already, you would be better off building expert knowledge in that space.
Regardless of which tool you choose, one of my grad school professors gave us great advice: You should invest time in learning one package really well, rather than having general knowledge of all of them. Also, consider the domain where you expect to work, because, as mentioned above, the network effect is very strong with either software.
And the Winner Is…
In the immortal words of Linda Raftree, “it depends.” Which language you choose depends on the field you want to work in, the company you work for, even the team you work on, and which tool they use.
If you don’t have that clarity yet, Python seems to have the edge in international development, but I’m sure I’ll get an R defender in the comments.
I love seeing ICTWorks getting a bit more technical with a post like this! I know that 80% of ICT4D is design and facilitation, but it is nice to discuss stuff like this a bit more.
In most spatial analysis things I have worked on we have used both. Each for different steps. R is great for data cleaning.
Python. Because it will take you places R would. You will walk proudly with coders, and anyone who does not know that walking with coders is a basic human right is not worth listening to. kikikiki. But seriously, Python.
Sorry, where R would not!