A comparison between R and Python for Data Science
The point of view of an experienced R user and Python novice
By Davide Pagin
I had to learn Python from scratch when I was already an experienced R user. This article is for everybody who knows at least one of these programming languages but is curious about the other. It’s always hard to switch from your preferred programming language, but sometimes you do it for your personal interest or for work needs.
Python and R are among the best-known programming languages for data science. In some respects they are similar, but there are important differences. The aim of this article is to explain the basic differences you will encounter when moving from one of these languages to the other.
Data science is a very broad concept that spans many fields of work, so it is difficult to examine all the differences between R and Python in a single article. For this reason, let’s focus on the main differences that a novice will encounter when changing programming languages:
- work environment
- essential libraries you must know
- different types and structures of data
- visualization of data
- operations on datasets
- machine learning
An important concept to understand before we dig into the details is the general purpose of each language: Python is easier to read and more suitable for people who are used to programming, while R is better suited to statistical analysis and research. Both are excellent tools for data science. Which works better for you depends on your background, so if you have only partial knowledge of one of the languages, my advice is to strengthen your skills in that language rather than learning the other one from scratch.
Lastly, you should know one of the most important differences between R and Python: how you count elements. Python starts counting from 0 (typical in computer science) and R starts counting from 1 (more typical in statistics).
So let’s get started. Here are the six key data science fields where the choice of programming language becomes crucial:
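As a quick illustration of the counting difference, here is a minimal Python sketch. The list `genres` and its values are made up for this example, and the comments note what the comparable R expression would be.

```python
# A made-up list of values, used only for illustration.
genres = ["Comedy", "Horror", "Drama"]

first = genres[0]        # Python: the first element has index 0 (in R: genres[1])
last = genres[-1]        # Python: -1 means "last element" (in R, a negative index drops elements)
first_two = genres[0:2]  # Python slices exclude the end index (in R: genres[1:2])

print(first, last, first_two)
```

The slice `genres[0:2]` is worth a second look: it returns two elements, not three, because the end index is excluded.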
WORK ENVIRONMENT
For R, the principal integrated development environment (IDE) is RStudio. There are other IDEs for R, but most of the community uses RStudio. The advantage of RStudio is its user-friendly and well-structured graphical interface. In fact, the RStudio desktop application (there is also a web version) is divided into four quadrants where you can manage your R script, write inside the console, inspect and control your variables or datasets, see your directory files or the graphs you have created, and perform many other operations.
For Python, there isn’t a single principal work environment. You have a wide range of options such as PyCharm, Komodo, and Eclipse; however, the IDE most similar to RStudio is surely Spyder. Being a habitual RStudio user, I found Spyder to be a perfect IDE for starting to learn Python. The two are very similar across all functionalities, but I want to emphasize one strong point of each. Installing new packages is more user friendly in RStudio than in Spyder because you can do it inside RStudio without using the command prompt. Spyder, on the other hand, lets you open several consoles in the same work environment. For those of you who really love RStudio, you can also manage Python scripts directly in RStudio by installing the reticulate library (more information on this later) and choosing Python Script as the new file type.
LIBRARIES
There are some essential Python libraries you should know to work effectively in data science:
- pandas to manage the dataset
- numpy to handle arrays or matrices
- scikit-learn to use machine-learning tools
- matplotlib, seaborn or plotly for data visualization
R is more flexible in that it allows the user to perform many operations without installing libraries. Nevertheless, the ggplot2 library is considered by many data scientists to be the best package for data visualization, and it can be considered the equivalent of Python’s matplotlib and seaborn. Furthermore, you can reproduce the typical numpy functions with base R commands, without extra libraries.
Scikit-learn was a beautiful discovery in Python. With R, I was used to importing a different library for every machine learning model, whereas with Python I started using this powerful package, which includes a lot of useful tools and most of the ML models. R also has a very broad package similar to scikit-learn, called caret, but in my opinion it is less structured than scikit-learn, and some operations, such as splitting a dataset into training and test sets, can usually be done in R with simple commands without any library.
Pandas is another fundamental library, used to manage datasets in Python. Dplyr is a similar library for R, but I’ve always felt comfortable performing those tasks with native R commands. One special mention goes to the tidyverse for R: it is a collection of essential data science packages, including the already mentioned dplyr and ggplot2 and many more.
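For comparison, this is roughly how a training/test split looks with scikit-learn. It is only a minimal sketch on invented toy data: `X`, `y`, the 25% test size and the `random_state` value are assumptions made for the example.

```python
from sklearn.model_selection import train_test_split

# Toy data invented for this example: 8 observations with 2 features each.
X = [[i, i * 2] for i in range(8)]
y = [0, 1, 0, 1, 0, 1, 0, 1]

# Hold out 25% of the rows as a test set; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(len(X_train), len(X_test))
```

The same module also contains tools for cross-validation, so the split and the model evaluation can come from a single package.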
TYPE OF DATA
Both Python and R support object-oriented programming: every element we create is an object, and every object has its own attributes and properties. Furthermore, every object is an instance of a class, which can be considered the blueprint of the object (for more information on class-based programming, see its Wikipedia page: https://en.wikipedia.org/wiki/Class-based_programming). In this paradigm every object has a type, which is the way the interpreter treats the object, or in simple words, the type represents the nature of the object.
This is a crucial concept where R and Python have different approaches, so it is essential to know how data types differ between the two programming languages. I can assure you that this is one of the first difficulties that you will encounter if you plan to move from R to Python.
The principal data types in R are:
- Character
- Numeric (real or decimal)
- Integer
- Logical
- Complex
Then, there are data types that correspond to data structures. The most important data structures are the following:
- Atomic Vector: This can be considered a collection of elements of the same type. If you try to create a vector with elements of mixed types, R automatically coerces them to a single type. For example, if you create a vector with the character “a” and the number 2, the resulting vector will be of character type, with “a” and “2” as the corresponding elements.
- Matrix: This object can be considered an atomic vector with two dimensions (rows and columns), and here too the elements must be of the same type. The vector and matrix objects in R closely mirror the analogous concepts in algebra.
- List: This element is a sort of “special” vector; in fact, a list could contain elements of different data types.
- Data Frame: This data structure is probably the most important one. In fact, in Data Science we study and try to discover information inside datasets. Data Frame is the R representation of a dataset and it’s a “special” list where every element inside the list has the same length.
- Factors: This is a crucial way to organize data in a qualitative form. Factors are character data which have been categorized. A factor variable is a variable with different levels, and every level corresponds to one characteristic.
In my mind, the factor’s data structure is a strong point for R. The different ways that Python uses to categorize data are not as efficient as the factor command in R.
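For readers coming from R, the closest counterpart to a factor in Python is probably the categorical dtype in pandas. The following is only a minimal sketch on made-up labels (the variable names are invented for the example), and it does not claim to match everything that R factors offer.

```python
import pandas as pd

# Made-up genre labels, used only for illustration.
genres = pd.Series(["Comedy", "Horror", "Comedy", "Comedy"], dtype="category")

levels = list(genres.cat.categories)   # comparable to levels() in R
codes = genres.cat.codes.tolist()      # integer codes stored behind the labels
counts = genres.value_counts()         # comparable to table() in R

print(levels, codes)
print(counts)
```

Note that the integer codes start at 0 here, whereas the integer codes behind an R factor start at 1.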
For R users, there is a particular package (reticulate) that you can use to discover the equivalent of R data structures in Python:
library(tidyverse)
library(reticulate)
x <- c(12, 42, 0.6, 3.17)
r_to_py(x) %>% class()
[1] "python.builtin.list" "python.builtin.object"
Python has more data types than R because it has additional data structures that are closer to computer science than to statistics. In my opinion, Python’s data structures are very precise and well designed; however, from a statistical point of view the data structures present in R are sufficient to perform data analysis. Here are the principal data types in Python:
- Text Type (str): This type is equal to R character and can also be considered as a sequence type
- Numeric Types (int, float, complex): These types correspond to the R integer, numeric and complex types
- Boolean (bool): This Python type is the equivalent of R logical; it is what you get when you compare values or evaluate the truthfulness of an expression, which returns True or False (written TRUE and FALSE in R)
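The basic types listed above can be inspected with the built-in `type()` function. This is a small sketch with made-up values; the variable names are hypothetical and chosen only for the example.

```python
# Made-up values, one per basic type.
title = "Some movie"     # str
votes = 1200             # int
rating = 8.5             # float
z = 2 + 3j               # complex
is_horror = True         # bool (note the capitalization: True/False, not TRUE/FALSE)

# Print the type name of each value.
for value in (title, votes, rating, z, is_horror):
    print(type(value).__name__)

# A comparison evaluates to a bool.
high_rated = rating > 8
```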
Then, there are the types that refer to collections of data, the principal ones being list, tuple, set and dictionary. These types are defined by a few characteristics: they may or may not be ordered, changeable, or allow duplicate values. Ordered means that the collection has a defined order, and if you add a new element, it is placed at the end. Changeable means that we can modify the collection by adding, removing or changing its items. Ordered collections that are indexed by position, such as lists and tuples, are called sequence types.
- List: This data structure is similar to the R list because both allow mixed types; it is used to store multiple items in a single variable. Lists are ordered, changeable and allow duplicate values.
- Tuple: This sequence type also lets you store different items in a single variable, but unlike lists, tuples aren’t changeable. When changeability is not needed, tuples should be preferred to lists because they are faster to create and query. It’s difficult to identify a similar data structure in R.
- Dictionary: This collection is used for data in key:value format, for example when we have a set of features or variables and a value for each one. Dictionaries are ordered (as of Python 3.7) and changeable, but they don’t allow duplicate keys: it’s not possible to create two elements with the same key. Dictionaries are also similar to R lists in which every element has been given a name.
- Set: This data structure differs from the others because it is unordered, so its items may appear in a different order each time you use it. Sets are changeable and don’t allow duplicate values. The set is the collection type that I appreciated the most when I began to learn Python; I couldn’t find a similar data structure in R. At most you can create a vector and then select its unique values, but personally, I think the set object is much more intuitive and easy to use. When you need to work with unique values, sets should be preferred to lists and tuples because they are faster and more robust (i.e., a membership check takes O(1) time on average).
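The four collection types described in the list above can be sketched in a few lines. All names and values here are invented for illustration.

```python
# All values below are invented for illustration.
ratings_list = [8.5, 7.9, 8.5]      # list: ordered, changeable, duplicates allowed
ratings_list.append(6.1)            # new items are placed at the end

ratings_tuple = (8.5, 7.9, 8.5)     # tuple: ordered, duplicates allowed, not changeable
tuple_is_immutable = False
try:
    ratings_tuple[0] = 9.0          # item assignment is not supported by tuples
except TypeError:
    tuple_is_immutable = True

movie = {"title": "Example", "rating": 8.5}   # dict: key:value pairs with unique keys
movie["rating"] = 9.0                          # assigning to an existing key overwrites its value

genres = {"Comedy", "Horror", "Comedy"}        # set: the duplicate "Comedy" is dropped
has_horror = "Horror" in genres                # membership test

print(ratings_list, ratings_tuple, movie, sorted(genres), has_horror)
```

The set is printed through `sorted()` because a set has no defined order of its own.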
These are the principal and most used types, which are essential for understanding the Python world. There are a few more built-in types in Python, for example binary sequence types and range types; the latter are useful for operations such as looping a specific number of times in a for loop. On the official Python site you can find extensive documentation on the principal functionality of Python types (https://docs.python.org/3/library/stdtypes.html#ranges). In addition, there are new types that you will find when importing libraries such as pandas or NumPy. An example is the pandas DataFrame, which is comparable to the data frame type in R.
DATA VISUALIZATION: GGPLOT2 VS SEABORN
One of the main debates between Python and R is: which one is better for data visualization? As far as I am concerned, the answer is R, and in particular the ggplot2 package, which I consider more complete and better at producing appealing graphs.
This topic probably deserves a separate article, with a complete comparison between ggplot2 and seaborn and a full analysis of their styles across different graphs. For the sake of this article, we will show two different graphs made with ggplot2 and seaborn, so we can see the style differences between them.
To create these graphs, I used a Kaggle dataset about the characteristics of movies and TV series listed on the IMDb site (https://www.kaggle.com/bharatnatrayn/movies-dataset-for-feature-extracion-prediction). After some preprocessing, I created the dataset I needed by adding a new variable that defines whether the genre of the movie or TV series is comedy or horror. The intent isn’t to show patterns in the data but only to show the aesthetic differences between the graphs. In both cases, I made some adjustments to enhance the visualization (e.g., size and name of the axes and title, size of the points); however, I did not change the key characteristics of the graphs (e.g., the grey background of ggplot2 or the position and style of the legend). On this last point, it is interesting that ggplot2 automatically places the legend outside the graph while seaborn keeps it inside.
So, the graphs below are created using the same data. The first figure uses ggplot2 in R and the second uses seaborn in Python.
STANDARD EASY GRAPH MADE WITH GGPLOT2 IN R
STANDARD EASY GRAPH MADE WITH SEABORN IN PYTHON
To mark the differences between the two libraries, we could make a more sophisticated graph by adding names of movies or TV series episodes inside the graph. I chose to insert only the top six rated movies to avoid overlapping of text and I tried to make the two graphs of ggplot2 and seaborn as similar as possible while leaving the original background style of the two packages. These are the results:
STANDARD COMPLICATED GRAPH MADE WITH GGPLOT2 IN R
STANDARD COMPLICATED GRAPH MADE WITH SEABORN IN PYTHON
Note: At the end of the article, you can find the code used to create the final dataset and make the graphs.
MACHINE LEARNING AND OPERATIONS IN DATASET
Machine learning is one of the core applications for which R and Python are used, and for me, as a data scientist, it is my field of work. For this reason, I will try to explain the crucial differences between doing machine learning with Python and with R.
The first important thing is dataset management. As mentioned before, the pandas library is essential for performing operations on datasets in Python. For R, the dplyr package does almost the same thing, but R, together with RStudio, offers some conveniences that Python doesn’t provide automatically. For example, if you want to choose a particular column in a dataset and you are not sure about its name, you can type the name of the dataset followed by the dollar sign ($). A drop-down list appears with all the column names and you can choose the one you had in mind. If you have a lot of variables, you can type a few letters and you will see only the columns whose names contain them. Python, by contrast, trains your memory: you have to remember the column name. If you do not remember it, then after installing the Kite plugin in Spyder you can type a dot after the dataset name (e.g., dataframe.column_of_interest); however, this does not work like the dollar sign in R. The drop-down list only appears after you type some letters, and it shows not only the dataset’s columns but also the attributes of the dataset object. I personally find this annoying, especially when dealing with large datasets.
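If you cannot remember a column name in Python, one workaround that does not depend on the IDE is to list the columns yourself. This is a minimal sketch on a tiny stand-in DataFrame: the column names mirror the movies dataset used in this article, while the values and the search string "RAT" are made up for the example.

```python
import pandas as pd

# A tiny stand-in DataFrame; the values are invented for illustration.
movies = pd.DataFrame(
    {
        "MOVIES": ["A", "B"],
        "GENRE": ["Comedy", "Horror"],
        "RATING": [7.1, 6.4],
        "VOTES": [1200, 800],
    }
)

# Every column name, in order.
all_columns = list(movies.columns)

# Only the column names that contain a given group of letters.
matching = [name for name in movies.columns if "RAT" in name.upper()]

print(all_columns)
print(matching)
```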
Then, after cleaning and exploratory data analysis, you move on to choosing the right model for your data, which is the core part of machine learning. Both R and Python offer plenty of tools, with many functions for different problems such as regression, classification, and clustering. We can find all the functions we need in the caret package for R and in the scikit-learn library for Python. However, as I mentioned before, I personally learned to do machine learning in R without using caret. Basically, for each model I imported the library I needed and performed all the other operations with native commands. On one hand, doing machine learning this way, writing every function you need yourself (such as those for a ROC curve or a confusion matrix in classification), can be very time-consuming, and I think it is an inefficient way to work. On the other hand, it turns out to be very useful for improving your programming skills. So, for beginners it can be an excellent way to become a good R programmer; in fact, this is how my professors taught me statistical models in R. For people with more experience, it is better to learn caret directly, with all the functions it offers. When I started learning Python, I decided instead to do machine learning straight from the scikit-learn package.
Personally, I found scikit-learn better than caret; in particular, it is more user friendly and its web documentation is clearer. A minor weakness of scikit-learn is that its model outputs don’t show p-values; however, you can compensate for this with the statsmodels library, which provides more statistical indicators. With caret, you get p-values and, in general, more statistical information in the output (e.g., standard errors and t-values). Another possible advantage of R is that you only need to load the caret package, whereas in Python you must import every single machine learning tool you want to use.
CONCLUSIONS
In the end, I think that both programming languages are amazing for data science. The progress of both in recent years has been impressive, and we are very lucky to be able to choose our preferred software for our work. In general, I recommend using both R and Python, but for some operations I certainly have a preference. As I mentioned before, I prefer R for data visualization and Python for machine learning.
So the final question is: If I already know one programming language for data science very well, is it beneficial to learn another one?
My answer is definitely yes, for many reasons. First, by learning a new programming language, you will surely learn some things (e.g., commands or functions) that you did not even know existed in the other language. Second, if you know two different programming languages very well, you also know their strengths and weaknesses and can decide when it is better to use one or the other. Lastly, having more knowledge and experience in your work is always a good thing, and knowing more than one programming language for data science can be very useful for collaborating with colleagues with different backgrounds and for meeting your work needs.
CODE TO CREATE GRAPHS
R
# Load the data handling, plotting and labelling packages
library(tidyverse)
library(ggplot2)
library(ggrepel)
library(ggeasy)
# Read the dataset and flag each title as Comedy or Horror
movies = read.csv("movies.csv")
movies$genre_movie = NA
for (i in 1:nrow(movies)) {
  if (grepl("Comedy", movies$GENRE[i])) {movies$genre_movie[i] = "Comedy"}
  if (grepl("Horror", movies$GENRE[i])) {movies$genre_movie[i] = "Horror"}
}
# Remove the thousands separators from VOTES, drop rows with missing values
# (including titles that are neither Comedy nor Horror) and sort by rating
movies$VOTES = as.numeric(gsub(",", "", movies$VOTES))
movies = na.omit(movies)
movies = movies[order(movies$RATING, decreasing = TRUE), ]
# First graph: total votes against rating, colored by genre
first_graph = ggplot(data = movies) +
  geom_point(mapping = aes(x = RATING, y = VOTES, color = genre_movie), size = 1) +
  ggtitle("Total votes and Rating of movies/tv series \n divided by genre") +
  xlab("Rating") + ylab("Total votes") +
  ggeasy::easy_center_title() +
  theme(plot.title = element_text(size = 13))
# Second graph: the 50 best-rated titles, labelling those with more than 50,000 votes
second_graph = ggplot(movies[1:50, ], aes(x = RATING, y = VOTES)) +
  geom_point(color = "blue", size = 0.9) +
  geom_label_repel(aes(label = ifelse(VOTES > 50000, as.character(MOVIES), '')),
                   box.padding = 0.2,
                   color = "red",
                   fill = "yellow",
                   point.padding = 0.2,
                   label.size = 0.1,
                   force = 50,
                   size = 3,
                   max.overlaps = 20,
                   segment.color = 'grey50') +
  ggtitle("Movies/Tv series most rated \n between the 50 better rated") +
  xlab("Rating") + ylab("Total votes") +
  ggeasy::easy_center_title() +
  theme(plot.title = element_text(size = 13))
PYTHON
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Read the final dataset directly, as prepared with R
movies = pd.read_csv("movies_sort.csv")
# First graph: total votes against rating, colored by genre
fig, ax = plt.subplots(1, 1, figsize=(12, 7), sharey=False)
ax = sns.scatterplot(data=movies, x="RATING", y="VOTES", hue="genre_movie", s=25, ax=ax)
ax.set_title("Total votes and Rating of movies/tv series \n divided by genre", size=20)
ax.set_xlabel("Rating", size=18)
ax.set_ylabel("Total votes", size=18)
plt.tight_layout()
plt.show()
# Second graph: the 50 best-rated titles, labelling those with more than 50,000 votes
movies1 = movies[0:50]
plt.figure(figsize=(12, 7))
sns.scatterplot(data=movies1, x="RATING", y="VOTES")
for i in movies1[movies1.VOTES > 50000].index:
    plt.text(x=movies1.RATING[i] + 0.02,
             y=movies1.VOTES[i] + 0.05,
             s=movies1.MOVIES[i],
             fontdict=dict(color="red", size=10),
             bbox=dict(facecolor="yellow", alpha=0.5))
plt.title("Movies/Tv series most rated \n between the 50 better rated", size=20)
plt.xlabel("Rating", size=18)
plt.ylabel("Total votes", size=18)
plt.show()