Data manipulation primitives in R and Python

时间:2023-02-22 12:23:14

Data manipulation primitives in R and Python

Both R and Python are incredibly good tools to manipulate your data and their integration is becoming increasingly important1. The latest tool for data manipulation in R is Dplyr2 whilst Python relies onPandas3.

In this blog post I'll show you the fundamental primitives to manipulate your dataframes using both libraries highlighting their major advantages and disadvantages.

Theory first

Data Frames are basically tables. Codd, E.F. in 1970 defined Relational algebra4 as the basic the theory to work on relational tables. It defines the following operations:

  • Projection (π)
  • Selection (σ)
  • Rename (ρ)
  • Set operators (union, difference, cartesian product)
  • Natural join (⋈)

SQL dialects also added the following

  • Aggregations
  • Group by operations

Why we care? People redefined these basic operations over and over in the last 40 years starting with SQL until today latest query languages. This framework will give us a general language independent perspective on the data manipulation.

Hands on

I will use the nycflights13 dataset used to introduce dplyr5 to present the functions. If you are interested you can download this entire blog post as an IPython notebook. Let's initialise our environment first:

 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# Load the R magic command
%load_ext rpy2.ipython
from rpy2.robjects import pandas2ri
 
# numpy available as np
# pyplot available as ply
%pylab
 
from pandas import *
 
Using matplotlib backend: MacOSX
Populating the interactive namespace from numpy and matplotlib
 
%%R -o flights
#sink(type="output")
sink("/dev/null")
library("dplyr");
library(nycflights13);
 
flights = pandas2ri.ri2py(flights)
 

Data summary

In both libraries it is possible to quickly print a quick summary of your dataframe. Pandas has an object oriented approach and you can invokehead()tail() and describe() directly on your dataframe object. R has a procedural approach and its functions take a dataframe as the first input.

Python R
df.head() head(df)
df.tail() tail(df)
df.describe() summary(df)
 
1
2
3
4
5
6
7
8
9
10
# Python
 
flights.head();
flights.tail();
flights.describe();
 
%R head(flights);
%R tail(flights);
%R summary(flights);
 

Selection

In pandas in order to select multiple rows you need to use the []operator with the element-wise operators like & and |. If you don't use the element-wise operators you will get the following error: ValueError: The truth value of a Series is ambiguous. Another solution is to install the numexpr6 package and use the query() function.

dplyr instead provides the filter() function. Combined with the pipe operator %>% the code is incredibly readable in my opinion. Notice how repeating the dataframe variable in the boolean expression is not needed in this case.

Python R
df[bool expr with element-wise operators] df %>% filter(bool expr)
df.query('bool expr')  
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
# Python
 
# Simple filtering
flights[flights.month <= 3];
 
# Filtering with element-wise operators
flights[(flights.month >= 3) & (flights.month <= 6)];
 
# with numexpr
flights.query('month >= 3 & month <= 6');
 
 
%R flights %>% filter(month >= 3 & month <= 6);
 

Projection

You can use the projection operation to extract one (or more) columns from a dataframe. In Python you pass an array with the columns you are interested in to the DataFrame object. In dplyr the projection function is called select, inspired by SQL.

Python R
df[['col_1', 'col_2']] df %>% select(col_1, col_2)
 
1
2
3
4
5
6
# Python
flights[['year', 'month']];
 
 
%R flights %>% select(month, year);
 

Rename

The rename operation is used to simply rename a column of your dataframe keeping the content intact. In pandas you use the rename7function and you provide a dictionary. In R you use a comma separated list of assignments.

Python R
df.rename(columns={'col_name': 'col_new_name'}) df %>% rename(col_new_name = col_name)
 
1
2
3
4
5
6
# Python
flights.rename(columns={'month': 'TheMonth'});
 
 
%R flights %>% rename(TheMonth = month);
 

Union

The relational algebra uses set union, set difference, and Cartesian product from set theory, with the extra constraint that tables must be "compatible", i.e. they must have the same columns.

You can use the union in Pandas using the concat()8 operation. You need to take some extra care for indexes though and for that I'll forward you to the docs9.

In R you rely on the bind_rows10 operator.

Python R
concat([df_1, df_2]); rbind_list(df_1, df_2)
 
1
2
3
4
5
6
7
8
9
10
11
12
# Python
 
df_1 = flights.query('dep_time == 518');
df_2 = flights.query('dep_time == 517');
concat([df_1, df_2]);
 
%%R
sink("/dev/null")
df_1 = filter(flights, dep_time == 517);
df_2 = filter(flights, dep_time == 518);
bind_rows(df_1, df_2);
 

Difference

To the best of my knowledge there is no set difference operator in Python. In order to achieve the result we must rely on the select operator.

In dplyr there is a native operator setdiff that does exactly what we expect.

Python R
set_a[~set_a.column.isin(set_b.column)] set_a %>% setdiff(set_b)
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# Python
 
set_a = flights.query('dep_time == 517 | dep_time == 518');
set_b = flights.query('dep_time == 518 | dep_time == 519');
 
selection = ~set_a.dep_time.isin(set_b.dep_time);
set_a[selection];
 
%%R
#sink(type="output")
#sink("/dev/null")
set_a = filter(flights, dep_time == 517 | dep_time == 518);
set_b = filter(flights, dep_time == 518 | dep_time == 519);
set_a %>% setdiff(set_b)
 

Cartesian product

You can use the cartesian product to combine two tables with a disjoint set of columns. In practice this is not a very common operation and as a result both libraries lack a pure cartesian product (in favour of the way more common join operator).

A simple trick to overcome this limitation is to create a temporary column first, perform the join and finally remove the temporary column. This can be done both in Python and R using the merge() andfull_join methods.

Python R
merge(...) with tmp column full_join(...) with tmp column
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# Python
 
df_1 = DataFrame({
        'name': ['Jack', 'Mario', 'Luigi']
    });
df_2 = DataFrame({
        'surname': ['Rossi', 'Verdi', 'Reacher']
    });
 
df_1['tmp'] = np.nan;
df_2['tmp'] = np.nan;
 
merge(df_1, df_2, on='tmp').drop('tmp', axis=1);
 
%%R
#sink(type="output")
sink("/dev/null")
df_1 = data.frame(
    name=c('Jack', 'Mario', 'Luigi'),
    stringsAsFactors=FALSE)
 
df_2 = data.frame(
    surname=c('Rossi', 'Verdi', 'Reacher'),
    stringsAsFactors=FALSE)
 
df_1$tmp = NA
df_2$tmp = NA
 
full_join(df_1, df_2, by="tmp") %>% select(-tmp)
 

Natural Join

If you are used to SQL you are definitely aware of what a join operation is. Given two dataframe R and S the result of the natural join is the set of all combinations of tuples in R and S that are equal on their common attribute names.

In Python you can rely on the very powerful merge11 command.
In R Dplyr you can use the full_join12 command.

Python R
merge(..., on="key", how="outer") full_join(..., by="key")
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
# Python
df_1 = DataFrame({
        'name': ['Jack', 'Mario', 'Luigi'],
        'department_id' : [30, 31, 31]
    });
df_2 = DataFrame({
        'department_name': ['Sales', 'Product', 'Finance'],
        'department_id' : [30, 31, 32]
    });
 
merge(df_1, df_2, on="department_id", how="outer");
 
%%R
#sink(type="output")
sink("/dev/null")
 
df_1 = data.frame(
    name=c('Jack', 'Mario', 'Luigi'),
    department_id=c(30, 31, 31),
    stringsAsFactors=FALSE)
 
df_2 = data.frame(
    department_name=c('Sales', 'Product', 'Finance'),
    department_id=c(30, 31, 32),
    stringsAsFactors=FALSE)
 
full_join(df_1, df_2, by="department_id")
 

Aggregations

An aggregate function is a function that takes the values of a certain column to form a single value of more significant meaning. Typical aggregate functions available in the most common SQL dialects include Average(), Count(), Maximum(), Median(), Minimum(), Mode(), Sum().

Both R and Python dataframes provides methods to extract this information. In this case I would say that Python handle missing values default better, whilst on R we have to provide na.rm = TRUE.

Python R
df.<column>.mean() summarise(df, test=mean(<column>, na.rm = TRUE))
df.<column>.median() summarise(df, test=median(<column>, na.rm = TRUE))
df.<column>.std() summarise(df, test=sd(<column>, na.rm = TRUE))
df.<column>.var() summarise(df, test=var(<column>, na.rm = TRUE))
df.<column>.min() summarise(df, test=min(<column>, na.rm = TRUE))
df.<column>.max() summarise(df, test=max(<column>, na.rm = TRUE))
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# Python
 
flights.dep_time.mean();
flights.dep_time.median();
flights.dep_time.std();
flights.dep_time.var();
flights.dep_time.min();
flights.dep_time.max();
flights.dep_time.mean();
 
%%R
#sink(type="output")
sink("/dev/null")
 
summarise(flights, test=mean(dep_time, na.rm = TRUE))
summarise(flights, test=median(dep_time, na.rm = TRUE))
summarise(flights, test=min(dep_time, na.rm = TRUE))
summarise(flights, test=max(dep_time, na.rm = TRUE))
summarise(flights, test=sd(dep_time, na.rm = TRUE))
summarise(flights, test=var(dep_time, na.rm = TRUE))
 

Group by

Aggregations become useful especially when used in conjunction with the group by operator. This way we are able to compute statistics for a number of group subsets with just one command.

Both Python and R provides the function to run a group by.

Python R
df.groupby('<column>') df %>% group_by(<column>)
 
1
2
3
4
5
6
7
8
9
10
11
12
13
# Python
 
# For any given day compute the mean of the flights departure time
flights.groupby('day').dep_time.mean();
 
%%R
#sink(type="output")
sink("/dev/null")
 
flights %>%
    group_by(day) %>%
    summarise(dep_time_mean=mean(dep_time, na.rm = TRUE))
 

Conclusion

I think the Pandas and Dplyr paradigms are very different between each other.

Pandas is more focused on object orientation and good defaults. Indexes are a first class entity and as a result some operations that you expect to be simple are instead quite difficult to grasp.

Conversely Dplyr is procedural. I love the pipe operator and manipulating my data feels incredibly smooth. The only sad note is that sometimes functions defaults are not that great. I haven't tested the speed in this blog post but I assume that since indexes are hidden in Dplyr the speed is probably much lower in general.

In conclusion I feel like Dplyr and R are the perfect tool for some early exploration of the data. But if you are serious about the code you are producing you should probably switch to Python to productionise your data analysis.

Let me know your approach in the comments!

References

    1. IEEE 2015 top programming languages
    2. Dplyr homepage
    3. Pandas homepage
    4. Relational Algebra
    5. NYC Flights 2013 dataset
    6. Numexpr python package
    7. pandas.DataFrame.rename
    8. Merging data frames in Panda
    9. Concatenating objects in Pandas
    10. Dplyr bind function
    11. Pandas merge function
    12. Dplyr documentation

Data manipulation primitives in R and Python的更多相关文章

  1. Best packages for data manipulation in R

    dplyr and data.table are amazing packages that make data manipulation in R fun. Both packages have t ...

  2. Data Manipulation with dplyr in R

    目录 select The filter and arrange verbs arrange filter Filtering and arranging Mutate The count verb ...

  3. The dplyr package has been updated with new data manipulation commands for filters&comma; joins and set operations&period;(转)

    dplyr 0.4.0 January 9, 2015 in Uncategorized I’m very pleased to announce that dplyr 0.4.0 is now av ...

  4. R 调用 python

    上一篇说了python使用 rpy2 调用 R,这里介绍R如何调用python.R的强项在于统计方面,尤其是专业的统计分析,统计检验以及作图功能十分强大,但是在通用性方面,就远不如Python了,比如 ...

  5. R vs Python,数据分析中谁与争锋?

    R和Python两者谁更适合数据分析领域?在某些特定情况下谁会更有优势?还是一个天生在各方面都比另一个更好? 当我们想要选择一种编程语言进行数据分析时,相信大多数人都会想到R和Python——但是从这 ...

  6. 随机森林入门攻略(内含R、Python代码)

    随机森林入门攻略(内含R.Python代码) 简介 近年来,随机森林模型在界内的关注度与受欢迎程度有着显著的提升,这多半归功于它可以快速地被应用到几乎任何的数据科学问题中去,从而使人们能够高效快捷地获 ...

  7. 让R与Python共舞

    转载:http://ices01.sinaapp.com/?p=129      R(又称R语言)是一款开源的跨平台的数值统计和数值图形化展现 工具.通俗点说,R是用来做统计和画图的.R拥有自己的脚本 ...

  8. 决策树ID3原理及R语言python代码实现(西瓜书)

    决策树ID3原理及R语言python代码实现(西瓜书) 摘要: 决策树是机器学习中一种非常常见的分类与回归方法,可以认为是if-else结构的规则.分类决策树是由节点和有向边组成的树形结构,节点表示特 ...

  9. 做量化模型Matlab、R、Python、F&num;和C&plus;&plus;到底选择哪一个?

    MATLAB是matrix&laboratory两个词的组合,意为矩阵工厂(矩阵实验室).是由美国mathworks公司发布的主要面对科学计算.可视化以及交互式程序设计的高科技计算环境.它将数 ...

随机推荐

  1. 设计模式之单例模式Singleton(三创建型)

    1.什么事单例模式? 单例模式确保某个类只有一个实例,而且自行实例化并向整个系统提供这个实例. 单例模式有以下特点: 1.单例类只能有一个实例. 2.单例类必须自己创建自己的唯一实例. 3.单例类必须 ...

  2. AWT编程学习01&lpar;未完成&rpar;

    本文资料来源:<java疯狂讲义> 作者:李刚 终于要学习这一部分了~~虽然很多人(明明是绝大多数的人)说学这个没有用...而且有点过时了...但我觉得很有意思啊...感兴趣就学~~反正多 ...

  3. &OpenCurlyDoubleQuote;Assign Random Colors” is not working in 3ds Max 2015

    Go to Customize -> Preferences…-> General (tab) Uncheck “Default to By Layer for New Nodes”

  4. uvalive 7331 Hovering Hornet 半平面交&plus;概率期望

    题意:一个骰子在一个人正方形内,蜜蜂在任意一个位置可以出现,问看到点数的期望. 思路:半平面交+概率期望 #include<cstdio> #include<cstring> ...

  5. Asp&period;Net mvc筛选器中返回信息中断操作

    在mvc中,使用response.end()或Response.Redirect("url"); 是无法阻止请求继续往下执行的.如果在action中,可以我们可以使用return ...

  6. BZOJ 2435 NOI2011 道路建设 BFS&sol;DFS

    标题效果:给定一个树(直接将树.不要贪图生成树图!).寻找每条边权值*分差的两侧之间 BFS水必须是能 竟DFS能够住...系统堆栈可能有些不够,我们可以使用内联汇编手册中大型系统堆栈 详见代码 这个 ...

  7. 201521123023《Java程序设计》第10周学习总结

    1. 本周学习总结 1.1 以你喜欢的方式(思维导图或其他)归纳总结异常与多线程相关内容. 2. 书面作业 本次PTA作业题集异常.多线程 1.finally 题目4-2 1.1 截图你的提交结果(出 ...

  8. 创建并在项目中调用SQLSERVER存储过程的简单示例

    使用SQLSERVER存储过程可以很大的提高程序运行速度,简化编程维护难度,现已得到广泛应用.创建存储过程 和数据表一样,在使用之前需要创建存储过程,它的简明语法是: 引用: Create PROC ...

  9. &lbrack;node&period;js&rsqb; fs&period;renameSync&lpar;&rpar;报错

    初学node.js,跟着node入门,操作了一遍.在最后一步,上传图片并显示时遇到报错 fs.js: throw err; ^ Error: ENOENT: no such file or direc ...

  10. CentOS 6&period;x 如何升级 glibc 2&period;17

    CentOS 6.x 如何升级 glibc 2.17 ldd --version rpm -qa | grep glibc #查看glibc的版本 strings /lib64/libc.so.6 | ...