How to reshape data from long to wide format?

Time: 2022-03-19 22:53:10

I'm having trouble rearranging the following data frame:

set.seed(45)
dat1 <- data.frame(
    name = rep(c("firstName", "secondName"), each=4),
    numbers = rep(1:4, 2),
    value = rnorm(8)
    )
dat1
        name  numbers      value
1  firstName       1  0.3407997
2  firstName       2 -0.7033403
3  firstName       3 -0.3795377
4  firstName       4 -0.7460474
5 secondName       1 -0.8981073
6 secondName       2 -0.3347941
7 secondName       3 -0.5013782
8 secondName       4 -0.1745357

I want to reshape it so that each unique "name" variable is a rowname, with the "values" as observations along that row and the "numbers" as colnames. Sort of like this:

        name          1          2          3          4
1  firstName  0.3407997 -0.7033403 -0.3795377 -0.7460474
5 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357

I've looked at melt and cast and a few other things, but none seem to do the job.

9 solutions

#1


180  

Using the base reshape() function:

reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")
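
Note that reshape() names the new columns value.1, value.2 and so on, rather than the bare 1, 2, 3, 4 asked for in the question. A minimal sketch of one way to strip that prefix, assuming the column is called value as in the example data (the variable name wide is only illustrative):

```r
# Recreate the example data from the question
set.seed(45)
dat1 <- data.frame(
  name    = rep(c("firstName", "secondName"), each = 4),
  numbers = rep(1:4, 2),
  value   = rnorm(8)
)

# Long to wide: one row per name, one column per level of numbers
wide <- reshape(dat1, idvar = "name", timevar = "numbers", direction = "wide")

# reshape() prefixes the new columns with "value."; remove the prefix
names(wide) <- sub("^value\\.", "", names(wide))
wide
```

The row names of the result are taken from the first long-format row of each group; they can be reset with rownames(wide) <- NULL if they get in the way.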

#2


95  

The new (in 2014) tidyr package also does this simply, with gather()/spread() being the terms for melt/cast.

library(tidyr)
spread(dat1, key = numbers, value = value)
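
Since tidyr 1.0.0, spread() has been superseded by pivot_wider(). It still works, but new code is usually written with the newer function. A minimal sketch of the equivalent call, assuming tidyr 1.0.0 or later is installed (the variable name wide is only illustrative):

```r
library(tidyr)

# Recreate the example data from the question
set.seed(45)
dat1 <- data.frame(
  name    = rep(c("firstName", "secondName"), each = 4),
  numbers = rep(1:4, 2),
  value   = rnorm(8)
)

# names_from supplies the new column names, values_from fills the cells
wide <- pivot_wider(dat1, names_from = numbers, values_from = value)
wide
```

The result is a tibble; wrap it in as.data.frame() if a plain data frame is needed.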

From github,

tidyr is a reframing of reshape2 designed to accompany the tidy data framework, and to work hand-in-hand with magrittr and dplyr to build a solid pipeline for data analysis.

Just as reshape2 did less than reshape, tidyr does less than reshape2. It's designed specifically for tidying data, not the general reshaping that reshape2 does, or the general aggregation that reshape did. In particular, built-in methods only work for data frames, and tidyr provides no margins or aggregation.

#3


62  

You can do this with the reshape() function, or with the melt() / cast() functions in the reshape package. For the second option, example code is

library(reshape)
cast(dat1, name ~ numbers)

Or using reshape2

library(reshape2)
dcast(dat1, name ~ numbers)

#4


26  

Another option if performance is a concern is to use data.table's extension of reshape2's melt & dcast functions

(Reference: Efficient reshaping using data.tables)

library(data.table)
setDT(dat1)
dcast(dat1, name ~ numbers, value.var = "value")
#          name          1          2         3         4
# 1:  firstName  0.1836433 -0.8356286 1.5952808 0.3295078
# 2: secondName -0.8204684  0.4874291 0.7383247 0.5757814

And, as of data.table v1.9.6 we can cast on multiple columns

## add an extra column
dat1[, value2 := value * 2]
## cast multiple value columns
dcast(dat1, name ~ numbers, value.var = c("value", "value2"))
#          name    value_1    value_2   value_3   value_4   value2_1   value2_2 value2_3  value2_4
# 1:  firstName  0.1836433 -0.8356286 1.5952808 0.3295078  0.3672866 -1.6712572 3.190562 0.6590155
# 2: secondName -0.8204684  0.4874291 0.7383247 0.5757814 -1.6409368  0.9748581 1.476649 1.1515627

#5


22  

Using your example dataframe, we could:

xtabs(value ~ name + numbers, data = dat1)
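
Keep in mind that xtabs() returns a contingency table rather than a data frame, sums the values when a name/numbers pair occurs more than once, and fills missing combinations with 0. A minimal sketch of converting the result into a data frame with name as an ordinary column, assuming each pair occurs exactly once as in the example data (tab and wide are illustrative names):

```r
# Recreate the example data from the question
set.seed(45)
dat1 <- data.frame(
  name    = rep(c("firstName", "secondName"), each = 4),
  numbers = rep(1:4, 2),
  value   = rnorm(8)
)

# Cross-tabulate: rows are names, columns are numbers, cells are summed values
tab <- xtabs(value ~ name + numbers, data = dat1)

# Keep the two-dimensional layout when converting to a data frame
wide <- as.data.frame.matrix(tab)

# Move the row names into a regular "name" column
wide <- data.frame(name = rownames(wide), wide, check.names = FALSE)
rownames(wide) <- NULL
wide
```

Calling plain as.data.frame() on the table would instead give back the long format, which is why as.data.frame.matrix() is used here.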

#6


14  

Two other options:

Base package:

df <- unstack(dat1, form = value ~ numbers)
rownames(df) <- unique(dat1$name)
df

sqldf package:

library(sqldf)
sqldf('SELECT name,
      MAX(CASE WHEN numbers = 1 THEN value ELSE NULL END) x1,
      MAX(CASE WHEN numbers = 2 THEN value ELSE NULL END) x2,
      MAX(CASE WHEN numbers = 3 THEN value ELSE NULL END) x3,
      MAX(CASE WHEN numbers = 4 THEN value ELSE NULL END) x4
      FROM dat1
      GROUP BY name')

#7


7  

Using the base R aggregate() function:

aggregate(value ~ name, dat1, I)
# name           value.1  value.2  value.3  value.4
#1 firstName      0.4145  -0.4747   0.0659   -0.5024
#2 secondName    -0.8259   0.1669  -0.8962    0.1681
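
Although it prints as four columns, the value part of this result is a single matrix column, and the values appear in the order of the rows within each group rather than being matched on numbers. A minimal sketch of flattening it into ordinary columns, assuming every group has the same number of rows and is already sorted by numbers (agg and wide are illustrative names):

```r
# Recreate the example data from the question
set.seed(45)
dat1 <- data.frame(
  name    = rep(c("firstName", "secondName"), each = 4),
  numbers = rep(1:4, 2),
  value   = rnorm(8)
)

# I() returns each group's values unchanged, so aggregate() binds them
# into one matrix column called "value"
agg <- aggregate(value ~ name, dat1, I)
is.matrix(agg$value)

# Rebuild the data frame so each matrix column becomes its own column
wide <- do.call(data.frame, agg)
names(wide)
```

If the groups have different sizes, aggregate() cannot build the matrix and returns a list column instead, so this approach only suits balanced data.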

#8


4  

There's a very powerful new package from genius data scientists at Win-Vector (the folks who made vtreat, seplyr and replyr) called cdata. It implements the "coordinated data" principles described in this document and also in this blog post. The idea is that regardless of how you organize your data, it should be possible to identify individual data points using a system of "data coordinates". Here's an excerpt from the recent blog post by John Mount:

The whole system is based on two primitives or operators cdata::moveValuesToRowsD() and cdata::moveValuesToColumnsD(). These operators have pivot, un-pivot, one-hot encode, transpose, moving multiple rows and columns, and many other transforms as simple special cases.

It is easy to write many different operations in terms of the cdata primitives. These operators can work in memory or at big data scale (with databases and Apache Spark; for big data use the cdata::moveValuesToRowsN() and cdata::moveValuesToColumnsN() variants). The transforms are controlled by a control table that itself is a diagram of (or picture of) the transform.

We will first build the control table (see blog post for details) and then perform the move of data from rows to columns.

library(cdata)
# first build the control table
pivotControlTable <- buildPivotControlTableD(table = dat1, # reference to dataset
                        columnToTakeKeysFrom = 'numbers', # this will become column headers
                        columnToTakeValuesFrom = 'value', # this contains data
                        sep="_")                          # optional for making column names
# perform the move of data to columns
dat_wide <- moveValuesToColumnsD(tallTable =  dat1, # reference to dataset
                    keyColumns = c('name'),         # this(these) column(s) should stay untouched
                    controlTable = pivotControlTable # control table above
                    )
dat_wide
#>         name  numbers_1  numbers_2  numbers_3  numbers_4
#> 1  firstName  0.3407997 -0.7033403 -0.3795377 -0.7460474
#> 2 secondName -0.8981073 -0.3347941 -0.5013782 -0.1745357

#9


2  

The base reshape function works perfectly fine:

df <- data.frame(
  year   = c(rep(2000, 12), rep(2001, 12)),
  month  = rep(1:12, 2),
  values = rnorm(24)
)
df_wide <- reshape(df, idvar="year", timevar="month", v.names="values", direction="wide", sep="_")
df_wide

Where

  • idvar is the column of classes that separates rows

  • timevar is the column of classes to cast wide

  • v.names is the column containing numeric values

  • direction specifies wide or long format

  • the optional sep argument is the separator used in between timevar class names and v.names in the output data.frame.

If no idvar exists, create one before using the reshape() function:

df$id   <- c(rep("year1", 12), rep("year2", 12))
df_wide <- reshape(df, idvar="id", timevar="month", v.names="values", direction="wide", sep="_")
df_wide

Just remember that idvar is required! The timevar and v.names part is easy. The output of this function is more predictable than some of the others, as everything is explicitly defined.
