I am trying to replicate a table often used in official statistics, but have had no success so far. Given a data frame like this one:
d1 <- data.frame( StudentID = c("x1", "x10", "x2",
"x3", "x4", "x5", "x6", "x7", "x8", "x9"),
StudentGender = c('F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'),
ExamenYear = c('2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'),
Exam = c('algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'),
participated = c('no','yes','yes','yes','no','yes','yes','yes','yes','yes'),
passed = c('no','yes','yes','yes','no','yes','yes','yes','no','yes'),
stringsAsFactors = FALSE)
I would like to create a table showing, per year, the number of all students (All), and of those, how many are female, how many participated and how many passed. Please note that "ofwhich" below refers to all students.
The table I have in mind would look like this:
cbind(All = table(d1$ExamenYear),
participated = table(d1$ExamenYear, d1$participated)[,2],
ofwhichFemale = table(d1$ExamenYear, d1$StudentGender)[,1],
ofwhichpassed = table(d1$ExamenYear, d1$passed)[,2])
I am sure there is a better way to do this kind of thing in R.
Note: I have seen LaTeX solutions, but I am not sure they will work for me, as I need to export the table to Excel.
Thanks in advance.
4 Answers
#1
8
Using plyr:
require(plyr)
ddply(d1, .(ExamenYear), summarize,
All=length(ExamenYear),
participated=sum(participated=="yes"),
ofwhichFemale=sum(StudentGender=="F"),
ofWhichPassed=sum(passed=="yes"))
Which gives:
  ExamenYear All participated ofwhichFemale ofWhichPassed
1       2007   3            2             2             2
2       2008   4            3             2             3
3       2009   3            3             0             2
#2
4
The plyr package is great for this sort of thing. First load the package:
library(plyr)
Then we use the ddply function:
ddply(d1, "ExamenYear", summarise,
All = length(passed), # any column works here, since we only count rows
participated = sum(participated=="yes"),
ofwhichFemale = sum(StudentGender=="F"),
ofwhichpassed = sum(passed=="yes"))
Basically, ddply expects a data frame as input and returns a data frame. Here it splits the input data frame by ExamenYear, and on each sub-table we calculate a few summary statistics. Notice that inside ddply we don't have to use the $ notation when referring to columns.
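To make the split/summarise/combine idea concrete, here is a rough base R sketch of the same steps. It is only an illustration of the pattern, not how plyr is implemented internally; the names pieces, rows and result are made up for this example.

```r
# Sample data from the question
d1 <- data.frame(
  StudentID = c("x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"),
  StudentGender = c("F", "M", "F", "M", "F", "M", "F", "M", "M", "M"),
  ExamenYear = c("2007", "2007", "2007", "2008", "2008", "2008", "2008",
                 "2009", "2009", "2009"),
  Exam = c("algebra", "stats", "bio", "algebra", "algebra", "stats", "stats",
           "algebra", "bio", "bio"),
  participated = c("no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes"),
  passed = c("no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes"),
  stringsAsFactors = FALSE)

# Split: one sub data frame per exam year
pieces <- split(d1, d1$ExamenYear)

# Apply: build a one-row summary for each piece
rows <- lapply(pieces, function(piece) {
  data.frame(
    ExamenYear    = piece$ExamenYear[1],
    All           = nrow(piece),
    participated  = sum(piece$participated == "yes"),
    ofwhichFemale = sum(piece$StudentGender == "F"),
    ofwhichpassed = sum(piece$passed == "yes"),
    stringsAsFactors = FALSE)
})

# Combine: stack the one-row summaries into a single data frame
result <- do.call(rbind, rows)
rownames(result) <- NULL
print(result)
```

The ddply call does all three steps in one go, which is why it reads so much shorter.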
#3
4
A couple of modifications to your code (using with to reduce the number of df$ calls, and character indices to make it more self-documenting) would have made it easier to read and a worthy competitor to the ddply solutions:
with( d1, cbind(All = table(ExamenYear),
participated = table(ExamenYear, participated)[,"yes"],
ofwhichFemale = table(ExamenYear, StudentGender)[,"F"],
ofwhichpassed = table(ExamenYear, passed)[,"yes"])
)
     All participated ofwhichFemale ofwhichpassed
2007   3            2             2             2
2008   4            3             2             3
2009   3            3             0             2
I would expect this to be much faster than the ddply solution, although that will only be apparent if you are working on larger datasets.
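Since the question mentions getting the table into Excel, one simple route (a sketch, assuming a CSV file is acceptable) is to write the matrix out with base R's write.csv, which Excel can open directly. The names tab, out_file and check are made up for this example, and the temporary path is just a placeholder for wherever you want the file.

```r
# Sample data from the question
d1 <- data.frame(
  StudentID = c("x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"),
  StudentGender = c("F", "M", "F", "M", "F", "M", "F", "M", "M", "M"),
  ExamenYear = c("2007", "2007", "2007", "2008", "2008", "2008", "2008",
                 "2009", "2009", "2009"),
  Exam = c("algebra", "stats", "bio", "algebra", "algebra", "stats", "stats",
           "algebra", "bio", "bio"),
  participated = c("no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes"),
  passed = c("no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes"),
  stringsAsFactors = FALSE)

# Build the summary matrix; the row names are the exam years
tab <- with(d1, cbind(All = table(ExamenYear),
                      participated = table(ExamenYear, participated)[, "yes"],
                      ofwhichFemale = table(ExamenYear, StudentGender)[, "F"],
                      ofwhichpassed = table(ExamenYear, passed)[, "yes"]))

# Placeholder output path; replace with a real file name such as "summary.csv"
out_file <- tempfile(fileext = ".csv")

# The row names (the years) are written as the first column
write.csv(tab, file = out_file)

# Read the file back to confirm what Excel will see
check <- read.csv(out_file, row.names = 1)
print(check)
```

If you need a native .xlsx file rather than CSV, you would need an add-on package for that; the CSV route has the advantage of working with base R alone.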
#4
1
You may also want to take a look at plyr's next iteration: dplyr
It uses a ggplot-like syntax and provides fast performance by writing key pieces in C++.
library(dplyr)
# %>% replaces the %.% operator, which was deprecated in later dplyr releases
d1 %>%
group_by(ExamenYear) %>%
summarise(ALL=length(ExamenYear),
participated=sum(participated=="yes"),
ofwhichFemale=sum(StudentGender=="F"),
ofWhichPassed=sum(passed=="yes"))