R dcast error / finding irregular IDs in a dataframe

My dataframe looks like this:

ID | value A | value B
1  |   A1    |   F
1  |   A2    |   N
1  |   A3    |   B
1  |   A4    |   S
2  |   A1    |   B
2  |   A2    |   G
2  |   A3    |   N
3  |   A1    |   F
3  |   A2    |   H
3  |   A3    |   J
3  |   A4    |   N

So each ID should have 4 rows. I am trying to use the dcast() function, but it only seems to work if all IDs have the same number of rows. ID 2 would be an error case in this example. Is there an easy way to find all IDs that have more or fewer than 4 rows? Or is there a way to make the dcast function ignore the error cases?

Ultimately I am trying to reshape the dataframe to get something like this:

ID | A1 | A2 | A3 | A4
 1 | F  | N  | B  | S 
 2 | B  | G  | N  | NA
 3 | F  | H  | J  | N

Apparently the dcast() function from the reshape2 package doesn't work with irregular IDs. It gives me the following message: 'Aggregation function missing: defaulting to length'. But with a smaller part of my dataset, which doesn't have those irregular IDs, it works. Any ideas? Or maybe an idea for how to reshape my dataframe without using dcast? Thanks!

I am working on a Mac with the following package versions:

sessionInfo() 
R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.2.1 plyr_1.7.1    

loaded via a namespace (and not attached):
[1] stringr_0.6

The values in the first column are all integers; the other columns hold character values.

sapply(x, class)
         ID      fach01      f01_lp 
  "integer" "character" "character"

As for the reproducible example, I hope this helps (I used my original dataframe). However, if I only use the first 500 rows of the dataframe, dcast() works perfectly fine; the problem occurs when I try to use the whole dataframe of about 140000 rows.

df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L, 
7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L),  A = c("2.LF", 
"1.LF", "3.PF", "4.PF", "3.PF", "1.LF", "2.LF", "3.PF", 
"4.PF", "1.LF", "2.LF", "3.PF", "1.LF", "4.PF", "2.LF", "1.LF", 
"2.LF", "4.PF", "3.PF", "1.LF", "3.PF", "2.LF", "4.PF", "3.PF", 
"4.PF", "1.LF", "2.LF", "4.PF", "2.LF", "3.PF", "1.LF", "1.LF", 
"2.LF", "3.PF", "4.PF"), B = c("Mu/Ku", 
"Fs", "2.AF", "NW", "DE", "2.AF", "MA", "Fs", "2.AF", "NW", 
"NW", "Fs", "2.AF", "bel", "NW", "Fs", "bel", "bel", "NW", "DE", 
"2.AF", "2.AF", "MA", "Fs", "2.AF", "MA", "NW", "DE", "2.AF", 
"MA", "NW", "Mu/Ku", "Fs", "2.AF", "NW")), .Names = c("ID", "A", "B"
), row.names = c(NA, -35L), class = "data.frame")
# Row names reset to 1..35: the posted dput listed 36 row names for 35 rows.

In my original dataframe the values A1-A4 (here called 1.LF, 2.LF, 3.PF and 4.PF) are not in the right order within each ID. This is what I want dcast to produce (same layout as above), shown for the first three IDs of the example data:

ID | 1.LF | 2.LF  | 3.PF | 4.PF
 1 | Fs   | Mu/Ku | 2.AF | NW
 2 | 2.AF | MA    | DE   | <NA>
 3 | NW   | NW    | Fs   | 2.AF

EDIT:

I didn't solve the dcast() problem, but I found a way to work around it with the reshape() function (this is the one in base R's stats package, not part of the reshape package):

df <- reshape(df, idvar='ID', varying = NULL, timevar = 'value A', direction='wide')
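
A minimal sketch of this workaround on the small example from the top of the question. The data frame name df_small and the column names A and B are assumptions made for illustration; the call itself is base R's reshape().

```r
# Small example from the question; the column names A and B are assumed.
df_small <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  A  = c("A1", "A2", "A3", "A4", "A1", "A2", "A3", "A1", "A2", "A3", "A4"),
  B  = c("F", "N", "B", "S", "B", "G", "N", "F", "H", "J", "N"),
  stringsAsFactors = FALSE
)

# One row per ID, one column per value of A; missing combinations become NA.
wide <- reshape(df_small, idvar = "ID", timevar = "A", direction = "wide")

# The value columns are named B.A1, B.A2, ...; strip the "B." prefix.
names(wide) <- sub("^B\\.", "", names(wide))
wide
```

ID 2 has no A4 row, so its A4 cell is NA rather than causing an error.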

3 answers

#1

table() and which() would certainly be the answer to the first question:

 names(table(dfrm$ID))[which(table(dfrm$ID) <4)]
#[1] "2"
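
The line above only catches IDs with fewer than 4 rows. Since the question asks for IDs with more or fewer than 4, a small variation using != may be closer to what is wanted. This is only a sketch; the vector ids stands in for the ID column of the small example.

```r
# ID column of the small example from the question (assumed values).
ids <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3)

# Count rows per ID, then keep the IDs whose count is not exactly 4.
counts <- table(ids)
irregular <- names(counts)[counts != 4]
irregular
```

The result is a character vector of ID labels, so it can be used with %in% to drop or inspect those rows.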

As for the second question, maybe you should post the code that is generating the error. At the moment it's not clear what you are trying (and failing) to do.

EDIT:

If I convert the factor variables to character variables I can get dcast to return the correct object, although my error is different from yours. I got the error with both reshape2 1.1 and reshape2 1.2.1 on R 2.14.1 on a Mac.

EDIT2: As it turned out, the bug was fixed in the newest version of plyr. I get no error with reshape2 1.2.1 running with plyr 1.7. You should also update those two packages and restart with a fresh session.

require(reshape2)
dfrm <- structure(list(ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3), value.A = structure(c(1L, 
2L, 3L, 4L, 1L, 2L, 3L, 1L, 2L, 3L, 4L), .Label = c("   A1    ", 
"   A2    ", "   A3    ", "   A4    "), class = "factor"), value.B = structure(c(2L, 
6L, 1L, 7L, 1L, 3L, 6L, 2L, 4L, 5L, 6L), .Label = c("   B", "   F", 
"   G", "   H", "   J", "   N", "   S"), class = "factor")), .Names = c("ID", 
"value.A", "value.B"), class = "data.frame", row.names = c(NA, 
-11L))
dcast(dfrm, ID ~ value.A)
# Using value.B as value column: use value_var to override.
# Error in names(data) <- array_names(res$labels[[2]]) : 
#  'names' attribute [4] must be the same length as the vector [1]
# I first tried removing the leading and trailing spaces with:
dfrm2 <- data.frame(lapply(dfrm, gsub, pattern="^\\s+|\\s+$", replacement=""))
# Still got the error. Now try to leave as "character" type.

dfrm2 <- data.frame(lapply(dfrm, gsub, pattern="^\\s+|\\s+$", replacement=""),stringsAsFactors=FALSE)
str(dfrm2)
#-----------------
# 'data.frame':   11 obs. of  3 variables:
#  $ ID     : chr  "1" "1" "1" "1" ...
#  $ value.A: chr  "A1" "A2" "A3" "A4" ...
#  $ value.B: chr  "F" "N" "B" "S" ...

dcast(dfrm2, ID ~ value.A)
#------------------
# Using value.B as value column: use value_var to override.
#   ID A1 A2 A3   A4
# 1  1  F  N  B    S
# 2  2  B  G  N <NA>
# 3  3  F  H  J    N

#2

You should mention that dcast is from the reshape2 package (not part of base R). I'm not sure what you're trying to do with it, but this should do what you ask for.

Make up data:

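library(plyr)  # provides ddply() and .()
id <- rep(1:3,c(4,3,4))
d <- data.frame(id)
# For each id, label its rows A1, A2, ... and draw a random letter for B
d <- ddply(d,.(id),
           function(x) {
             transform(x,A=paste("A",seq(nrow(x)),sep=""),
                       B=sample(LETTERS,nrow(x),replace=TRUE))
           })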

Identify 'bad' groups:

idtab <- table(d$id)
d2 <- d[!d$id %in% names(idtab)[idtab<4],]  # use the column d$id, not the global vector id

I can do this, but if I use the full data set, dcast does the "right" thing (i.e. what I would have hoped for, and what it sounds like you want) and fills in the missing values with NA; I didn't get an error (I'm using reshape2 v 0.8.4 under a development version of R).

library(reshape2)

With the sanitized data:

dcast(d2,id~A)
# Using B as value column: use value.var to override.
#   id A1 A2 A3 A4
# 1  1  B  X  P  E
# 2  3  F  Q  H  B

With the original data:

dcast(d,id~A)
# Using B as value column: use value.var to override.
#   id A1 A2 A3   A4
# 1  1  B  X  P    E
# 2  2  I  N  H <NA>
# 3  3  F  Q  H    B
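
Since missing combinations are simply filled with NA, the "Aggregation function missing: defaulting to length" message more likely points to the opposite problem: dcast prints it when some ID/variable combination occurs more than once, so that a single cell would have to hold several values. A minimal base-R sketch for locating such rows follows; the data frame dd and its duplicated row are made up for illustration.

```r
# Hypothetical data: id 2 has two rows for A1, so the cell (2, A1) is ambiguous.
dd <- data.frame(
  id = c(1, 1, 2, 2, 2),
  A  = c("A1", "A2", "A1", "A1", "A2"),
  B  = c("F", "N", "B", "G", "N"),
  stringsAsFactors = FALSE
)

# The id/A pair is what identifies one cell of the wide table.
key <- dd[, c("id", "A")]

# Flag every row whose pair appears more than once (first and later copies).
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
dd[dup, ]
```

Running the same check on the full 140000-row data frame should show whether duplicated pairs, rather than short groups, are behind the message.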

#3

Try tapply. (If the third column is already character, as opposed to factor, then as.character can be omitted):

tapply(as.character(DF[,3]), DF[-3], c)
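
A small runnable sketch of the same call on the example from the question. The data frame DF and its column names are assumed here; the third column is already character, so as.character() is left out.

```r
# Small example from the question; the column names A and B are assumed.
DF <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  A  = c("A1", "A2", "A3", "A4", "A1", "A2", "A3", "A1", "A2", "A3", "A4"),
  B  = c("F", "N", "B", "S", "B", "G", "N", "F", "H", "J", "N"),
  stringsAsFactors = FALSE
)

# Cross-tabulate the third column by the first two; each cell holds one value,
# and combinations that never occur (ID 2, A4) are left as NA.
res <- tapply(DF[, 3], DF[-3], c)
res
```

The result is a character matrix with the IDs as row names and A1 to A4 as column names, not a data frame.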
