R: Make duplicated levels unique in all factor columns of a data frame

Date: 2021-10-21 20:24:24

For several days now I have been stuck on a problem in R: making duplicated levels unique in multiple factor columns of a data frame using a loop. This is part of a larger project.

I have more than 200 SPSS data sets in which the number of cases varies between 4,000 and 23,000 and the number of variables varies between 120 and 1,200 (an excerpt of one of the SPSS data sets can be found here). The files contain both numeric and factor variables, and many of the factors have duplicated levels. I used read.spss from the foreign package to import them into data frames, keeping the value labels because I need them later. During the import, R warns me about the duplicated levels in the factor columns:

> adn <- read.spss("/tmp/adn_110.sav", use.value.labels = TRUE,
use.missings = TRUE, to.data.frame = TRUE)
Warning messages:
1: In read.spss("/tmp/adn_110.sav", use.value.labels = TRUE, use.missings = TRUE,  :
  /tmp/adn_110.sav: Unrecognized record type 7, subtype 18 encountered in system file
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated
3: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated

The data frame, exported as .RData, can be found here. When I use table (for example) to get the counts for each level of any factor column, all duplicated levels are displayed, but the counts for all duplicated levels are added to the first occurrence of the duplicate levels and for all others 0s are returned:

> table(adn[["adn01"]], useNA = "ifany")
  Incorrect         Incorrect Partially correct Partially correct 
          8                 0                 4                 0 
    Correct              <NA> 
          2                 1 
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated

I know I can easily wrap the factor in as.numeric when calling table. However, I need the level names displayed in the output. I can use make.unique to make the levels of an individual factor column unique, appending a number to the end of each duplicated level:

> levels(adn[["adn01"]]) <- make.unique(levels(adn[["adn01"]]), sep = " ")

Works like a charm. Then table shows me the correct counts:

> table(adn[["adn01"]], useNA = "ifany")

          Incorrect         Incorrect 1   Partially correct 
                  5                   3                   1 
Partially correct 1             Correct                <NA> 
                  3                   2                   1 
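
The suffixing can be checked in isolation on a plain character vector. The following is only a minimal sketch: the `labels` vector is a hypothetical stand-in that mirrors the level names shown above, not the actual adn data.

```r
# Hypothetical stand-in for levels(adn[["adn01"]])
labels <- c("Incorrect", "Incorrect",
            "Partially correct", "Partially correct", "Correct")

# With sep = " " the second occurrence gets " 1" appended
spaced <- make.unique(labels, sep = " ")
spaced

# With the default sep = "." it gets ".1" appended instead
dotted <- make.unique(labels)
dotted
```

Only the repeated entries are changed; the first occurrence of each label keeps its original name.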

However, doing this for each factor column in each of the more than 200 files, where the number of variables varies between 120 and 1,200, would be the mission of a lifetime. And if the files change, I will have to redo everything. I naively thought looping through the columns would be easy. However, make.unique requires names. I have tried the following:

> lapply(adn[ , 1:length(adn)], make.unique(as.vector(attr(adn[ , 1:length(adn)],
"levels"))))
Error in make.unique(as.vector(attr(adn[, 1:length(adn)], "levels"))) : 
  'names' must be a character vector

No luck. I have tried many other things over the last few days, including classical for loops. Still the same: 'names' must be a character vector. I guess the problem is in indexing the levels attribute of the columns, which is a list component, but I can't figure out what exactly. An additional issue may be that not all columns are factors. Can someone help?

EDIT:

The solution provided by akrun works perfectly. Thank you once again!

1 Solution

#1


Try

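 load('adn.RData')
 # Logical index of the factor columns; numeric columns are left alone
 indx <- sapply(adn, is.factor)
 # Make the levels of each factor column unique (default sep = ".")
 adn[indx] <- lapply(adn[indx], function(x) {
                   levels(x) <- make.unique(levels(x))
                   x })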


 table(adn[['adn01']], useNA='ifany')

 #     Incorrect         Incorrect.1   Partially correct Partially correct.1 
 #             5                   3                   1                   3 
 #       Correct                <NA> 
 #             2                   1 


  table(adn[['adn03']], useNA='ifany')

  #  Incorrect Partially correct           Correct              <NA> 
  #          6                 3                 5                 1 
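
As for why the attempt in the question fails: the levels are stored on each factor column, not on the data frame, so asking the data frame for its "levels" attribute returns NULL, and make.unique rejects that before lapply is ever entered. A minimal sketch of this, using a small made-up data frame (`toy` is hypothetical and only stands in for adn):

```r
# Hypothetical toy data frame with one numeric and one factor column
toy <- data.frame(score = c(1.5, 2.0, 3.5),
                  grade = factor(c("low", "high", "low")))

# A data frame carries no "levels" attribute of its own
attr(toy, "levels")

# make.unique() only accepts a character vector, so passing NULL fails;
# the condition object is captured here instead of stopping the script
res <- tryCatch(make.unique(as.vector(attr(toy, "levels"))),
                error = function(e) e)
conditionMessage(res)

# The levels have to be requested from the column itself
levels(toy[["grade"]])
```

This is also why the working code passes a function to lapply: the function receives one column at a time and can read and replace that column's levels.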

Update

If you have multiple files, you can read them into a list and then do the processing on the list. For example, assuming that the files are in the working directory:

library(foreign)
# Match names such as adn_110.sav; '^adn\\d+' would miss the underscore
files <- list.files(pattern='^adn.*\\.sav$')
lst1 <- lapply(files, function(x) read.spss(x, use.value.labels = TRUE,
          use.missings = TRUE, to.data.frame = TRUE)) #not tested

For testing purposes, I am creating lst1 with the same dataset adn.

adn1 <- adn
lst1 <- list(adn, adn1)

Now apply make.unique to each list element:

lst2 <- lapply(lst1, function(dat) {
                  indx <- sapply(dat, is.factor)
                  dat[indx] <- lapply(dat[indx], function(x) {
                                  levels(x) <- make.unique(levels(x))
                                  x })
                  dat })


  lapply(lst2, function(x) table(x[['adn01']], useNA='ifany'))
  # [[1]]

  #    Incorrect         Incorrect.1   Partially correct Partially correct.1 
  #            5                   3                   1                   3 
  #      Correct                <NA> 
  #            2                   1 

  # [[2]]

  #    Incorrect         Incorrect.1   Partially correct Partially correct.1 
  #            5                   3                   1                   3 
  #      Correct                <NA> 
  #            2                   1 
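
The same steps can be wrapped in a small function so that it can be applied to any number of imported data frames. This is only a sketch under stated assumptions: `make_levels_unique` is a hypothetical helper name, the `sep` argument is added for illustration, and the toy factor is built with structure() purely to mimic the duplicated labels from the question without needing the SPSS files.

```r
# Hypothetical helper: make the levels of every factor column unique.
# Non-factor columns are returned unchanged.
make_levels_unique <- function(dat, sep = ".") {
  is_fac <- vapply(dat, is.factor, logical(1))
  if (any(is_fac)) {
    dat[is_fac] <- lapply(dat[is_fac], function(x) {
      levels(x) <- make.unique(levels(x), sep = sep)
      x
    })
  }
  dat
}

# Toy stand-in for one imported file (not the real adn data).
# structure() is used only to reproduce duplicated level labels.
dup <- structure(c(1L, 2L, 2L, 3L, 4L, 5L),
                 levels = c("Incorrect", "Incorrect",
                            "Partially correct", "Partially correct",
                            "Correct"),
                 class = "factor")
toy <- data.frame(id = 1:6)
toy$adn01 <- dup

# Apply the helper to a named list of data frames
fixed <- lapply(list(a = toy, b = toy), make_levels_unique, sep = " ")
levels(fixed$a$adn01)
table(fixed$a$adn01)
```

Using a named list keeps track of which data frame came from which file; with real files the names could be set from the vector returned by list.files.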
