Grouping an ordered variable by adjacent correlation in R

Date: 2022-03-03 01:07:46

I have a variable "markr" whose members are arranged in order, and the correlation between each pair of consecutive members of "markr" is given in the variable "corr".

markr <- c("A", "B", "C", "D", "E", "g", "A1", "B1", "cc", "dd",
           "f", "gg", "h", "K")
corr <- c(1, 1, 1, 1, 0.96, 0.5, 0.96, 1, 1, 1, 0.85, 0.99, 1)

I need to group markr based on corr without changing the order of the members of markr. The grouping is best explained by the following diagram:

[Image: diagram of the grouping]

Consecutive members of the object markr whose corr is greater than 0.95 belong to one group. Starting from the first value, the first group runs until corr drops below 0.95; the second group then starts and continues until corr drops below 0.95 again, and so on to the end of the data. Each group is named after its first and last members, for example A-g, A1-f, gg-k.

Thus the expected output is:

markr <- c("A", "B", "C", "D", "E",  "g", 
           "A1", "B1", "cc", "dd", "f", 
           "gg", "h", "K")
group <- c("A-g", "A-g", "A-g", "A-g","A-g", "A-g", 
           "A1-f",  "A1-f",  "A1-f",  "A1-f","A1-f", 
            "gg-k", "gg-k", "gg-k")
dataf <- data.frame (markr, group) 

dataf 

 markr group
1      A   A-g
2      B   A-g
3      C   A-g
4      D   A-g
5      E   A-g
6      g   A-g
7     A1  A1-f
8     B1  A1-f
9     cc  A1-f
10    dd  A1-f
11     f  A1-f
12    gg  gg-k
13     h  gg-k
14     K  gg-k

How can I automate this process? I have a very large dataset of this kind.

1 solution

#1

The group number of an element is one plus the number of corr values below 0.95 seen before it:

d1 <- data.frame(
  marker = markr,
  group = cumsum(c(1, corr < .95))
)
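
To see why this works, it may help to look at the intermediate vectors on the data from the question. This is only an illustrative sketch: the names `breaks` and `group_id` are not part of the original answer.

```r
markr <- c("A", "B", "C", "D", "E", "g", "A1", "B1", "cc", "dd",
           "f", "gg", "h", "K")
corr <- c(1, 1, 1, 1, 0.96, 0.5, 0.96, 1, 1, 1, 0.85, 0.99, 1)

# corr[i] is the correlation between markr[i] and markr[i + 1],
# so corr is one element shorter than markr.
stopifnot(length(corr) == length(markr) - 1)

# TRUE wherever the next element starts a new group.
breaks <- corr < 0.95

# The leading 1 puts the first element in group 1;
# every TRUE afterwards increments the running group number.
group_id <- cumsum(c(1, breaks))
group_id
# [1] 1 1 1 1 1 1 2 2 2 2 2 3 3 3
```

The two values below 0.95 (0.5 and 0.85) sit between "g" and "A1" and between "f" and "gg", which is exactly where the group number increases.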

For the group names, you can use ddply to cut the data.frame into pieces, one per group; it is then easy to extract the first and last element of each piece.

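library(plyr)
# One row per group: its name is built from the first and last marker.
d2 <- ddply( 
  d1, "group", summarize, 
  group_name=paste(head(marker,1), tail(marker,1), sep="-")
)
# Attach the group name to every marker via the group number.
d <- merge(d1, d2, by="group")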
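
If you would rather avoid the plyr dependency and the merge step, the same labels can probably be built in base R from the positions where each group starts and ends. This is a sketch of an alternative, not the method of the answer above; the names `starts`, `ends`, `group_id` and `group_name` are chosen here for illustration.

```r
markr <- c("A", "B", "C", "D", "E", "g", "A1", "B1", "cc", "dd",
           "f", "gg", "h", "K")
corr <- c(1, 1, 1, 1, 0.96, 0.5, 0.96, 1, 1, 1, 0.85, 0.99, 1)

# A group starts at element 1 and after every correlation below 0.95.
starts <- which(c(TRUE, corr < 0.95))
# Each group ends just before the next one starts; the last ends at the end.
ends <- c(starts[-1] - 1, length(markr))

# One name per group, built from its first and last member.
group_name <- paste(markr[starts], markr[ends], sep = "-")

# Running group number for each element, used to look up its group name.
group_id <- cumsum(c(1, corr < 0.95))
dataf <- data.frame(markr, group = group_name[group_id],
                    stringsAsFactors = FALSE)
dataf
```

Because everything is done by vector indexing, the row order of markr is preserved and no per-group function calls are needed, which should help on a very large dataset. Note that the last label comes out as "gg-K", since the final member of markr is an upper-case "K".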
