
Time: 2022-03-03 01:07:46

I have a "markr" variable whose members are arranged in order, and the correlation between consecutive members of "markr" is given in the corr variable.


markr <- c("A", "B", "C", "D", "E",  "g", "A1", "B1", "cc", "dd", 
     "f", "gg", "h", "K")
corr <- c(     1,   1,   1,   1, 0.96,   0.5,  0.96,        1 ,   1 ,  
       1 ,  0.85, 0.99, 1)

I need to group markr based on corr without changing the order of the members of markr. The grouping can be better explained by the following diagram:



Consecutive members of the object markr whose corr is greater than 0.95 belong to one group. Starting from the first value, when corr drops below 0.95 a second group starts and continues until corr drops below 0.95 again; the process continues to the end of the data. Each group is named after its first and last members, for example A-g, A1-f, gg-k.


Thus the expected output is:


markr <- c("A", "B", "C", "D", "E",  "g", 
           "A1", "B1", "cc", "dd", "f", 
           "gg", "h", "K")
group <- c("A-g", "A-g", "A-g", "A-g","A-g", "A-g", 
           "A1-f",  "A1-f",  "A1-f",  "A1-f","A1-f", 
            "gg-k", "gg-k", "gg-k")
dataf <- data.frame (markr, group) 


   markr group
1      A   A-g
2      B   A-g
3      C   A-g
4      D   A-g
5      E   A-g
6      g   A-g
7     A1  A1-f
8     B1  A1-f
9     cc  A1-f
10    dd  A1-f
11     f  A1-f
12    gg  gg-k
13     h  gg-k
14     K  gg-k

How can I automate this process? I have a very large dataset of this kind.


1 Solution



The group number is one plus the number of corr values under 0.95 we have seen so far:


# corr[i] relates markr[i] to markr[i + 1]; each value below 0.95 starts a new group
d1 <- data.frame(
  marker = markr,
  group = cumsum(c(1, corr < .95))
)
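
To see why the cumulative sum gives the group numbers, it may help to look at the intermediate values. This is only an illustrative sketch using the corr vector from the question; the names `breaks` and `grp` are made up for this example.

```r
corr <- c(1, 1, 1, 1, 0.96, 0.5, 0.96, 1, 1, 1, 0.85, 0.99, 1)

# TRUE wherever the correlation with the next marker falls below the threshold
breaks <- corr < 0.95
which(breaks)   # 6 11

# Prepend 1 for the first marker; each TRUE then bumps the counter by one
grp <- cumsum(c(1, breaks))
grp             # 1 1 1 1 1 1 2 2 2 2 2 3 3 3
```

The result has one more element than corr, so it lines up with the 14 members of markr.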

For the group names, you can use ddply (from the plyr package) to cut the data.frame into pieces, one per group: it is then easy to extract the first and last elements.


library(plyr)
d2 <- ddply(
  d1, "group", summarize,
  group_name = paste(head(marker, 1), tail(marker, 1), sep = "-")
)
d <- merge(d1, d2, by = "group")
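
If you would rather not depend on plyr, the same idea can probably be expressed in base R with ave(), which applies a function within each group and returns a vector aligned with the input, so no merge step is needed. This is a sketch of an alternative, not the approach used above; `grp` and `group_name` are names chosen for this example.

```r
markr <- c("A", "B", "C", "D", "E", "g", "A1", "B1", "cc", "dd",
           "f", "gg", "h", "K")
corr <- c(1, 1, 1, 1, 0.96, 0.5, 0.96, 1, 1, 1, 0.85, 0.99, 1)

# A new group starts after every correlation below 0.95
grp <- cumsum(c(1, corr < 0.95))

# Within each group, build "first-last" and repeat it for every member
group_name <- ave(markr, grp, FUN = function(x) {
  rep(paste(x[1], x[length(x)], sep = "-"), length(x))
})

dataf <- data.frame(markr, group = group_name, stringsAsFactors = FALSE)
```

Note that the last group comes out as gg-K rather than gg-k, because the label is built from the marker exactly as it is spelled in markr.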


