Filtering arrays in a list-type data.table column

Time: 2021-06-08 23:21:25

I'm trying to check whether a new (truck) route that I just found is already part of a route I have stored. For instance, assume my stored routes are in the data.table routelist, where the node_list column holds the stored routes. I want to find the rows whose route contains (5,6,7,8).

library(data.table)
routelist=data.table(id=c(1:3),node_list=list(c(1:6),c(4:7),c(1:10)))
item<-c(5:8)
routelist[sum(item%in%unlist(routelist$node_list))==length(item)]

For the above check, all three rows are returned, but only the third row should be. I could do it with the following for loop, but it's not fast and does not take order into account (and there should be a better way). The order of the nodes in item is important, and the nodes do not need to be consecutive: item could be c(5,7,8) and should match the 3rd row, while c(5,8,7) shouldn't match any row.

for(i in 1:3)
{
  if(sum(item%in%unlist(routelist[i]$node_list))==length(item))
    print(routelist[i])
}

3 solutions

#1


2  

There are two issues with OP's data.table approach here.

Missing by clause

routelist = data.table(id = 1:3, node_list = list(1:6, 4:7, 1:10))
item <- 5:8
routelist[, sum(item %in% unlist(node_list)) == length(item)] 

returns a single TRUE value because

routelist[, unlist(node_list)]

returns a single vector

 [1]  1  2  3  4  5  6  4  5  6  7  1  2  3  4  5  6  7  8  9 10

If grouped by id, we do get the desired result:

routelist[, sum(item %in% unlist(node_list)) == length(item), by = id]
   id    V1
1:  1 FALSE
2:  2 FALSE
3:  3  TRUE

or

routelist[routelist[, sum(item %in% unlist(node_list)) == length(item), by = id]$V1]
   id    node_list
1:  3 1,2,3,4,5,6,

%in% checks only appearance but not the order

The expression sum(item %in% unlist(node_list)) == length(item) ignores the order of the elements in item, because %in% only tests membership.

As the order of the elements is important, the expression

isTRUE(all(diff(match(item, unlist(node_list))) > 0))

accounts for the order. match() returns the positions of the elements of item in node_list (or NA if not found). If the order in item is the same as in node_list then all differences in position must be positive. isTRUE() is required to cover the NA case.
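
To see why this works, it can help to look at the intermediate values for each stored route. The following is a small base-R sketch using the routes from the question; the names routes, pos, and in_order are illustrative and not part of the original answer.

```r
# Routes from the question, as a plain list (illustrative name)
routes <- list(1:6, 4:7, 1:10)
item <- c(5, 7, 8)

# Position of each element of item within each route (NA if absent)
pos <- lapply(routes, function(r) match(item, r))
# pos[[1]] is 5 NA NA, pos[[2]] is 2 4 NA, pos[[3]] is 5 7 8

# Positions must be strictly increasing; with an NA present, all() yields
# NA or FALSE, and isTRUE() maps both to FALSE
in_order <- vapply(pos, function(p) isTRUE(all(diff(p) > 0)), logical(1))
in_order
```

Only the third route gives TRUE. Note that match() reports the first occurrence of each node, so this check assumes a node does not appear more than once within a stored route.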

Thus,

item <- c(5, 7, 8)
routelist[routelist[, isTRUE(all(diff(match(item, unlist(node_list))) > 0)), by = id]$V1]

returns

   id    node_list
1:  3 1,2,3,4,5,6,

despite the gap while

item <- c(5, 8, 7)
routelist[routelist[, isTRUE(all(diff(match(item, unlist(node_list))) > 0)), by = id]$V1]

returns

Empty data.table (0 rows) of 2 cols: id,node_list

as requested due to the wrong order.
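
The same check can probably also be applied directly as a row filter by iterating over the list column, which avoids the grouped inner query. This is a sketch rather than part of the original answer: contains_in_order is a hypothetical helper name, and the extra !anyNA() test is an addition to cover a single-node item, where diff() returns an empty vector and all() would be TRUE even if the node is absent.

```r
library(data.table)

routelist <- data.table(id = 1:3, node_list = list(1:6, 4:7, 1:10))

# Hypothetical helper: TRUE if every element of item occurs in route and
# the (first) positions are strictly increasing
contains_in_order <- function(route, item) {
  p <- match(item, route)
  !anyNA(p) && all(diff(p) > 0)
}

# vapply() returns one logical per row, which is used to filter the rows
item <- c(5, 7, 8)
hits <- routelist[vapply(node_list, contains_in_order, logical(1), item = item)]

# Same filter with the nodes in the wrong order
misses <- routelist[vapply(node_list, contains_in_order, logical(1), item = c(5, 8, 7))]
```

Here hits should contain only the row with id 3, and misses should be empty.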

#2


1  

Solutions from dplyr and tidyr.

If the order is not important, the following approach may work. The id column in routelist2 shows that id 3 is the only route that satisfies the condition.

# Create example dataset
library(data.table)
routelist=data.table(id=c(1:3),node_list=list(c(1:6),c(4:7),c(1:10)))
item<-c(5:8)

# Solution 1
library(dplyr)
library(tidyr)

routelist2 <- routelist %>%
  unnest(node_list) %>%
  group_by(id) %>%
  filter(all(item %in% node_list)) %>%
  nest()

routelist2 
# A tibble: 1 x 2
     id              data
  <int>            <list>
1     3 <tibble [10 x 1]>

If the order is important, we may have to convert the route numbers to a string and then find the right string pattern. The following approach should work when the nodes in item appear consecutively in the route.

# Solution 2
item_str <- toString(item)

routelist3 <- routelist %>%
  rowwise() %>%
  mutate(node_list = toString(node_list)) %>%
  filter(grepl(item_str, node_list)) %>%
  ungroup()

routelist3
# A tibble: 1 x 2
     id                     node_list
  <int>                         <chr>
1     3 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
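
One caveat with the string approach: grepl() matches substrings, so a pattern such as "5, 6, 7, 8" also matches inside "15, 6, 7, 8". A possible workaround, sketched below in base R, is to wrap every node in delimiters before matching. The helper name as_key and the route c(15, 6, 7, 8) are made up for illustration. Like Solution 2, this only finds consecutive runs of nodes.

```r
# Hypothetical helper: encode a route as ",n1,n2,...," so every node is delimited
as_key <- function(x) paste0(",", paste(x, collapse = ","), ",")

item <- 5:8
route_bad <- c(15, 6, 7, 8)  # made-up route that does not contain node 5
route_ok <- 1:10

# Plain toString() matching reports a false positive for route_bad
plain_bad <- grepl(toString(item), toString(route_bad), fixed = TRUE)

# Delimited keys match only whole node numbers
keyed_bad <- grepl(as_key(item), as_key(route_bad), fixed = TRUE)
keyed_ok <- grepl(as_key(item), as_key(route_ok), fixed = TRUE)

c(plain_bad = plain_bad, keyed_bad = keyed_bad, keyed_ok = keyed_ok)
```

With the delimited keys, route_bad is rejected while route_ok is still found.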

Update

The following handles the case where the nodes in item2 are not consecutive in the route.

# Solution 3
library(dplyr)
library(tidyr)

item2 <- c(5, 7, 8)

routelist4 <- routelist %>%
  unnest(node_list) %>%
  group_by(id) %>%
  filter(all(item2 %in% node_list)) %>%
  filter(node_list %in% item2) %>%
  summarise(node_list = toString(node_list)) %>%
  filter(node_list == toString(item2))
routelist4
# A tibble: 1 x 2
     id node_list
  <int>     <chr>
1     3   5, 7, 8

#3


0  

With a loop (which is not elegant), the following check can be used in the body. It does take order into account:

library(data.table)
routelist=data.table(id=c(1:3),node_list=list(c(1:6),c(4:7),c(1:10)))
item<-c(5,8,7)

for(i in 1:nrow(routelist))
{
  if(identical(intersect(unlist(routelist[i]$node_list),item),item)){
    print(routelist[i])  
  }

}
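
The loop can also be written as a row filter. The sketch below assumes that no node repeats within a route (true for the example data); intersect() drops duplicates, so the comparison would not be reliable otherwise. The name matches is illustrative, and as.numeric() is added only so that integer and double node numbers compare as equal.

```r
library(data.table)

routelist <- data.table(id = 1:3, node_list = list(1:6, 4:7, 1:10))
item <- c(5, 7, 8)

# intersect(route, item) keeps the shared nodes in route order;
# comparing the result with item checks both presence and order
matches <- vapply(
  routelist$node_list,
  function(route) identical(as.numeric(intersect(route, item)), as.numeric(item)),
  logical(1)
)

# Keep only the rows whose route contains item in order
routelist[matches]
```

For this item, only the row with id 3 should remain.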
