Build a square adjacency matrix from a data.frame or data.table

Time: 2022-07-20 21:05:36

I am trying to build a square adjacency matrix from a data.table. Here is a reproducible example of what I already have:

require(data.table)
require(plyr)
require(reshape2)
# Build a mock data.table
dt <- data.table(Source=as.character(rep(letters[1:3],2)),Target=as.character(rep(letters[4:2],2)))
dt
#   Source Target
#1:      a      d
#2:      b      c
#3:      c      b
#4:      a      d
#5:      b      c
#6:      c      b
sry <- ddply(dt, .(Source,Target), summarize, Frequency=length(Source))
sry
#  Source Target Frequency
#1      a      d         2
#2      b      c         2
#3      c      b         2
mtx <- as.matrix(dcast(sry, Source ~ Target, value.var="Frequency", fill=0))
rownames(mtx) <- mtx[,1]
mtx <- mtx[,2:ncol(mtx)]
mtx
#  b   c   d
#a "0" "0" "2"
#b "0" "2" "0"
#c "2" "0" "0"

Now, this is very close to what I want to get, except that I would like to have all the nodes represented in both dimensions, like:

  a b c d
a 0 0 0 2
b 0 0 2 0
c 0 2 0 0
d 0 0 0 0

Note that I am working on quite large data, so I'd like to find an efficient solution for this.

Thank you for your help.

SOLUTIONS (EDIT):

Given the quality of the solutions offered and the size of my dataset, I benchmarked all the solutions.

# The benchmark was run on a 1-million-row sample from my original dataset
library(data.table)
aa <- fread("small2.csv",sep="^")
dt <- aa[,c(8,9),with=F]
colnames(dt) <- c("Source","Target")
dim(dt)
#[1] 1000001       2
levs <- unique(unlist(dt, use.names=F))
length(levs)
#[1] 2222

Given this data, the desired output is a 2222*2222 matrix (a 2222*2223 result whose first column contains the row names is obviously acceptable as well).

# Ananda Mahto's first solution
am1 <- function() {
    table(dt[, lapply(.SD, factor, levs)])
}
dim(am1())
#[1] 2222 2222

# Ananda Mahto's second solution
am2 <- function() {
    as.matrix(dcast(dt[, lapply(.SD, factor, levs)], Source~Target, drop=F, value.var="Target", fun.aggregate=length))
}
dim(am2())
#[1] 2222 2223

library(dplyr)
library(tidyr)
# Akrun's solution
akr <- function() {
    dt %>%
       mutate_each(funs(factor(., levs))) %>%
       group_by(Source, Target) %>%
       tally() %>%
       spread(Target, n, drop=FALSE, fill=0)
}
dim(akr())
#[1] 2222 2223

library(igraph)
# Carlos Cinelli's solution
cc <- function() {
    g <- graph_from_data_frame(dt)
    as_adjacency_matrix(g)
}
dim(cc())
#[1] 2222 2222

And the result of the benchmark is…

library(rbenchmark)
benchmark(am1(), am2(), akr(), cc(), replications=75)
#    test replications elapsed relative user.self sys.self user.child sys.child
# 1 am1()           75  15.939    1.000    15.636    0.280          0         0
# 2 am2()           75 111.558    6.999   109.345    1.616          0         0
# 3 akr()           75  43.786    2.747    42.463    1.134          0         0
# 4  cc()           75  46.193    2.898    45.532    0.563          0         0

3 Answers

#1 (7 votes)

It sounds like you're just looking for table, but you should make sure that both columns have the same factor levels:

levs <- unique(unlist(dt, use.names = FALSE))
table(lapply(dt, factor, levs))
#       Target
# Source a b c d
#      a 0 0 0 2
#      b 0 0 2 0
#      c 0 2 0 0
#      d 0 0 0 0
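
As a small follow-up (a sketch, not part of the original answer): table() returns a "table" object; if a plain matrix is needed downstream, unclass() drops that class while keeping the counts and dimnames:

m <- unclass(table(lapply(dt, factor, levs)))  # plain integer matrix, same dimnames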

I don't know if it would offer any speed improvements, but you could also use dcast from "data.table":

dcast(dt[, lapply(.SD, factor, levs)], Source ~ Target, drop = FALSE,
      value.var = "Target", fun.aggregate = length)

#2 (3 votes)

You can also use igraph. Since you said that you are dealing with large data, igraph has the advantage that it uses sparse matrices:

library(igraph)
g <- graph_from_data_frame(dt)
as_adjacency_matrix(g)
4 x 4 sparse Matrix of class "dgCMatrix"
  a b c d
a . . . 2
b . . 2 .
c . 2 . .
d . . . .
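
A follow-up sketch (not from the original answer): if a dense base-R matrix is needed later, the sparse result converts with as.matrix(); as_adjacency_matrix() also accepts sparse = FALSE to return a dense matrix directly.

as.matrix(as_adjacency_matrix(g))
  a b c d
a 0 0 0 2
b 0 0 2 0
c 0 2 0 0
d 0 0 0 0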

#3 (1 vote)

We can use dplyr/tidyr:

library(dplyr)
library(tidyr)
dt %>% 
   mutate_each(funs(factor(., letters[1:4]))) %>% 
   group_by(Source, Target) %>%
   tally() %>%
   spread(Target, n, drop=FALSE, fill=0)
#  Source     a     b     c     d
#   (fctr) (dbl) (dbl) (dbl) (dbl)
#1      a     0     0     0     2
#2      b     0     0     2     0
#3      c     0     2     0     0
#4      d     0     0     0     0
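
For completeness, a sketch of my own (not part of the original answer): mutate_each()/funs() have since been deprecated and spread() is superseded, so with current dplyr/tidyr essentially the same result can be written as:

dt %>% 
   mutate(across(everything(), ~ factor(.x, levels=letters[1:4]))) %>% 
   count(Source, Target, .drop=FALSE) %>%
   pivot_wider(names_from=Target, values_from=n, values_fill=0)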
