R数据。表按类别递增,并将NA设置为最后一个非缺失值

时间:2022-01-13 20:39:00

[Background]

[背景]

I have some data of online activities from a group of users:

我有一些来自一群用户的在线活动数据:

  • userId indicates the ID of the user.
  • userId表示用户的ID。
  • pageType indicates the current page the user is on. home indicates the homepage, while content indicates the content page.
  • pageType指示用户正在访问的当前页面。home表示主页,而content表示内容页面。
  • The pages are already sorted by time, so row 1 happens before row 2, and row 2 happens before row 3, ...
  • 页面已经按时间排序,所以第1行在第2行之前,第2行在第3行之前……
  • Actual data has roughly 2 million rows, and 8 page types. userId is a 36-charactered java.util.UUID object.
  • 实际数据大约有200万行,8个页面类型。userId是一个36字符的java.util。对象UUID。

[Goal]

[目的]

I want to generate a new column for each pageType and count the number of previous page views (not including current) of the the exact same type.

我想为每个pageType生成一个新的列,并计算相同类型的以前的页面视图(不包括当前视图)的数量。

[Sample Data]

(样本数据)

To generate a sample of the actual data:

生成实际数据的样本:

library(data.table)
DT <- data.table("userId"=rep(1:3, each=10),
                 "pageType"=c("home", "content", "home", "content", "home", "home", "content", "content", "home", "home",
                              "content", "content", "home", "home", "content", "home", "home", "content", "home", "content",
                              "home", "home", "content", "content", "home", "home", "content", "content", "home", "content"))

> DT
    userId pageType
 1:      1     home
 2:      1  content
 3:      1     home
 4:      1  content
 5:      1     home
 6:      1     home
 7:      1  content
 8:      1  content
 9:      1     home
10:      1     home
...    ...      ...

[My Attempts]

(我的尝试)

I have tried to solve this problem in two ways, but both of them are too slow. I also feel my solution didn't use data.table the way it is designed for.

我试过用两种方法来解决这个问题,但这两种方法都太慢了。我也觉得我的解决方案没有使用数据。按设计的方式摆放。

Solution I

解我

  1. Filter by pageType and increment by userId.
  2. 按页面类型筛选,按用户id递增。
  3. Set missing values for the other pageType.
  4. 为其他pageType设置缺失值。

Below is the code:

下面是代码:

FixPageView <- function(data, type) {
  val <- 0
  for (i in 1:nrow(data)) {
    if (is.na(data[[type]][i])) {
      set(data, i, type, val)
    } else {
      val <- data[[type]][i]
    }
  }
}
DT[pageType=="home", numHomePagesViewed:=0:(.N-1), by=userId]
DT[pageType=="content", numContentPagesViewed:=0:(.N-1), by=userId]
FixPageView(DT, "numHomePagesViewed")
FixPageView(DT, "numContentPagesViewed")

> DT
    userId pageType numHomePagesViewed numContentPagesViewed
 1:      1     home                  0                     0
 2:      1  content                  0                     0
 3:      1     home                  1                     0
 4:      1  content                  1                     1
 5:      1     home                  2                     1
 6:      1     home                  3                     1
 7:      1  content                  3                     2
 8:      1  content                  3                     3
 9:      1     home                  4                     3
10:      1     home                  5                     3
...    ...      ...                ...                   ...

Solution II

解决方案二

Double for loop and set it row by row.

双for循环,并将它按行设置。

DT[, numHomePagesViewed := 0L][, numContentPagesViewed := 0L]
for (i in unique(DT$userId)) {
  home_inc <- -1L
  content_inc <- -1L
  for (j in 1L:nrow(DT[userId==i])) {
    if (DT$pageType[(i-1L)*10L + j] == "home") {
      home_inc <- home_inc + 1L
      set(DT, (i-1L)*10L + j, "numHomePagesViewed", home_inc)
    } else {
      set(DT, (i-1L)*10L + j, "numHomePagesViewed", max(0, home_inc))
    }
    if (DT$pageType[(i-1L)*10L + j] == "content") {
      content_inc <- content_inc + 1L
      set(DT, (i-1L)*10L + j, "numContentPagesViewed", content_inc)
    } else {
      set(DT, (i-1L)*10L + j, "numContentPagesViewed", max(0, content_inc))
    }
  }
}

> DT
    userId pageType numHomePagesViewed numContentPagesViewed
 1:      1     home                  0                     0
 2:      1  content                  0                     0
 3:      1     home                  1                     0
 4:      1  content                  1                     1
 5:      1     home                  2                     1
 6:      1     home                  3                     1
 7:      1  content                  3                     2
 8:      1  content                  3                     3
 9:      1     home                  4                     3
10:      1     home                  5                     3
...    ...      ...                ...                   ...

[Question]

[问题]

  1. What can I do to improve the speed?
  2. 我能做些什么来提高速度呢?
  3. Is there a more "data.table" way to solve this problem?
  4. 还有更多的“数据”吗?“解决这个问题的方法?”

1 个解决方案

#1


4  

I'd try:

我试一试:

DT[,lapply(unique(pageType),
   function(x) pmax(cumsum(pageType==x)-1,0)),by=userId]

Next, you have to rename the obtained columns.

接下来,您必须重命名已获得的列。

As suggested in the comments, you can assign the names with one line:

正如评论中提到的,你可以用一行来命名:

DT[, paste0("num",unique(DT$pageType),"PagesViewed") := 
      lapply(unique(pageType), function(x) pmax(cumsum(pageType==x)-1,0)), by=userId]

#1


4  

I'd try:

我试一试:

DT[,lapply(unique(pageType),
   function(x) pmax(cumsum(pageType==x)-1,0)),by=userId]

Next, you have to rename the obtained columns.

接下来,您必须重命名已获得的列。

As suggested in the comments, you can assign the names with one line:

正如评论中提到的,你可以用一行来命名:

DT[, paste0("num",unique(DT$pageType),"PagesViewed") := 
      lapply(unique(pageType), function(x) pmax(cumsum(pageType==x)-1,0)), by=userId]