Grouping by a string field in R

Time: 2022-09-16 07:34:47

I have a data frame like this:

         date      time userid        status
1  02/25/2012  09:22:10   aabc     logged_in
2  02/25/2012  09:30:10   aabc    logged_out
3  02/25/2012  09:29:20   abbc    logged_out
4  02/25/2012  09:27:30    abc     logged_in
5  02/25/2012  09:26:29    abc  login_failed
6  02/25/2012  09:26:39    abc  login_failed
7  02/25/2012  09:26:52    abc  login_failed
8  02/25/2012  09:27:09    abc  login_failed
9  02/25/2012  09:27:20    abc  login_failed
10 02/25/2012  09:24:10   abdc     logged_in
11 02/25/2012  09:24:12   abdc    logged_out
12 02/25/2012  09:22:10   abhc     logged_in
13 02/25/2012  09:30:10   abuc     logged_in
14 02/25/2012  09:30:14   abuc    logged_out
15 02/25/2012  09:29:40    baa     logged_in
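
To make the example reproducible, the table above can be read into a data frame. This is only a minimal sketch: the name mytbl comes from the question, while loading the data with read.table() and stringsAsFactors = FALSE is an assumption about how it was created.

```r
# Rebuild the example data from the question.
# Assumption: every column is kept as a character string, not a factor.
mytbl <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "date time userid status
02/25/2012 09:22:10 aabc logged_in
02/25/2012 09:30:10 aabc logged_out
02/25/2012 09:29:20 abbc logged_out
02/25/2012 09:27:30 abc logged_in
02/25/2012 09:26:29 abc login_failed
02/25/2012 09:26:39 abc login_failed
02/25/2012 09:26:52 abc login_failed
02/25/2012 09:27:09 abc login_failed
02/25/2012 09:27:20 abc login_failed
02/25/2012 09:24:10 abdc logged_in
02/25/2012 09:24:12 abdc logged_out
02/25/2012 09:22:10 abhc logged_in
02/25/2012 09:30:10 abuc logged_in
02/25/2012 09:30:14 abuc logged_out
02/25/2012 09:29:40 baa logged_in
")

# Inspect the column classes before grouping.
str(mytbl)
```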

I want to get the userid, the status and a "count" of the login failures for each userid. I did this:

ddply(mytbl, c('userid', 'status'), function(x) c(count=nrow(x)))

but this gives the count for all userids. I want to restrict my output to only those userids whose status is 'login_failed'. Any ideas? I have seen questions on grouping by numeric fields, but none on strings.

I am not very familiar with all the plyr features. It would be great to see how this can be done using summarize, aggregate, sqldf, data.table etc. I am slowly coming to understand each of them.

Thanks Sri

4 solutions

#1


4  

require(data.table)
DT = as.data.table(mytbl)

DT[status=="login_failed", .N, by=userid]

To name the column:

DT[status=="login_failed", list(failed_logins=.N), by=userid]

#2


2  

Slightly different approach than @Maiasaura. I filter to just the failed logins and then summarize. The difference would be whether those userid's with logins, but no failed logins, appear in the final result with 0's or not.

ddply(mytbl[mytbl$status=="login_failed",], .(userid), 
  summarise, failed_logins=length(status))

This gives

> ddply(mytbl[mytbl$status=="login_failed",], .(userid), 
+   summarise, failed_logins=length(status))
  userid failed_logins
1    abc             5

To complete the approaches, if you want all the userid's:

ddply(mytbl, .(userid), 
  summarise, failed_logins = sum(status=="login_failed"))

which gives

> ddply(mytbl, .(userid), 
+   summarise, failed_logins = sum(status=="login_failed"))
  userid failed_logins
1   aabc             0
2   abbc             0
3    abc             5
4   abdc             0
5   abhc             0
6   abuc             0
7    baa             0
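
The same distinction (only users with failures versus all users, with zeros) can be sketched in base R without plyr. This is an illustrative sketch, not part of the original answer; the compact data frame below is assumed to match the userid and status columns from the question.

```r
# Assumed copy of the question's userid/status columns.
mytbl <- data.frame(
  userid = c("aabc", "aabc", "abbc", rep("abc", 6),
             "abdc", "abdc", "abhc", "abuc", "abuc", "baa"),
  status = c("logged_in", "logged_out", "logged_out", "logged_in",
             rep("login_failed", 5), "logged_in", "logged_out",
             "logged_in", "logged_in", "logged_out", "logged_in"),
  stringsAsFactors = FALSE
)

# Filter first, then count: users without failures drop out.
failed_only <- table(mytbl$userid[mytbl$status == "login_failed"])

# Count over all rows: users without failures are kept with a 0.
all_users <- tapply(mytbl$status == "login_failed", mytbl$userid, sum)

print(failed_only)
print(all_users)
```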

#3


2  

ddply(mytbl, .(userid), transform, 
failed_logins = length(which(status=="login_failed")))

Following up on Brian Diggs' point, I wrote the above because I assumed you wanted this information appended to the original dataset. If not, and you just need a summary, replace transform with summarise.
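
For comparison, the append-versus-summarise difference can be sketched in base R. This is a hedged illustration rather than the answerer's method: the helper column failed and the result names appended and summary_tbl are made up for the example, and the data frame is an assumed copy of the question's data.

```r
# Assumed copy of the question's userid/status columns.
mytbl <- data.frame(
  userid = c("aabc", "aabc", "abbc", rep("abc", 6),
             "abdc", "abdc", "abhc", "abuc", "abuc", "baa"),
  status = c("logged_in", "logged_out", "logged_out", "logged_in",
             rep("login_failed", 5), "logged_in", "logged_out",
             "logged_in", "logged_in", "logged_out", "logged_in"),
  stringsAsFactors = FALSE
)

# Hypothetical helper column: 1 for a failed login, 0 otherwise.
mytbl$failed <- as.integer(mytbl$status == "login_failed")

# "transform" style: keep all 15 rows and append the per-user total.
appended <- transform(mytbl, failed_logins = ave(failed, userid, FUN = sum))

# "summarise" style: one row per userid.
summary_tbl <- aggregate(failed ~ userid, data = mytbl, FUN = sum)

print(head(appended))
print(summary_tbl)
```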

#4


2  

Here is a base R solution using aggregate():

setNames(aggregate(status ~ userid,
                   mytbl[mytbl$status == "login_failed", ],
                   function(x) length(x)),
         c("userid", "failed_logins"))
#   userid failed_logins
# 1    abc             5

Update

Another useful function that comes to mind is ave() which you can use in the following way:

  • First, use ave() to add a new column to your dataset that holds a running count of each activity for each user. (Note: I had to make sure the "userid" and "status" columns were of class character, not factor, to get this to work for me.)

    mytbl$status_seq <- ave(mytbl$status, mytbl$userid, 
                            mytbl$status, FUN = seq_along)
    head(mytbl)
    #         date     time userid       status status_seq
    # 1 02/25/2012 09:22:10   aabc    logged_in          1
    # 2 02/25/2012 09:30:10   aabc   logged_out          1
    # 3 02/25/2012 09:29:20   abbc   logged_out          1
    # 4 02/25/2012 09:27:30    abc    logged_in          1
    # 5 02/25/2012 09:26:29    abc login_failed          1
    # 6 02/25/2012 09:26:39    abc login_failed          2
    
  • Second, use aggregate() as demonstrated earlier, subsetting for the condition that you're interested in, and retrieving the max value.

    aggregate(status_seq ~ userid, 
              mytbl[mytbl$status == "login_failed", ],
              function(x) max(x))
    #   userid status_seq
    # 1    abc          5
    
    aggregate(status_seq ~ userid, 
              mytbl[mytbl$status == "logged_out", ],
              function(x) max(x))
    #   userid status_seq
    # 1   aabc          1
    # 2   abbc          1
    # 3   abdc          1
    # 4   abuc          1
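
A variation on the two steps above that may sidestep the factor/character issue: pass a numeric vector as the first argument of ave(), so the counter is numeric whatever the class of the grouping columns. This is a sketch under assumptions; the data frame is an assumed copy of the question's data, deliberately built with factor columns, and failed_max is a name made up for the example. Since ave() returns a vector of the same type as its first argument, the calls above appear to store the counter as character, which could make max() misleading once a count reaches 10.

```r
# Assumed copy of the question's data, with factor columns on purpose.
mytbl <- data.frame(
  userid = c("aabc", "aabc", "abbc", rep("abc", 6),
             "abdc", "abdc", "abhc", "abuc", "abuc", "baa"),
  status = c("logged_in", "logged_out", "logged_out", "logged_in",
             rep("login_failed", 5), "logged_in", "logged_out",
             "logged_in", "logged_in", "logged_out", "logged_in"),
  stringsAsFactors = TRUE
)

# Use row indices as the first argument so the result stays integer.
mytbl$status_seq <- ave(seq_len(nrow(mytbl)), mytbl$userid, mytbl$status,
                        FUN = seq_along)

# Largest running count per user among the failed logins.
failed_max <- aggregate(status_seq ~ userid,
                        mytbl[mytbl$status == "login_failed", ],
                        max)
print(failed_max)
```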
    

Note that ave() might be even more interesting if you used

mytbl$status_seq <- ave(mytbl$status, mytbl$date, mytbl$userid, mytbl$status, 
                        FUN = seq_along)

since that will reset the counter for each new day in your dataset.
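
A small sketch of that per-day reset, using made-up data that spans two days (the logins data frame below is hypothetical and not from the question). The row index is passed as the first argument so that the counter is numeric.

```r
# Hypothetical data: one user failing to log in on two different days.
logins <- data.frame(
  date   = c("02/25/2012", "02/25/2012",
             "02/26/2012", "02/26/2012", "02/26/2012"),
  userid = "abc",
  status = "login_failed",
  stringsAsFactors = FALSE
)

# Grouping by date as well restarts the counter on each new day.
logins$status_seq <- ave(seq_len(nrow(logins)),
                         logins$date, logins$userid, logins$status,
                         FUN = seq_along)
print(logins)
```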

Finally (at the risk of sharing a solution that might be too obvious), since you're only interested in counts, you might want to explore table(), which gives you all the information at once:

table(mytbl$userid, mytbl$status)
# 
#      logged_in logged_out login_failed
# aabc         1          1            0
# abbc         0          1            0
# abc          1          0            5
# abdc         1          1            0
# abhc         1          0            0
# abuc         1          1            0
# baa          1          0            0
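
If only the failed logins are needed, the relevant column can be pulled out of that table. A minimal sketch, assuming a copy of the question's data; the names counts and failed are made up for the example.

```r
# Assumed copy of the question's userid/status columns.
mytbl <- data.frame(
  userid = c("aabc", "aabc", "abbc", rep("abc", 6),
             "abdc", "abdc", "abhc", "abuc", "abuc", "baa"),
  status = c("logged_in", "logged_out", "logged_out", "logged_in",
             rep("login_failed", 5), "logged_in", "logged_out",
             "logged_in", "logged_in", "logged_out", "logged_in"),
  stringsAsFactors = FALSE
)

counts <- table(mytbl$userid, mytbl$status)

# One column of the cross-tabulation: failed logins per userid.
failed <- counts[, "login_failed"]

# Keep only the users with at least one failure.
print(failed[failed > 0])
```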
