Time difference between dataframe rows

Time: 2023-02-07 15:20:31

I have been searching the R part of * for quite a while looking for a proper answer, but nothing I saw seems to apply to my problem. I have a dataset in this format (I have adapted it to what seems the easiest form to work with; the stop_sequence values are normally just incremental numbers for each stop):

route_short_name    trip_id                     direction_id    departure_time  stop_sequence 
 33A                1.1598.0-33A-b12-1.451.I            1       16:15:00         start
 33A                1.1598.0-33A-b12-1.451.I            1       16:57:00           end
 41C                10.3265.0-41C-b12-1.277.I           1       08:35:00         start
 41C                10.3265.0-41C-b12-1.277.I           1       09:26:00           end
 41C                100.3260.0-41C-b12-1.276.I          1       09:40:00         start
 41C                100.3260.0-41C-b12-1.276.I          1       10:53:00           end
 114                1000.987.0-114-b12-1.86.O           0       21:35:00         start
 114                1000.987.0-114-b12-1.86.O           0       22:02:00           end
 39                 10000.2877.0-39-b12-1.242.I         1       11:15:00         start
 39                 10000.2877.0-39-b12-1.242.I         1       12:30:00           end
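
For anyone who wants to reproduce this, the sample above can be rebuilt roughly as follows. This is only a sketch: the name df1 is a placeholder, and I am assuming the times are stored as text and direction_id as an integer.

```r
# Sample rows from the table above; df1 is just a placeholder name.
df1 <- data.frame(
  route_short_name = c("33A", "33A", "41C", "41C", "41C", "41C",
                       "114", "114", "39", "39"),
  trip_id = rep(c("1.1598.0-33A-b12-1.451.I",
                  "10.3265.0-41C-b12-1.277.I",
                  "100.3260.0-41C-b12-1.276.I",
                  "1000.987.0-114-b12-1.86.O",
                  "10000.2877.0-39-b12-1.242.I"), each = 2),
  direction_id = rep(c(1L, 1L, 1L, 0L, 1L), each = 2),
  # assumption: times are kept as "HH:MM:SS" strings
  departure_time = c("16:15:00", "16:57:00", "08:35:00", "09:26:00",
                     "09:40:00", "10:53:00", "21:35:00", "22:02:00",
                     "11:15:00", "12:30:00"),
  stop_sequence = rep(c("start", "end"), times = 5),
  stringsAsFactors = FALSE
)
```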

It is basically a dataset of bus trips. All I want is the duration of each trip in minutes, something like this:

route_short_name    trip_id                    direction_id    duration
33A                1.1598.0-33A-b12-1.451.I            1        42
41C                10.3265.0-41C-b12-1.277.I           1        51
41C                100.3260.0-41C-b12-1.276.I          1        73
114                1000.987.0-114-b12-1.86.O           0        27
39                 10000.2877.0-39-b12-1.242.I         1        75

I have tried a lot of things, but in no case have I managed to group the data by trip_id and then work on the two values of each trip. I must have misunderstood something, but I do not know what.

Does anyone have a clue?

2 solutions

#1


We can also do this without converting to 'wide' format (assuming that the 'stop_sequence' is 'start' followed by 'end' for each 'route_short_name', 'trip_id', and 'direction_id').

Convert 'departure_time' to a datetime column, group by 'route_short_name', 'trip_id', and 'direction_id', and take the difftime between the last and the first 'departure_time' of each group.

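library(dplyr)

df1 %>%
    # parse "HH:MM:SS" strings as date-times (today's date is filled in)
    mutate(departure_time = as.POSIXct(departure_time, format = '%H:%M:%S')) %>%
    group_by(route_short_name, trip_id, direction_id) %>%
    # minutes between the last and the first stop of each trip
    summarise(duration = as.numeric(difftime(last(departure_time), first(departure_time), units = 'mins')))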
# A tibble: 5 x 4
# Groups:   route_short_name, trip_id [?]
#  route_short_name                     trip_id direction_id duration
#             <chr>                       <chr>        <int>    <dbl>
#1              114   1000.987.0-114-b12-1.86.O            0       27
#2              33A    1.1598.0-33A-b12-1.451.I            1       42
#3               39 10000.2877.0-39-b12-1.242.I            1       75
#4              41C   10.3265.0-41C-b12-1.277.I            1       51
#5              41C  100.3260.0-41C-b12-1.276.I            1       73
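
The question mentions that 'stop_sequence' is normally just an incremental number per stop. In that case the same idea should still work if the rows are first sorted by 'stop_sequence' within each trip, so that first() and last() really are the first and last stops. A minimal sketch, with a made-up `trips` data frame (hypothetical IDs "A" and "B", rows deliberately shuffled) and `tz = "UTC"` added as my own choice to keep daylight-saving changes out of the subtraction:

```r
library(dplyr)

# Hypothetical trips where stop_sequence is a plain counter (1, 2, 3, ...);
# the rows are deliberately out of order.
trips <- data.frame(
  trip_id = c("A", "A", "A", "B", "B"),
  stop_sequence = c(3L, 1L, 2L, 2L, 1L),
  departure_time = c("16:57:00", "16:15:00", "16:30:00", "09:26:00", "08:35:00"),
  stringsAsFactors = FALSE
)

durations <- trips %>%
  mutate(departure_time = as.POSIXct(departure_time, format = "%H:%M:%S", tz = "UTC")) %>%
  arrange(trip_id, stop_sequence) %>%   # put the stops in travel order
  group_by(trip_id) %>%
  summarise(duration = as.numeric(difftime(last(departure_time),
                                           first(departure_time),
                                           units = "mins")))
```

Here `durations` has one row per trip. Note that this, like the code above, assumes a trip does not run past midnight; otherwise the end time would parse as earlier than the start.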

#2


Try this. Right now your dataframe is in "long" format, but the time difference is easier to calculate in "wide" format. The spread function from tidyr (loaded with the tidyverse package) takes your data from long to wide. From there you can use the mutate function to add the new column you want. Assuming the times are stored as text, they have to be parsed with as.POSIXct first; as.numeric(difftime(end, start, units = 'mins')) then returns the difference in minutes.

library(tidyverse)

wide_df <- 
  spread(your_df, key = stop_sequence, value = departure_time) %>% 
  # parse the "HH:MM:SS" strings; units = 'mins' fixes the unit
  mutate(across(c(start, end), ~ as.POSIXct(.x, format = '%H:%M:%S')),
         timediff = as.numeric(difftime(end, start, units = 'mins')))

If you want to learn more about "tidy" data (and spreading and gathering), see this link to Hadley's book
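
As far as I know, spread() has been superseded in newer tidyr releases (1.0.0 and later) by pivot_wider(), so the same reshaping could be sketched as below. The small `your_df` here is only a stand-in built from two trips in the question, and `tz = "UTC"` is my own addition.

```r
library(dplyr)
library(tidyr)

# Two trips from the question, in long format; your_df stands in for the real data.
your_df <- data.frame(
  trip_id = rep(c("1.1598.0-33A-b12-1.451.I", "10.3265.0-41C-b12-1.277.I"), each = 2),
  departure_time = c("16:15:00", "16:57:00", "08:35:00", "09:26:00"),
  stop_sequence = rep(c("start", "end"), times = 2),
  stringsAsFactors = FALSE
)

wide_df <- your_df %>%
  # one row per trip, with a 'start' and an 'end' column
  pivot_wider(names_from = stop_sequence, values_from = departure_time) %>%
  mutate(
    start = as.POSIXct(start, format = "%H:%M:%S", tz = "UTC"),
    end = as.POSIXct(end, format = "%H:%M:%S", tz = "UTC"),
    timediff = as.numeric(difftime(end, start, units = "mins"))
  )
```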
