
Time: 2023-02-07 15:20:31

I have been searching the R part of * for quite a while looking for a proper answer, but nothing I saw seems to apply to my problem. I have a dataset in this format (I have adapted it to what seems the easiest form to work with; the stop_sequence values are normally just incremental numbers for each stop):


route_short_name    trip_id                     direction_id    departure_time  stop_sequence 
 33A                1.1598.0-33A-b12-1.451.I            1       16:15:00         start
 33A                1.1598.0-33A-b12-1.451.I            1       16:57:00           end
 41C                10.3265.0-41C-b12-1.277.I           1       08:35:00         start
 41C                10.3265.0-41C-b12-1.277.I           1       09:26:00           end
 41C                100.3260.0-41C-b12-1.276.I          1       09:40:00         start
 41C                100.3260.0-41C-b12-1.276.I          1       10:53:00           end
 114                1000.987.0-114-b12-1.86.O           0       21:35:00         start
 114                1000.987.0-114-b12-1.86.O           0       22:02:00           end
 39                 10000.2877.0-39-b12-1.242.I         1       11:15:00         start
 39                 10000.2877.0-39-b12-1.242.I         1       12:30:00           end

It is basically a bus trips dataset. All I want is to get the duration of each trip in minutes, something like this:


route_short_name    trip_id                    direction_id    duration
33A                1.1598.0-33A-b12-1.451.I            1        42
41C                10.3265.0-41C-b12-1.277.I           1        51
41C                100.3260.0-41C-b12-1.276.I          1        73
114                1000.987.0-114-b12-1.86.O           0        27
39                 10000.2877.0-39-b12-1.242.I         1        75

I have tried a lot of things, but in no case have I managed to group the data by trip_id and then work on the two values of each group. I must have misunderstood something, but I do not know what.


Does anyone have a clue?


2 solutions



We can also do this without converting to 'wide' format, assuming that the 'stop_sequence' is 'start' followed by 'end' within each combination of 'route_short_name', 'trip_id', and 'direction_id'.


Convert 'departure_time' to a datetime column, group by 'route_short_name', 'trip_id', and 'direction_id', and take the difftime between the last and the first 'departure_time' of each group, in minutes.


library(dplyr)
df1 %>%
    mutate(departure_time = as.POSIXct(departure_time, format = '%H:%M:%S')) %>%
    group_by(route_short_name, trip_id, direction_id) %>%
    summarise(duration = as.numeric(difftime(last(departure_time), first(departure_time), units = 'mins')))
# A tibble: 5 x 4
# Groups:   route_short_name, trip_id [?]
#  route_short_name                     trip_id direction_id duration
#             <chr>                       <chr>        <int>    <dbl>
#1              114   1000.987.0-114-b12-1.86.O            0       27
#2              33A    1.1598.0-33A-b12-1.451.I            1       42
#3               39 10000.2877.0-39-b12-1.242.I            1       75
#4              41C   10.3265.0-41C-b12-1.277.I            1       51
#5              41C  100.3260.0-41C-b12-1.276.I            1       73
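
The question mentions that stop_sequence is normally an incremental number per stop rather than 'start'/'end'. Here is a minimal base R sketch for that case. The data frame `stops`, its intermediate 16:30:00 stop, and the helper `to_minutes` are made up for illustration and are not from the original post. The helper turns "HH:MM:SS" into minutes since midnight by plain arithmetic, so the result does not depend on the current date or time zone, and hours of 24 or more (which some transit feeds use for trips running past midnight) would not break it the way `%H` parsing would.

```r
# Hypothetical sample: same columns as in the question, but numeric stop_sequence
stops <- data.frame(
  route_short_name = c("33A", "33A", "33A", "41C", "41C"),
  trip_id = c(rep("1.1598.0-33A-b12-1.451.I", 3),
              rep("10.3265.0-41C-b12-1.277.I", 2)),
  direction_id = 1L,
  departure_time = c("16:15:00", "16:30:00", "16:57:00", "08:35:00", "09:26:00"),
  stop_sequence = c(1L, 2L, 3L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Illustrative helper: "HH:MM:SS" -> minutes since midnight
to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(60, 1, 1 / 60)), numeric(1))
}

stops$minutes <- to_minutes(stops$departure_time)

# One row per trip: time at the highest stop_sequence minus time at the lowest
trip_duration <- do.call(rbind, lapply(split(stops, stops$trip_id), function(trip) {
  trip <- trip[order(trip$stop_sequence), ]
  data.frame(
    route_short_name = trip$route_short_name[1],
    trip_id = trip$trip_id[1],
    direction_id = trip$direction_id[1],
    duration = trip$minutes[nrow(trip)] - trip$minutes[1],
    stringsAsFactors = FALSE
  )
}))
rownames(trip_duration) <- NULL
```

Ordering by stop_sequence inside each trip means the rows do not have to be pre-sorted, which the first()/last() approach above silently relies on.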



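Try this. Right now your data frame is in "long" format, but the time difference is easier to calculate in "wide" format. The spread function from the tidyr package (part of the tidyverse) takes your data from long to wide. From there you can use the mutate function to add the new column you want. If the times are still character strings they have to be parsed first, and passing units = "mins" to difftime keeps the difference in minutes; without it, difftime chooses the unit automatically.



library(tidyr); library(dplyr)
wide_df <- 
  spread(your_df, key = stop_sequence, value = departure_time) %>% 
  mutate(timediff = as.numeric(difftime(as.POSIXct(end, format = "%H:%M:%S"),
                                        as.POSIXct(start, format = "%H:%M:%S"),
                                        units = "mins")))  # ask for minutes explicitly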

If you want to learn more about "tidy" data (and spreading and gathering), see this link to Hadley's book
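
In tidyr 1.0.0 and later, spread is superseded by pivot_wider. The sketch below shows what the same long-to-wide idea might look like with it, using two trips copied from the question's sample data. The name `your_df` follows the answer above; fixing the time zone to "UTC" is my own assumption, made so that a daylight-saving change cannot distort the difference.

```r
library(dplyr)
library(tidyr)

# Two trips copied from the sample data in the question
your_df <- tibble(
  route_short_name = c("33A", "33A", "41C", "41C"),
  trip_id = c("1.1598.0-33A-b12-1.451.I", "1.1598.0-33A-b12-1.451.I",
              "10.3265.0-41C-b12-1.277.I", "10.3265.0-41C-b12-1.277.I"),
  direction_id = c(1L, 1L, 1L, 1L),
  departure_time = c("16:15:00", "16:57:00", "08:35:00", "09:26:00"),
  stop_sequence = c("start", "end", "start", "end")
)

wide_df <- your_df %>%
  # one row per trip, with 'start' and 'end' as separate columns
  pivot_wider(names_from = stop_sequence, values_from = departure_time) %>%
  mutate(
    start = as.POSIXct(start, format = "%H:%M:%S", tz = "UTC"),
    end = as.POSIXct(end, format = "%H:%M:%S", tz = "UTC"),
    # units = "mins" so the result is always in minutes
    duration = as.numeric(difftime(end, start, units = "mins"))
  ) %>%
  select(route_short_name, trip_id, direction_id, duration)
```

This only works when every trip has exactly one 'start' and one 'end' row; with numeric stop_sequence values, the grouped approach is the better fit.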




