Aggregating and reorganizing hourly time series data in R

Time: 2022-04-03 17:17:39

I have a year's worth of hourly data in a data frame in R:

> str(df.MHwind_load)   # compactly displays structure of data frame
'data.frame':   8760 obs. of  6 variables:
 $ Date         : Factor w/ 365 levels "2010-04-01","2010-04-02",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Time..HRs.   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Hour.of.Year : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Wind.MW      : int  375 492 483 476 486 512 421 396 456 453 ...
 $ MSEDCL.Demand: int  13293 13140 12806 12891 13113 13802 14186 14104 14117 14462 ...
 $ Net.Load     : int  12918 12648 12323 12415 12627 13290 13765 13708 13661 14009 ...

While preserving the hourly structure, I would like to know how to extract

  1. a particular month/group of months

  2. the first day/first week etc. of each month

  3. all Mondays, all Tuesdays etc. of the year

I have tried using "cut" without result and after looking online think that "lubridate" might be able to do so but haven't found suitable examples. I'd greatly appreciate help on this issue.

Edit: a sample of data in the data frame is below:

           Date Hour.of.Year Wind.MW            datetime
1    2010-04-01            1     375 2010-04-01 00:00:00
2    2010-04-01            2     492 2010-04-01 01:00:00
3    2010-04-01            3     483 2010-04-01 02:00:00
4    2010-04-01            4     476 2010-04-01 03:00:00
5    2010-04-01            5     486 2010-04-01 04:00:00
6    2010-04-01            6     512 2010-04-01 05:00:00
7    2010-04-01            7     421 2010-04-01 06:00:00
8    2010-04-01            8     396 2010-04-01 07:00:00
9    2010-04-01            9     456 2010-04-01 08:00:00
10   2010-04-01           10     453 2010-04-01 09:00:00
...         ...          ...     ...                 ...
8758 2011-03-31         8758     302 2011-03-31 21:00:00
8759 2011-03-31         8759     378 2011-03-31 22:00:00
8760 2011-03-31         8760     356 2011-03-31 23:00:00

EDIT: Additional time-based operations I would like to perform on the same dataset:

  1. Perform hour-by-hour averaging for all data points, i.e. the average of all values in the first hour of each day in the year. The output will be an "hourly profile" of the entire year (24 time points).

  2. Perform the same for each week and each month, i.e. obtain 52 and 12 hourly profiles respectively.

  3. Do seasonal averages, for example for June to September.
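
The averaging described in this edit can be sketched in base R with tapply. This is only an illustration under stated assumptions: the data frame hourly, its placeholder Wind.MW values, and the hour and month helper columns are made up here and stand in for df.MHwind_load. A weekly grouping would work the same way once a week index column exists.

```r
# Hypothetical hourly frame standing in for df.MHwind_load (UTC avoids DST gaps)
hourly <- data.frame(
  datetime = seq(as.POSIXct("2010-04-01 00:00:00", tz = "UTC"),
                 by = "hour", length.out = 8760)
)
lt <- as.POSIXlt(hourly$datetime)     # broken-down time fields
hourly$hour  <- lt$hour               # 0..23
hourly$month <- lt$mon + 1            # 1..12 (mon is 0-based)
hourly$Wind.MW <- 100 + hourly$hour   # placeholder values, not real data

# 1. One 24-point hourly profile for the whole year
year_profile <- tapply(hourly$Wind.MW, hourly$hour, mean)

# 2. One 24-point profile per month (a 12 x 24 matrix)
month_profile <- tapply(hourly$Wind.MW,
                        list(hourly$month, hourly$hour), mean)

# 3. Seasonal profile, June to September
season <- hourly[hourly$month %in% 6:9, ]
season_profile <- tapply(season$Wind.MW, season$hour, mean)
```

Each result keeps the hour of day as its name (or column name), so the profiles can be plotted or compared directly.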

3 answers

#1


6  

Convert the date to the format which lubridate understands and then use the functions month, mday, wday respectively.

Suppose you have a data.frame with the time stored in column Date, then the answer for your questions would be:

 ###dummy data.frame
 library(lubridate)
 df <- data.frame(Date=as.Date(c("2012-01-01","2012-02-15","2012-03-01","2012-04-01")),a=1:4)
 ##1. Select rows for particular month
 subset(df,month(Date)==1)

 ##2a. Select the first day of each month
 subset(df,mday(Date)==1)

 ##2b. Select the first week of each month
 ##get the week numbers which have the first day of the month
 wkd <- subset(week(df$Date),mday(df$Date)==1)
 ##select the weeks with particular numbers
 ##(week() counts 7-day blocks from January 1, so a block can begin in the previous month)
 subset(df,week(Date) %in% wkd)

 ##3. Select all mondays
 ##(wday() numbers days from Sunday = 1 by default, so Monday is 2)
 subset(df,wday(Date)==2)
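
The same three selections can be applied to hourly timestamps so that every hour of the selected days is kept. The sketch below uses base R's POSIXlt fields rather than lubridate so that it runs without extra packages; the data frame hourly and its placeholder values are assumptions made for illustration, not the asker's data.

```r
# Hypothetical hourly data frame covering 2010-04-01 to 2011-03-31 (UTC)
hourly <- data.frame(
  datetime = seq(as.POSIXct("2010-04-01 00:00:00", tz = "UTC"),
                 by = "hour", length.out = 8760)
)
hourly$Wind.MW <- seq_len(nrow(hourly))  # placeholder values, not real data

lt <- as.POSIXlt(hourly$datetime)        # broken-down time fields

# 1. All hours of a month or group of months (mon is 0-based: 0 = January)
jun_sep <- hourly[(lt$mon + 1) %in% 6:9, ]

# 2. All hours of the first seven days of every month
first_week <- hourly[lt$mday <= 7, ]

# 3. All hours of every Monday (wday: 0 = Sunday, 1 = Monday)
mondays <- hourly[lt$wday == 1, ]
```

Selecting on the day of the month (mday <= 7) is one possible reading of "first week"; it never reaches back into the previous month.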

#2


6  

  1. First switch to a Date representation: as.Date(df.MHwind_load$Date)

  2. Then call weekdays on the date vector to get a new character vector labelled with day of week

  3. Then call months on the date vector to get a new character vector labelled with name of month

  4. Optionally create a years variable (see below).

Now subset the data frame using the relevant combination of these. Step 2. gets an answer to your task 3. Steps 3. and 4. get you to task 1. Task 2 might require a line or two of R. Or just select rows corresponding to, say, all the Mondays in a month and call unique, or its alter-ego duplicated on the results.

To get you going...

newdf <- df.MHwind_load ## build an augmented data set
newdf$d <- as.Date(newdf$Date)
newdf$month <- months(newdf$d)   # month names depend on the locale
newdf$day <- weekdays(newdf$d)   # so do weekday names

## base R has no years function.  Here's one
years <- function(x){ format(as.Date(x), format = "%Y") }

newdf$year <- years(newdf$d)

# get observations from January to March of every year
subset(newdf, month %in% c('January', 'February', 'March'))

# get all Monday observations
subset(newdf, day == 'Monday')

# get all Mondays in 2010
subset(newdf, day == 'Monday' & year == '2010')

# slightly fancier: _first_ Monday of each month
# flag the first row of each month/weekday combination
# (with hourly data this is only the first hour of that day)
first.week.of.month <- !duplicated(cbind(newdf$month, newdf$day))
# now pull out the mondays
subset(newdf, first.week.of.month & day=='Monday')
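
If all 24 hourly rows of each first Monday are wanted, one alternative (a sketch, not the approach used above) is to test the day of the month instead of calling duplicated, since the first Monday of a month always falls on day 1 to 7. The example data frame below is made up for illustration and reuses the names newdf and d only for readability.

```r
# Hypothetical frame: one row per hour, with the Date of each row in column d
d <- seq(as.Date("2010-04-01"), as.Date("2011-03-31"), by = "day")
newdf <- data.frame(d = rep(d, each = 24),
                    hour = rep(0:23, times = length(d)))

lt <- as.POSIXlt(newdf$d)
is.monday  <- lt$wday == 1    # 0 = Sunday, 1 = Monday (locale-independent)
in.first.7 <- lt$mday <= 7    # the first Monday always falls on day 1..7

# every hourly row belonging to the first Monday of each month
first.mondays <- newdf[is.monday & in.first.7, ]
```

Because the test uses numeric weekday codes rather than weekday names, it does not depend on the session's locale.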

#3


3  

Since you're not asking about the time (hourly) part of your data, it is best to store your dates as Date objects. Otherwise, you might be interested in chron, which also has some convenience functions like the ones you'll see below.

With respect to Conjugate Prior's answer, you should store your date data as a Date object. Since your data already follows the default format ('yyyy-mm-dd'), you can just call as.Date on it. Otherwise, you would have to specify your string format. I would also use as.character on your factor to make sure you don't get conversion errors. I've run into problems converting factors to Dates for that reason (possibly corrected in the current version).

df.MHwind_load <- transform(df.MHwind_load, Date = as.Date(as.character(Date)))
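
As a small illustration of the conversion, the sketch below turns a factor of date strings into Dates, first in the default layout and then in a non-default layout that needs an explicit format string. The vectors f and g are made-up examples.

```r
# A factor of ISO-formatted date strings, as in the question's Date column
f <- factor(c("2010-04-01", "2010-04-01", "2010-04-02"))
d1 <- as.Date(as.character(f))   # default 'yyyy-mm-dd': no format needed

# A non-default layout (day/month/year) needs an explicit format string
g <- factor(c("01/04/2010", "02/04/2010"))
d2 <- as.Date(as.character(g), format = "%d/%m/%Y")

# Why as.character matters: a factor stores integer level codes,
# and numeric conversion exposes the codes, not the dates
as.numeric(f)   # 1 1 2
```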

Now you would do well to create wrapper functions that extract the information you desire. You could use transform like I did above to simply add those columns that represent months, days, years, etc, and then subset on them logically. Alternatively, you might do something like this:

getMonth <- function(x, mo) {  # This function assumes w/in single year vector
  isMonth <- month(x) %in% mo  # Boolean of matching months; month() is not base R (e.g. lubridate::month)
  return(x[which(isMonth)])    # Return vector of matching months
}  # end function

Or, in short form

getMonth <- function(x, mo) x[month(x) %in% mo]

This is just a tradeoff between storing that information (transform frame) or having it processed when desired (use accessor methods).
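
The two options can be compared side by side. In this sketch the data frame df, its placeholder values, and the helper month_num are all hypothetical; month_num is a base R stand-in for a month accessor.

```r
d <- seq(as.Date("2010-04-01"), as.Date("2011-03-31"), by = "day")
df <- data.frame(Date = d, value = seq_along(d))  # placeholder values

# hypothetical helper: month number of a Date, 1 = January
month_num <- function(x) as.POSIXlt(x)$mon + 1

# Option 1: store the month as a column once, then subset on it
stored <- transform(df, month = month_num(Date))
a <- subset(stored, month %in% c(6, 7))

# Option 2: compute the month only at the moment it is needed
b <- df[month_num(df$Date) %in% c(6, 7), ]
```

Both return the same rows; the first costs an extra column, the second recomputes the month on every call.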

A more complicated case is extracting, say, the first day of a month. This is not especially difficult, though. Below is a function that returns all of the values falling on that day; it is simpler still to subset a sorted vector of values for a given month and take the first one.

getFirstDay <- function(x, mo) {
  isMonth <- months(x) %in% mo
  x <- sort(x[isMonth])  # Look at only those in the desired month.
                         # Sort them by date. We only want the first day.
  nFirsts <- rle(as.numeric(x))$lengths[1]  # Number of entries on the 1st day
  return(x[seq_len(nFirsts)])
}  # end function
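
A quick usage sketch for this function on hourly-style input, where each Date is repeated once per hour. The vector x is made up for illustration, and the month name is obtained from months() itself so that the example does not depend on the session's locale.

```r
# Compact restatement of getFirstDay so that this example runs on its own
getFirstDay <- function(x, mo) {
  x <- sort(x[months(x) %in% mo])            # dates in the requested month
  x[seq_len(rle(as.numeric(x))$lengths[1])]  # the run of the earliest date
}

days <- seq(as.Date("2010-04-01"), as.Date("2010-05-31"), by = "day")
x <- rep(days, each = 24)                    # one Date per hourly row

may <- months(as.Date("2010-05-01"))         # month name in the current locale
first <- getFirstDay(x, may)                 # 24 copies of 2010-05-01
```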

The easier alternative would be

getFirstDayOnly <- function(x, mo) {sort(x[months(x) %in% mo])[1]}

I haven't prototyped these, as you didn't provide any data samples, but this is the sort of approach that can help you get the information you desire. It is up to you to figure out how to put these into your work flow. For instance, say you want to get the first day for each month of a given year (assuming we're only looking at one year; you can create wrappers or pre-process your vector to a single year beforehand).

# Return a vector of first days for each month
df <- transform(df, date = as.Date(as.character(date)))
firsts <- sapply(unique(months(df$date)),  # Iterate through months in Dates
                 function(month) {getFirstDayOnly(df$date, month)})
as.Date(firsts, origin = "1970-01-01")  # sapply drops the Date class

The above could also be designed as a separate convenience function that uses the other accessor function. In this way, you create a series of direct but concise methods for getting pieces of the information you want. Then you simply pull them together to create very simple and easy-to-interpret functions that you can use in your scripts to get precisely what you desire in the most efficient manner.
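
One way such a convenience function might look is sketched below. The name getFirstDays is hypothetical, and like the code above it assumes the vector covers at most one occurrence of each month.

```r
# Accessor from above: earliest date found in the given month
getFirstDayOnly <- function(x, mo) sort(x[months(x) %in% mo])[1]

# Hypothetical wrapper: earliest date present for every month in x
getFirstDays <- function(x) {
  firsts <- lapply(unique(months(x)),
                   function(mo) getFirstDayOnly(x, mo))
  do.call(c, firsts)   # c() on Dates keeps the Date class
}

x <- seq(as.Date("2010-04-01"), as.Date("2011-03-31"), by = "day")
firsts <- getFirstDays(x)   # twelve Dates, one per month
```

Using lapply with do.call(c, ...) rather than sapply is a deliberate choice here, since it returns Dates instead of plain day counts.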

You should be able to use the above examples to figure out how to prototype other wrappers for accessing the date information you require. If you need help on those, feel free to ask in a comment.
