Vectorising some data processing in a numpy threshold analysis

Time: 2022-05-13 15:28:21

Basically I have some gridded meteorological data with dimensions (time, lat, lon).

  1. I need to go through the timeseries at each gridsquare, identify runs of consecutive days ("events") when the variable is above a threshold, and store the run lengths in a new variable (THdays)
  2. Then I look through the new variable and find the events which are longer than a certain duration (THevents)
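
For a single timeseries, the two steps above can be sketched without a Python-level loop. This is only an illustration: the helper name run_lengths, the sample values and the threshold are made up for the example and are not part of the original code. It pads the boolean mask with False at both ends so that every run has a clear start and end, then uses np.diff to locate them.

```python
import numpy as np

def run_lengths(above):
    """Lengths of each run of consecutive True values in a 1-D boolean array."""
    # Pad with False so runs touching either end still have a start and an end
    padded = np.concatenate(([False], above, [False])).astype(np.int8)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # False -> True transitions
    ends = np.flatnonzero(edges == -1)    # True -> False transitions
    return ends - starts

# Hypothetical sample series and threshold, for illustration only
series = np.array([0.1, 0.9, 0.8, 0.2, 0.95, 0.3, 0.7, 0.9, 0.99])
threshold = 0.6

events = run_lengths(series > threshold)  # step 1: lengths of consecutive exceedances
long_events = events[events >= 2]         # step 2: keep events of at least 2 days
```

Comparisons against NaN are always False, so an all-NaN series simply yields an empty array of events.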

Currently I have a super scrappy (non-vectorised) iteration and I'd appreciate your advice on how to speed this up. Thanks!

import numpy as np
import itertools as it
##### Parameters
lg = 2000  # length to initialise array (must be long to store large number of events)
rl = 180  # e.g latitude
cl = 360  # longitude
pcts = [95, 97, 99] # percentiles which are the thresholds that will be compared
dt = [1,2,3] #duration thresholds, i.e. consecutive values (days) above threshold

##### Data
data = np.random.rand(1000, rl, cl)  # placeholder for the real gridded data, which is (time, lat, lon)
# From this data calculate the percentiles at each gridsquare (lat/lon combination) which will act as our thresholds
histpcts = np.percentile(data, q=pcts, axis = 0)


##### Initialize arrays to store the results
THdays = np.zeros((rl, cl, lg, len(pcts)), dtype='int16')  # Run lengths of consecutive above-threshold timesteps (0 = unused slot)
THevents = np.zeros((rl, cl, lg, len(pcts), len(dt)), dtype='int16')  # Run lengths that meet each duration threshold (0 = unused slot)

##### Start iteration to identify events
for p in range(len(pcts)):  # for each threshold value
    br = data>histpcts[p,:,:]  # Make boolean array where data is bigger than threshold

    # for every lat/lon combination
    for r,c in it.product(range(rl),range(cl)): 
        if br[:,r,c].any():  # Skip timeseries with no exceedances (e.g. all-NaN cells, where every comparison is False)
            a = [sum(1 for _ in group) for key, group in it.groupby(br[:,r,c]) if key]  # Lengths of the consecutive runs above the threshold

            # Assign to new array
            THdays[r,c,0:len(a),p] = a  # Consecutive threshold days
            THdays[r,c,len(a):,p] = 0  # int16 cannot hold NaN, so mark the unused slots with 0

            # Now cycle through and identify events 
            # (consecutive values) longer than a certain duration (dt)
            for d in range(len(dt)):
                b = THdays[r,c,THdays[r,c,:,p]>=dt[d],p]
                THevents[r,c,0:len(b),p,d] = b
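
One possible way to drop the per-gridsquare loop is to find the start and end of every run for the whole grid at once. The sketch below is an assumption about how that could look, not a tested replacement for the code above: grid_run_lengths is a hypothetical helper, and the small random array only stands in for the real data. It returns a flat list of events (row, column, length) rather than filling a fixed-size array of length lg.

```python
import numpy as np

def grid_run_lengths(above):
    """above: boolean array (time, lat, lon).
    Returns (rows, cols, lengths), one entry per run of consecutive True values."""
    nt, nlat, nlon = above.shape
    # Put time last and flatten the grid, so each row is one gridsquare's timeseries
    cells = np.moveaxis(above, 0, -1).reshape(nlat * nlon, nt)
    # Pad each series with a 0 at both ends so every run has a start and an end
    padded = np.zeros((nlat * nlon, nt + 2), dtype=np.int8)
    padded[:, 1:-1] = cells
    edges = np.diff(padded, axis=1)
    # np.nonzero returns indices in row-major order, so the k-th start and the
    # k-th end always belong to the same run of the same gridsquare
    cell_idx, starts = np.nonzero(edges == 1)
    _, ends = np.nonzero(edges == -1)
    rows, cols = np.divmod(cell_idx, nlon)
    return rows, cols, ends - starts

# Small stand-in for the real (time, lat, lon) data
rng = np.random.default_rng(0)
data = rng.random((200, 4, 5))
above = data > np.percentile(data, 95, axis=0)  # per-gridsquare 95th percentile
rows, cols, lengths = grid_run_lengths(above)
keep = lengths >= 2                             # events lasting at least 2 timesteps
long_rows, long_cols, long_lengths = rows[keep], cols[keep], lengths[keep]
```

The trade-off is memory: the padded int8 copy is about as large as the boolean mask itself, so for very long records it may be worth processing the grid in latitude bands.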

1 Answer

#1


Have you tried numba? It can give you a large speedup when you are using simple loops over numpy arrays. All you need to do is put your code inside a function and decorate it with @jit. That's all!

from numba import jit

@jit
def myfun(inputs):
    ## crazy nested loops here
    ...

Of course, the more information you give numba, the better the speedup you will get. You can find more information here: http://numba.pydata.org/numba-doc/0.34.0/user/overview.html
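
As a rough illustration of what such a function could look like for this problem, here is a minimal sketch that counts, for every gridsquare, the events lasting at least min_len timesteps. The function name count_events and its arguments are assumptions made for the example. It uses numba's njit decorator and falls back to plain Python if numba is not installed, so the sketch still runs (just slowly). No speedup figure is claimed here; that depends on the data and the machine.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without numba (no speedup in that case)
    def njit(func):
        return func

@njit
def count_events(above, min_len):
    """Count runs of consecutive True values of at least min_len (>= 1)
    in each gridsquare of a boolean (time, lat, lon) array."""
    nt, nlat, nlon = above.shape
    counts = np.zeros((nlat, nlon), dtype=np.int64)
    for r in range(nlat):
        for c in range(nlon):
            run = 0
            for t in range(nt):
                if above[t, r, c]:
                    run += 1
                else:
                    if run >= min_len:
                        counts[r, c] += 1
                    run = 0
            if run >= min_len:  # a run that reaches the end of the series
                counts[r, c] += 1
    return counts

# Small stand-in for the real (time, lat, lon) data
rng = np.random.default_rng(0)
data = rng.random((200, 4, 5))
above = data > np.percentile(data, 95, axis=0)
n_events = count_events(above, 2)  # events of at least 2 timesteps per gridsquare
```

The loops are written explicitly on purpose: numba compiles this style of code well, whereas constructs such as itertools.groupby are not something it can compile.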
