Interpolate NaN values in a numpy array

Date: 2023-02-04 21:19:44

Is there a quick way of replacing all NaN values in a numpy array with (say) the linearly interpolated values?

For example,

[1 1 1 nan nan 2 2 nan 0]

would be converted into

[1 1 1 1.33 1.67 2 2 1 0]

8 solutions

#1


61  

Let's first define a simple helper function that makes it more straightforward to handle indices and logical indices of NaNs:

import numpy as np

def nan_helper(y):
    """Helper to handle indices and logical indices of NaNs.

    Input:
        - y, 1d numpy array with possible NaNs
    Output:
        - nans, logical indices of NaNs
        - index, a function, with signature indices= index(logical_indices),
          to convert logical indices of NaNs to 'equivalent' indices
    Example:
        >>> # linear interpolation of NaNs
        >>> nans, x= nan_helper(y)
        >>> y[nans]= np.interp(x(nans), x(~nans), y[~nans])
    """

    return np.isnan(y), lambda z: z.nonzero()[0]

nan_helper(.) can now be used like this:

>>> y = np.array([1, 1, 1, np.nan, np.nan, 2, 2, np.nan, 0])
>>>
>>> nans, x = nan_helper(y)
>>> y[nans] = np.interp(x(nans), x(~nans), y[~nans])
>>>
>>> print(y.round(2))
[ 1.    1.    1.    1.33  1.67  2.    2.    1.    0.  ]

---
Although it may at first seem like overkill to define a separate function just for something like this:

>>> nans, x= np.isnan(y), lambda z: z.nonzero()[0]

it will eventually pay dividends.

So whenever you work with NaN-related data, encapsulate all the NaN-related functionality you need in one or more dedicated helper functions. Your code base will be more coherent and readable, because it follows easily understood idioms.

Interpolation is indeed a nice context in which to see how NaN handling is done, but similar techniques are used in various other contexts as well.
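
One detail worth knowing about this approach: for query points outside the range of the valid samples, np.interp returns the first or last valid value by default, so leading and trailing NaNs are filled with the nearest valid value rather than extrapolated. A minimal sketch of that behaviour, where the wrapper name interpolate_nans is hypothetical and not part of the answer above:

```python
import numpy as np

def interpolate_nans(y):
    """Return a copy of the 1-D array y with NaNs linearly interpolated.

    Hypothetical wrapper around the nan_helper idiom; it works on a copy
    instead of modifying y in place.
    """
    y = np.asarray(y, dtype=float).copy()
    nans = np.isnan(y)
    idx = np.arange(y.size)
    # np.interp clamps to the first/last valid value outside the sample range.
    y[nans] = np.interp(idx[nans], idx[~nans], y[~nans])
    return y

filled = interpolate_nans([np.nan, 1.0, np.nan, 3.0, np.nan, np.nan])
print(filled)  # values: 1, 1, 2, 3, 3, 3
```

If the ends should be treated differently (left as NaN, or extrapolated), the edge positions have to be handled separately.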

#2


19  

I came up with this code:

import numpy as np
nan = np.nan

A = np.array([1, nan, nan, 2, 2, nan, 0])

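ok = ~np.isnan(A)                      # True where A holds a valid number
xp = ok.ravel().nonzero()[0]           # positions of the valid samples
fp = A[ok]                             # values of the valid samples
x  = np.isnan(A).ravel().nonzero()[0]  # positions of the NaNs

A[np.isnan(A)] = np.interp(x, xp, fp)

print(A)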

It prints

 [ 1.          1.33333333  1.66666667  2.          2.          1.          0.        ]

#3


8  

Just use numpy's logical indexing and its where function to apply a 1D interpolation.

import numpy as np
from scipy import interpolate

def fill_nan(A):
    '''
    interpolate to fill nan values
    '''
    inds = np.arange(A.shape[0])
    good = np.where(np.isfinite(A))
    f = interpolate.interp1d(inds[good], A[good],bounds_error=False)
    B = np.where(np.isfinite(A),A,f(inds))
    return B
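
Note that with bounds_error=False and the default fill_value, interp1d returns NaN outside the range of the valid samples, so leading and trailing NaNs stay NaN. For comparison, here is a numpy-only sketch that should behave the same way for 1-D input, using the left and right arguments of np.interp. The function name fill_nan_numpy is hypothetical.

```python
import numpy as np

def fill_nan_numpy(A):
    """Hypothetical numpy-only counterpart of fill_nan for 1-D input."""
    A = np.asarray(A, dtype=float)
    inds = np.arange(A.shape[0])
    good = np.isfinite(A)
    # left/right are the values returned outside the range of valid samples;
    # setting them to NaN keeps leading and trailing gaps unfilled.
    filled = np.interp(inds, inds[good], A[good], left=np.nan, right=np.nan)
    return np.where(good, A, filled)

out = fill_nan_numpy([np.nan, 1.0, np.nan, 3.0, np.nan])
print(out)  # interior gap becomes 2.0; both ends remain NaN
```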

#4


4  

It might be easier to change how the data is being generated in the first place, but if not:

bad_indexes = np.isnan(data)

Create a boolean array indicating where the NaNs are.

good_indexes = np.logical_not(bad_indexes)

Create a boolean array indicating where the good values are.

good_data = data[good_indexes]

A restricted version of the original data, excluding the NaNs.

interpolated = np.interp(bad_indexes.nonzero()[0], good_indexes.nonzero()[0], good_data)

Run all the bad indexes through interpolation.

data[bad_indexes] = interpolated

Replace the NaNs in the original data with the interpolated values.

#5


2  

Or, building on Winston's answer:

import numpy as np

def pad(data):
    # Fill the NaNs of a 1-D array by linear interpolation over the index.
    bad_indexes = np.isnan(data)
    good_indexes = np.logical_not(bad_indexes)
    good_data = data[good_indexes]
    interpolated = np.interp(bad_indexes.nonzero()[0], good_indexes.nonzero()[0], good_data)
    data[bad_indexes] = interpolated
    return data

A = np.array([[1, 20, 300],
              [np.nan, np.nan, np.nan],
              [3, 40, 500]])

# Apply pad to each column (axis 0).
A = np.apply_along_axis(pad, 0, A)
print(A)

Result

[[   1.   20.  300.]
 [   2.   30.  400.]
 [   3.   40.  500.]]
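
When this is applied column by column, a column that consists only of NaNs gives np.interp no sample points to work with. A minimal sketch of a guarded variant, assuming such columns should simply be left untouched; the name pad_safe is hypothetical:

```python
import numpy as np

def pad_safe(column):
    """Hypothetical variant of pad(): leaves an all-NaN column unchanged."""
    column = np.asarray(column, dtype=float).copy()
    bad = np.isnan(column)
    # Nothing to fill, or nothing to interpolate from: return as is.
    if not bad.any() or bad.all():
        return column
    idx = np.arange(column.size)
    column[bad] = np.interp(idx[bad], idx[~bad], column[~bad])
    return column

A = np.array([[1.0, np.nan, 300.0],
              [np.nan, np.nan, np.nan],
              [3.0, np.nan, 500.0]])

filled = np.apply_along_axis(pad_safe, 0, A)
print(filled)  # columns 0 and 2 are filled; column 1 stays all NaN
```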

#6


2  

For two-dimensional data, SciPy's griddata works fairly well for me:

>>> import numpy as np
>>> from scipy.interpolate import griddata
>>>
>>> # SETUP
>>> a = np.arange(25).reshape((5, 5)).astype(float)
>>> a
array([[  0.,   1.,   2.,   3.,   4.],
       [  5.,   6.,   7.,   8.,   9.],
       [ 10.,  11.,  12.,  13.,  14.],
       [ 15.,  16.,  17.,  18.,  19.],
       [ 20.,  21.,  22.,  23.,  24.]])
>>> a[np.random.randint(2, size=(5, 5)).astype(bool)] = np.nan
>>> a
array([[ nan,  nan,  nan,   3.,   4.],
       [ nan,   6.,   7.,  nan,  nan],
       [ 10.,  nan,  nan,  13.,  nan],
       [ 15.,  16.,  17.,  nan,  19.],
       [ nan,  nan,  22.,  23.,  nan]])
>>>
>>> # THE INTERPOLATION
>>> x, y = np.indices(a.shape)
>>> interp = np.array(a)
>>> interp[np.isnan(interp)] = griddata(
...     (x[~np.isnan(a)], y[~np.isnan(a)]), # points we know
...     a[~np.isnan(a)],                    # values we know
...     (x[np.isnan(a)], y[np.isnan(a)]))   # points to interpolate
>>> interp
array([[ nan,  nan,  nan,   3.,   4.],
       [ nan,   6.,   7.,   8.,   9.],
       [ 10.,  11.,  12.,  13.,  14.],
       [ 15.,  16.,  17.,  18.,  19.],
       [ nan,  nan,  22.,  23.,  nan]])

I am using it on 3D images, operating on 2D slices (4000 slices of 350x350). The whole operation still takes about an hour :/

#7


2  

Building on the answer by Bryan Woods, I modified his code to also convert lists consisting only of NaN to a list of zeros:

import numpy as np
from scipy.interpolate import interp1d

def fill_nan(A):
    '''
    interpolate to fill nan values
    '''
    inds = np.arange(A.shape[0])
    good = np.where(np.isfinite(A))
    # No valid samples at all: return zeros instead of interpolating
    if len(good[0]) == 0:
        return np.nan_to_num(A)
    f = interp1d(inds[good], A[good], bounds_error=False)
    B = np.where(np.isfinite(A), A, f(inds))
    return B

A simple addition; I hope it will be of use to someone.

#8


1  

I needed an approach that would also fill in NaNs at the start or end of the data, which the main answer does not appear to do.

The function I came up with uses a linear regression to fill in the NaNs. This overcomes my problem:

import numpy as np

def linearly_interpolate_nans(y):
    # Fit a linear regression to the non-nan y values

    # Create X matrix for linreg with an intercept and an index
    X = np.vstack((np.ones(len(y)), np.arange(len(y))))

    # Get the non-NaN values of X and y
    X_fit = X[:, ~np.isnan(y)]
    y_fit = y[~np.isnan(y)].reshape(-1, 1)

    # Estimate the coefficients of the linear regression
    beta = np.linalg.lstsq(X_fit.T, y_fit, rcond=None)[0]

    # Fill in all the nan values using the predicted coefficients
    y.flat[np.isnan(y)] = np.dot(X[:, np.isnan(y)].T, beta)
    return y

Here's an example use case:

# Make an array according to some linear function
y = np.arange(12) * 1.5 + 10.

# First and last value are NaN
y[0] = np.nan
y[-1] = np.nan

# 30% of other values are NaN
for i in range(len(y)):
    if np.random.rand() > 0.7:
        y[i] = np.nan

# NaNs are filled in!
print(y)
print(linearly_interpolate_nans(y))
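
Because the regression fits one global line through all valid samples, every NaN is filled from that line, so interior gaps are not interpolated between their immediate neighbours unless the data is itself linear. A sketch of an alternative, assuming you want piecewise-linear interpolation inside the data and straight-line extrapolation at the ends; the function name interp_with_linear_edges is hypothetical:

```python
import numpy as np

def interp_with_linear_edges(y):
    """Hypothetical helper: np.interp inside, linear extrapolation at the ends."""
    y = np.asarray(y, dtype=float).copy()
    idx = np.arange(y.size)
    good = ~np.isnan(y)
    xp, fp = idx[good], y[good]
    if xp.size < 2:
        raise ValueError("need at least two valid samples to extrapolate")
    # Fill every NaN; the ends are temporarily clamped to the nearest valid value.
    y[~good] = np.interp(idx[~good], xp, fp)
    # Overwrite the clamped ends with a line through the two nearest valid samples.
    left = idx < xp[0]
    right = idx > xp[-1]
    y[left] = fp[0] + (idx[left] - xp[0]) * (fp[1] - fp[0]) / (xp[1] - xp[0])
    y[right] = fp[-1] + (idx[right] - xp[-1]) * (fp[-1] - fp[-2]) / (xp[-1] - xp[-2])
    return y

y = np.arange(12) * 1.5 + 10.0
y[[0, 4, 5, -1]] = np.nan
result = interp_with_linear_edges(y)
print(result)
```

On this exactly linear test data the result matches the original line; on noisy data the extrapolated ends depend only on the two outermost valid samples, which may or may not be what you want.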
