Python 3.2 - NumPy 1.9 genfromtxt

Time: 2023-01-26 21:27:47

I am trying to read a .csv file that contains several data formats. I am using Python 3.2 and NumPy 1.9, and I read the data with the numpy genfromtxt function. I was hoping to convert the data as I read it, so that it is stored appropriately instead of being processed later; for this I pass converter functions through the converters option.

Using multiple converter functions seems to cause a problem. The code, the input, and the output are listed below. As you can see, the first line of output comes from a different column of the input file than the others.

Has anyone used this feature before? Is there a bug in my code somewhere?

CODE:

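import numpy as np
from datetime import datetime

# With Python 3 and NumPy 1.9, each converter receives the raw field as bytes.
converterfunc_time = lambda x: datetime.strptime(x.decode('UTF-8'), '%m/%d/%Y %I:%M:%S %p')

def converterfunc_lat(x):
    # Debug output: the raw bytes, then the decoded string.
    print(x)
    print(x.decode('UTF-8'))
    #return float(x.decode('utf-8').split('N')[1])

def converterfunc_san(x):
    #print(x)
    return x.decode('UTF-8')


class input_file_processing():
    def __init__(self):
        # Converter keys are column numbers in the file (the opening brace was missing).
        self.input_data = np.genfromtxt(
            'filename', skip_header=1, dtype=None,
            usecols=(0, 1, 6, 7, 8, 9, 10, 13),
            names="Date,SAN,LatDeg,LatMin,LonDeg,LonMin,Beam,EsNo",
            converters={0: converterfunc_time, 1: converterfunc_san, 6: converterfunc_lat},
            delimiter=',')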

**INPUT**

input, file, 1
4/2/2015 2:13:44 PM,DSN001000557867,03-01-01,0010155818,0,0,N33,00.546,W118,00.638,3,11,1,104,102,82,6,18,2048,4039587
4/2/2015 2:13:55 PM,DSN001000861511,03-01-02,0010416164,0,0,N33,00.883,W118,00.208,3,11,1,106,102,88,6,18,2048,2792940
4/2/2015 2:14:44 PM,DSN001000871692,03-01-04,0010408734,0,0,N33,00.876,W118,00.110,3,11,1,105,102,80,6,18,2048,312623
4/2/2015 2:14:52 PM,DSN001000864906,03-01-05,0010055143,0,0,N33,08.000,W118,03.000,3,11,1,107,99,83,6,18,2048,3056425
4/2/2015 2:15:00 PM,DSN001000838651,03-01-06,0010265541,0,0,N33,09.749,W118,00.317,3,11,1,100,110,74,6,14,2048,3737937
4/2/2015 2:15:08 PM,DSN001000609313,03-01-07,0010152885,0,0,N33,05.854,W118,04.107,3,11,1,94,95,62,6,14,2048,8221318
4/2/2015 2:15:19 PM,DSS31967278,03-01-08,0010350817,0,0,N33,04.551,W118,02.359,3,11,1,127,105,77,6,21,2048,21157710
4/2/2015 2:16:08 PM,DSN001000822728,03-01-10,0010051377,0,0,N33,00.899,W118,00.132,3,11,1,116,95,61,6,19,2048,3526254

OUTPUT

b'03-01-01'
03-01-01
b'N33'
N33
b'N33'
N33
b'N33'
N33
b'N33'
N33
b'N33'

Thanks

1 Solution

#1


I'm not entirely sure what is going on. But this script runs:

import numpy as np
from datetime import datetime

txt = b"""input, file, 1
4/2/2015 2:13:44 PM,DSN001000557867,03-01-01,0010155818,0,0,N33,00.546,W118,00.638,3,11,1,104,102,82,6,18,2048,4039587
4/2/2015 2:13:55 PM,DSN001000861511,03-01-02,0010416164,0,0,N34,00.883,W118,00.208,3,11,1,106,102,88,6,18,2048,2792940
4/2/2015 2:14:44 PM,DSN001000871692,03-01-04,0010408734,0,0,N35,00.876,W118,00.110,3,11,1,105,102,80,6,18,2048,312623
4/2/2015 2:14:52 PM,DSN001000864906,03-01-05,0010055143,0,0,N36,08.000,W118,03.000,3,11,1,107,99,83,6,18,2048,3056425
4/2/2015 2:15:00 PM,DSN001000838651,03-01-06,0010265541,0,0,N33,09.749,W118,00.317,3,11,1,100,110,74,6,14,2048,3737937
4/2/2015 2:15:08 PM,DSN001000609313,03-01-07,0010152885,0,0,N33,05.854,W118,04.107,3,11,1,94,95,62,6,14,2048,8221318
"""
txt = txt.splitlines()
#txt = txt[1:]
txt = txt[:3]
converterfunc_time = lambda x : (datetime.strptime(x.decode('UTF-8'),'%m/%d/%Y %I:%M:%S %p'))
def converterfunc_lat(x):
    print('lat ',x, x.decode('UTF-8'))
    x1 = x.decode('utf-8').split('N')
    if len(x1)>1:
        x1 = float(x1[1])
        print('float',x1)
        return x1
    else:
        print('error')
        return "error"
def converterfunc_san(x):
    #print(x)
    return x.decode('UTF-8')

data = np.genfromtxt(txt, skip_header=1,
                    dtype=None,
                    usecols=(0,1,6,7,8,9,10,13),
                    names="Date,SAN,LatDeg,LatMin,LonDeg,LonMin,Beam,EsNo",
                    delimiter=',')
print(data)
print()
input_data=np.genfromtxt(txt,
            skip_header=1,
            dtype='O,a20,f',
            usecols=(0,1,6,), #(0,1,6,7,8,9,10,13),
            names="Date,SAN,LatDeg,LatMin,LonDeg,LonMin,Beam,EsNo",
            converters={0:converterfunc_time,
                        1:converterfunc_san,
                        6:converterfunc_lat},
            delimiter=',')
print(input_data)

and produces

1552:~/mypy$ python3 stack30269235.py 
[ (b'4/2/2015 2:13:44 PM', b'DSN001000557867', b'N33', 0.546, b'W118', 0.638, 3, 104)
 (b'4/2/2015 2:13:55 PM', b'DSN001000861511', b'N34', 0.883, b'W118', 0.208, 3, 106)]

lat  b'03-01-01' 03-01-01
error
lat  b'N33' N33
float 33.0
lat  b'N34' N34
float 34.0
[(datetime.datetime(2015, 4, 2, 14, 13, 44), b'DSN001000557867', 33.0)
 (datetime.datetime(2015, 4, 2, 14, 13, 55), b'DSN001000861511', 34.0)]

I've had to fill in some pieces that were missing in your question.

I've added an explicit dtype to make sure I was getting the string and float columns.

And I modified the lat converter so it does not choke on the '03-01-01' input. ...
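
A more tolerant version of the lat converter might look like the sketch below. The name `lat_degrees`, the treatment of an `S` prefix as negative, and the choice of NaN as the fallback value are all assumptions made for illustration; none of them come from the original script.

```python
import math


def lat_degrees(raw):
    """Convert a field such as b'N33' to 33.0; return NaN for anything else.

    Hypothetical helper: accepts bytes or str so it works whether or not
    the caller has already decoded the field.
    """
    text = raw.decode('utf-8') if isinstance(raw, bytes) else raw
    text = text.strip()
    # Expect a hemisphere letter followed by a number, e.g. 'N33'.
    if len(text) > 1 and text[0] in 'NS':
        try:
            value = float(text[1:])
        except ValueError:
            return float('nan')
        # Assumption: southern latitudes are stored as negative degrees.
        return value if text[0] == 'N' else -value
    # Unexpected input such as '03-01-01' ends up here.
    return float('nan')


good = lat_degrees(b'N33')
bad = lat_degrees(b'03-01-01')
print(good, math.isnan(bad))
```

Returning NaN rather than the string "error" keeps every value in the column a float, which should fit a float dtype more comfortably.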

genfromtxt makes some sort of test run of your converters:

    # Find the value to test:
    if len(first_line):
        testing_value = first_values[i]
    else:
        testing_value = None
    converters[i].update(conv, locked=True,
                         testing_value=testing_value,
                         default=filling_values[i],
                         missing_values=missing_values[i],)
    uc_update.append((i, conv))

Looks like it is taking the first data line:

4/2/2015 2:13:44 PM,DSN001000557867,03-01-01,0010155818,0,0,N33

splitting it on the delimiter, and using the 3rd string, 03-01-01, as the test value. That is, instead of column 6 of the file, it is using the position of 6 within your usecols parameter (index 2). It seems to have problems matching up the usecols, the converter ids, the names, and maybe the dtype.
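
The mismatch can be reproduced with plain Python, without calling genfromtxt at all. This is only an illustration of the indexing described above; the variable names are mine, and the first data line is shortened to the first ten fields.

```python
# First data line from the question, truncated to the first ten fields.
first_line = ("4/2/2015 2:13:44 PM,DSN001000557867,03-01-01,0010155818,"
              "0,0,N33,00.546,W118,00.638")
first_values = first_line.split(",")

usecols = (0, 1, 6, 7, 8, 9, 10, 13)
file_col = 6  # the key used in the converters dict

# Position of column 6 inside usecols, which is what the test run appears to use.
pos = usecols.index(file_col)

expected = first_values[file_col]  # what the converter should be tested on
observed = first_values[pos]       # what the converter was actually handed

print(pos, expected, observed)
```

Here `expected` is the latitude field and `observed` is the 03-01-01 field, which matches the first line of the debug output in the question.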

The purpose of this test value is to determine the dtype for the column. This is needed in the dtype=None case. I don't know how the test value is used if you specify the dtype. Evidently the test run still happens.
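
To show why the choice of test value matters, here is a rough sketch of type guessing from a single sample. It is not NumPy's implementation; `guess_kind` and `strict_lat` are hypothetical names, and the labels it returns are simplifications.

```python
def guess_kind(converter, testing_value):
    """Run a converter on one sample and label the kind of value it returns.

    Hypothetical helper, loosely imitating the idea of probing a converter
    with a test value to decide on a column type.
    """
    try:
        result = converter(testing_value)
    except Exception:
        # The converter could not handle the sample at all.
        return "unknown"
    if isinstance(result, bool):
        return "bool"
    if isinstance(result, int):
        return "int"
    if isinstance(result, float):
        return "float"
    if isinstance(result, (str, bytes)):
        return "string"
    return "object"


def strict_lat(raw):
    # Like the commented-out return in the question: assumes an 'N33' style field.
    return float(raw.decode('utf-8')[1:])


right_sample = guess_kind(strict_lat, b'N33')
wrong_sample = guess_kind(strict_lat, b'03-01-01')
print(right_sample, wrong_sample)
```

With the correct sample the converter yields a float. With the sample taken from the wrong column the converter raises, so no sensible type can be inferred from it.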

In tests where I am not skipping columns, it has no problem matching converters and test values.
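
If skipping columns is required, one possible workaround is to bypass genfromtxt and apply the converters yourself, keyed by the column number in the file. The sketch below swaps in the standard library csv module. `read_rows`, `SAMPLE`, `CONVERTERS` and `USECOLS` are hypothetical names, and the sample rows are shortened from the question's input.

```python
import csv
import io
from datetime import datetime

# Shortened sample: header line plus two data rows from the question.
SAMPLE = """input, file, 1
4/2/2015 2:13:44 PM,DSN001000557867,03-01-01,0010155818,0,0,N33,00.546,W118,00.638,3,11,1,104
4/2/2015 2:13:55 PM,DSN001000861511,03-01-02,0010416164,0,0,N34,00.883,W118,00.208,3,11,1,106
"""

# Converters keyed by the column index in the file, not by position in USECOLS.
CONVERTERS = {
    0: lambda s: datetime.strptime(s, '%m/%d/%Y %I:%M:%S %p'),
    6: lambda s: float(s[1:]),  # 'N33' -> 33.0
    7: float,
}
USECOLS = (0, 1, 6, 7)


def read_rows(handle, usecols, converters, skip_header=1):
    """Return one tuple per data row, holding the converted selected columns."""
    reader = csv.reader(handle)
    for _ in range(skip_header):
        next(reader, None)
    rows = []
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        # Columns without a converter are kept as plain strings.
        rows.append(tuple(converters.get(col, str)(fields[col])
                          for col in usecols))
    return rows


rows = read_rows(io.StringIO(SAMPLE), USECOLS, CONVERTERS)
for row in rows:
    print(row)
```

Because each converter is looked up by file column, there is no ambiguity between column 6 of the file and position 6 in the selection. The resulting list of tuples can be turned into an array afterwards if needed.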
