Formatting phone numbers in a csv using pandas

Time: 2022-04-01 11:26:26

Python/pandas n00b. I have code that processes event data stored in csv files. Data from df["CONTACT PHONE NUMBER"] outputs the phone number as `5555551212.0`. Obviously, the ".0" is a problem, but it's added because it's an integer, I imagine?

Anyhoo, I decided that I should format the phone number for usability's sake.

The number comes from the csv file, unformatted. The number will always be ten digits: 5555551212, but I would like to display it as (555)555-1212.

import glob
import os
import pandas as pd
import sys

csvfiles = os.path.join(directory, '*.csv')
for csvfile in glob.glob(csvfiles):
    df = pd.read_csv(filename)
    #formatting the contact phone
    phone_nos = df["CONTACT PHONE NUMBER"]
    for phone_no in phone_nos:
        contactphone = "(%c%c%c)%c%c%c-%c%c%c%c" % tuple(map(ord,phone_no))

The last line gives me the following error: not enough arguments for format string

But maybe this isn't the pandas way of doing this. Since I'm iterating through an array, I also need to save the data in its existing column or rebuild that column after the phone numbers have been processed.

2 solutions

#1


3  

I think the problem is that the phone numbers are stored as float64, so, adding a few things will fix your inner loop:

In [75]:

df['Phone_no']
Out[75]:
0    5554443333
1    1114445555
Name: Phone_no, dtype: float64
In [76]:

for phone_no in df['Phone_no']:
    contactphone = "(%c%c%c)%c%c%c-%c%c%c%c" % tuple(map(ord,list(str(phone_no)[:10])))
    print(contactphone)
(555)444-3333
(111)444-5555

However, I think it is easier just to have the phone numbers as strings (@Andy_Hayden made a good point about missing values, so I made up the following dataset):

In [121]:

print df
     Phone_no   Name
0  5554443333   John
1  1114445555   Jane
2         NaN  Betty

[3 rows x 2 columns]
In [122]:

df.dtypes
Out[122]:
Phone_no    float64
Name         object
dtype: object
#In [123]: You don't need to convert the entire DataFrame, only the 'Phone_no' needs to be converted.
#
#df=df.astype('S4')
In [124]:

df['PhoneNumber']=df['Phone_no'].astype(str).apply(lambda x: '('+x[:3]+')'+x[3:6]+'-'+x[6:10])
In [125]:

print df
       Phone_no   Name    PhoneNumber
0  5554443333.0   John  (555)444-3333
1  1114445555.0   Jane  (111)444-5555
2           NaN  Betty         (nan)-

[3 rows x 3 columns]

In [134]:
import numpy as np
df['PhoneNumber']=df['Phone_no'].astype(str).apply(lambda x: np.where((len(x)>=10)&set(list(x)).issubset(list('.0123456789')),
                                                                      '('+x[:3]+')'+x[3:6]+'-'+x[6:10],
                                                                      'Phone number not in record'))
In [135]:

print df
     Phone_no   Name                 PhoneNumber
0  5554443333   John               (555)444-3333
1  1114445555   Jane               (111)444-5555
2         NaN  Betty  Phone number not in record

[3 rows x 3 columns]
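
As a variation on the `apply` approach above, here is a minimal sketch that handles the float-to-string conversion and the missing values in one plain function. The name `format_phone` is hypothetical, the data is the same made-up dataset, and returning `None` for anything that is not a 10-digit number is an assumption about what you want:

```python
import pandas as pd


def format_phone(value):
    """Hypothetical helper: turn a float-stored 10-digit number into (XXX)XXX-XXXX."""
    if pd.isna(value):
        return None  # missing value: nothing to format
    digits = str(int(value))  # int() drops the trailing ".0" of the float
    if len(digits) != 10:
        return None  # not a 10-digit number; leave it for manual inspection
    return "({}){}-{}".format(digits[:3], digits[3:6], digits[6:])


# Made-up data mirroring the example above
df = pd.DataFrame({
    "Phone_no": [5554443333.0, 1114445555.0, float("nan")],
    "Name": ["John", "Jane", "Betty"],
})
df["PhoneNumber"] = df["Phone_no"].apply(format_phone)
print(df)
```

Note that a number stored as a float has already lost any leading zero, so such a number would fail the length check here.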

#2


4  

I think phone numbers should be stored as a string.
When reading the csv you can ensure this column is read as a string:

pd.read_csv(filename, dtype={"CONTACT PHONE NUMBER": str})
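
To see what the `dtype` argument changes, here is a small sketch that reads the same CSV text twice. The CSV content is made up for illustration, and it is read from an in-memory string rather than a file:

```python
import io

import pandas as pd

# Made-up CSV content; the last row has no phone number
csv_text = (
    "Name,CONTACT PHONE NUMBER\n"
    "John,5554443333\n"
    "Jane,1114445555\n"
    "Betty,\n"
)
col = "CONTACT PHONE NUMBER"

# Default parsing: the missing value forces the numeric column to float64,
# which is where the trailing ".0" comes from
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default[col].dtype)

# With dtype=str the digits are kept as text and the missing value stays NaN
df_str = pd.read_csv(io.StringIO(csv_text), dtype={col: str})
print(df_str[col].tolist())
```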

You can use the string methods, naively adding:

In [11]: s = pd.Series(['5554443333', '1114445555', np.nan, '123'])  # df["CONTACT PHONE NUMBER"]

# phone_nos = '(' + s.str[:3] + ')' + s.str[3:6] + '-' + s.str[6:10]

Edit: as Noah answers in a related question, you can do this more directly/efficiently using str.replace:

In [12]: phone_nos = s.str.replace(r'^(\d{3})(\d{3})(\d{4})$', r'(\1)\2-\3', regex=True)  # regex=True is needed on pandas >= 2.0

In [13]: phone_nos
Out[13]:
0    (555)444-3333
1    (111)444-5555
2              NaN
3              123
dtype: object

But there is a problem here as you have a malformed number, not precisely 10 digits, so you could NaN those:

In [14]: s.str.contains(r'^\d{10}$')  # note: NaN is truthy
Out[14]:
0     True
1     True
2      NaN
3    False
dtype: object

In [15]: phone_nos.where(s.str.contains(r'^\d{10}$'))
Out[15]:
0    (555)444-3333
1    (111)444-5555
2              NaN
3              NaN
dtype: object
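
The question also asked about saving the result back into the existing column. A minimal sketch of that, assuming the column was read as strings and that malformed numbers should become NaN (the DataFrame here is made up, and `na=False` is used so the mask is a plain boolean Series):

```python
import numpy as np
import pandas as pd

col = "CONTACT PHONE NUMBER"
# Made-up data standing in for a csv read with dtype={col: str}
df = pd.DataFrame({col: ["5554443333", "1114445555", np.nan, "123"]})

# True only for values that are exactly 10 digits; NaN counts as invalid
is_valid = df[col].str.contains(r"^\d{10}$", na=False)

# Reformat with capture groups, then blank out everything that was not valid
formatted = df[col].str.replace(
    r"^(\d{3})(\d{3})(\d{4})$", r"(\1)\2-\3", regex=True
)
df[col] = formatted.where(is_valid)  # overwrite the existing column
print(df)
```

Assigning to `df[col]` replaces the whole column at once, so there is no need to loop over the rows and rebuild it.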

Now, you might like to inspect the bad formats you have (maybe you have to change your output to encompass them, e.g. if they included a country code):

In [16]: s[~s.str.contains(r'^\d{10}$').astype(bool)]
Out[16]:
3    123
dtype: object
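
The `.astype(bool)` trick above relies on NaN being truthy. A possibly more explicit sketch uses the `na` parameter of `str.contains` and separates missing values from malformed ones. The variable names are made up for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series(["5554443333", "1114445555", np.nan, "123"])

# na=False makes missing values count as "not a match",
# so the result is a plain boolean Series with no NaN in it
is_valid = s.str.contains(r"^\d{10}$", na=False)

# Present but not exactly 10 digits
malformed = s[s.notna() & ~is_valid]
# Not present at all
missing = s[s.isna()]

print(malformed)
print(len(missing))
```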
