Regex to remove doubled double quotes from CSV

I have an Excel sheet with a lot of data in one column, stored in the form of Python dictionaries that came from a SQL database. I don't have access to the original database, and I can't import the CSV back into SQL with the LOAD DATA LOCAL INFILE command because the keys/values on each row of the CSV are not in the same order. When I export the Excel sheet to CSV I get:

"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"

What is the best way to remove the " before and after the curly brackets as well as the extra " around the keys/values?

I also need to leave the integers, which have no quotes around them, untouched.

I then want to load this into Python with the json module so that I can print specific keys, but the doubled double quotes make the import fail. Ultimately I need the data saved in a file that looks like:

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}

Any help is most appreciated!

5 Solutions

#1

If the input file is just as shown, and small enough, you can load the whole file into memory, make the substitutions, and then save it. IMHO, you don't need a regex to do this. The most readable code that does this is:

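with open(filename) as f:
    text = f.read()
text = text.replace('""', '"')  # collapse each doubled quote into one
text = text.replace('"{', '{')  # drop the quote before each opening brace
text = text.replace('}"', '}')  # drop the quote after each closing brace
with open(filename, "w") as f:
    f.write(text)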

I tested it with the sample input and it produces:

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}

Which is exactly what you want.

If you want, you can also condense the code and write

# "if" is a reserved word in Python, so the file handles need other names
with open(inputFilename) as src:
    with open(outputFilename, "w") as dst:
        dst.write(src.read().replace('""','"').replace('"{','{').replace('}"','}'))

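but I think the first one is much clearer. Both apply exactly the same substitutions; the only difference is that the first overwrites the input file while the second writes to a separate output file.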
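To check that the cleaned text really is loadable with the json module, the three replacements can be wrapped in a small function and tried on the sample rows. This is only a sketch: the function name `unescape_csv_json` and the in-memory `sample` string are made up for illustration, and it assumes that no value contains the sequences `"{` or `}"` itself.

```python
import json


def unescape_csv_json(text: str) -> str:
    """Undo the CSV quoting around JSON objects with plain replacements."""
    return text.replace('""', '"').replace('"{', '{').replace('}"', '}')


# The two sample rows from the question, as exported by Excel.
sample = (
    '"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"\n'
    '"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"\n'
)

cleaned = unescape_csv_json(sample)

# Each cleaned line is now a standalone JSON object.
records = [json.loads(line) for line in cleaned.splitlines()]
print(records[0]["first_name"], records[0]["age"])
```

Because the integers never had quotes around them, they come back from `json.loads` as Python ints.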

#2

Easy:

`text = re.sub(r'"(?!")', '', text)`

Given the input file TEST.TXT:

"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"

The script:

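import re

# Read the whole file; the with block closes it afterwards
with open("TEST.TXT", "r") as f:
    text_in = f.read()
# Remove every double quote that is not followed by another double quote
text_out = re.sub(r'"(?!")', '', text_in)
print(text_out)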

produces the following output:

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}
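The pattern `"(?!")` matches a double quote only when the next character is not another double quote, so in every run of quotes only the last one is deleted. As a rough illustration of the end goal in the question (printing specific keys), the sketch below applies it to one sample row held in a string and loads the result with `json.loads`; the variable names are placeholders. One caveat under this approach: an empty string value would be exported as four quotes in a row, which the pattern reduces to three, so such rows would need different handling.

```python
import json
import re

# One sample row from the question, as exported by Excel.
raw = '"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"\n'

# Delete each double quote that is not immediately followed by another one.
cleaned = re.sub(r'"(?!")', '', raw)

# json.loads tolerates the trailing newline.
record = json.loads(cleaned)
print(record["last_name"], record["age"])
```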

#3

This should do it:

import re

# Stream the file line by line, dropping each quote not followed by another quote
with open('old.csv') as old, open('new.csv', 'w') as new:
    new.writelines(re.sub(r'"(?!")', '', line) for line in old)

#4

I think you are overthinking the problem. Why not just replace the data?

lines = []
with open('foo.txt') as f:
    for line in f:
        # Collapse doubled quotes, then drop the quotes wrapping each object
        lines.append(line.replace('""','"').replace('"{','{').replace('}"','}'))
s = ''.join(lines)

print(s)  # or save it to a file

It generates:

{"first_name":"John","last_name":"Smith","age":30}
{"first_name":"Tim","last_name":"Johnson","age":34}

Use a list to store the intermediate lines and then call .join on it for better performance, as explained in "Good way to append to a string".

#5

You can actually use the csv module and a regex to do this:

st='''\
"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"
"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"\
'''

import csv, re

data=[]
reader=csv.reader(st, dialect='excel')
for line in reader:
    data.extend(line)

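# Put quotes back around every word (this also quotes the numbers)
s=re.sub(r'(\w+)',r'"\1"',''.join(data))
# Start a new line after each {...} object
s=re.sub(r'({[^}]+})',r'\1\n',s).strip()
print(s)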

This prints the following (note that the integers end up quoted as strings):

{"first_name":"John","last_name":"Smith","age":"30"}
{"first_name":"Tim","last_name":"Johnson","age":"34"}
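Since the output above quotes the ages, here is an alternative sketch that keeps the integers intact. It lets the csv module undo the Excel quoting and hands each field straight to the json module, with no regex at all. It assumes each row holds exactly one JSON object in its first column; the `io.StringIO` wrapper only stands in for a real file opened with `newline=''`.

```python
import csv
import io
import json

st = (
    '"{""first_name"":""John"",""last_name"":""Smith"",""age"":30}"\n'
    '"{""first_name"":""Tim"",""last_name"":""Johnson"",""age"":34}"\n'
)

records = []
# csv.reader turns each doubled quote back into a single quote and strips
# the quotes that wrap the whole field.
for row in csv.reader(io.StringIO(st), dialect="excel"):
    if row:  # skip blank lines
        records.append(json.loads(row[0]))

# Write each record back out as compact JSON, one object per line.
for record in records:
    print(json.dumps(record, separators=(",", ":")))
```

Each entry of `records` is a regular dict, so specific keys can be read with, for example, `records[0]["age"]`.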
