Regex for splitting CSV rows that contain multiple double quotes

Time: 2022-09-15 15:23:24

I have a CSV column containing text. Each row is enclosed in double quotes (").

Sample text in a row looks like this (note: the line breaks and the spaces before each line are intentional):

"Lorem ipsum dolor sit amet, 
 consectetur adipisicing elit, sed do eiusmod
 tempor incididunt ut labore et dolore magna 
 aliqua. Ut ""enim ad"" minim veniam,
 quis nostrud exercitation ullamco laboris nisi 
 ut aliquip ex ea commodo
 consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
 cillum dolore eu fugiat ""nulla pariatu"""
"ex ea commodo
 consequat. Duis aute irure ""dolor in"" reprehenderit 
 in voluptate velit esse
 cillum dolore eu fugiat nulla pariatur. 
 Excepteur sint occaecat cupidatat non
 proident, sunt in culpa qui officia deserunt 
 mollit anim id est laborum."

The above represents 2 consecutive rows.

I want to select, as separate groups, all the text contained between each first double quote " (starting a line) and each LAST double quote ".

As you can see, though, there are line breaks in the text, along with escaped double quotes "" which are part of the text that I need to select.

I came up with something like this:

(?s)(?!")[^\s](.+?)(?=")

but the multiple double quotes break my desired match.

I'm a real novice with regex, so I think maybe I'm missing something very basic. I don't know if it's relevant, but I'm using Sublime Text 3, so the regex flavor should be Python's, I think.

What can I do to achieve what I need?

2 Solutions

#1


4  

You can use the following regex:

"[^"]*(?:""[^"]*)*"

This regex matches a double-quoted string whose content consists of non-quote characters and pairs of consecutive double quotes.

How does it work? Here is a diagram from debuggex.com:

[Figure: debuggex.com diagram of the regex; the numbers in parentheses below refer to its nodes]

With the regex, we match:

  • " - (1) - a literal quote
  • [^"]* - (2, 3) - 0 or more characters other than a quote (yes, including a newline, since this is a negated character class); if there are none, the regex goes on to look for the final literal quote (6)
  • (?:""[^"]*)* - (4, 5) - 0 or more sequences of:
    • "" - (4) - two consecutive double quotes
    • [^"]* - (5) - 0 or more characters other than a quote
  • " - (6) - the final literal quote.

This works faster than "(?:[^"]|"")*" (although both yield the same results), because the former is processed linearly, with much less backtracking.
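To see the pattern in action outside an editor, here is a minimal Python sketch. The `sample` string is a shortened stand-in for the two rows in the question, and the `unquote` helper is a hypothetical addition for illustration, not part of the original answer.

```python
import re

# The pattern from the answer: an opening quote, runs of non-quote
# characters interleaved with escaped ("") quotes, then the closing quote.
UNROLLED = re.compile(r'"[^"]*(?:""[^"]*)*"')
# The alternation-based variant the answer compares it against.
ALTERNATION = re.compile(r'"(?:[^"]|"")*"')

# Shortened stand-in for the two rows in the question (assumed data).
sample = (
    '"Lorem ipsum,\n'
    ' Ut ""enim ad"" minim,\n'
    ' fugiat ""nulla pariatu"""\n'
    '"ex ea commodo\n'
    ' irure ""dolor in"" reprehenderit\n'
    ' est laborum."\n'
)

# Each match is one full row, including its surrounding quotes.
rows = UNROLLED.findall(sample)


def unquote(field: str) -> str:
    """Hypothetical helper: strip the outer quotes and collapse "" to "."""
    return field[1:-1].replace('""', '"')


values = [unquote(row) for row in rows]

for value in values:
    print(repr(value))
```

Both compiled patterns return the same two matches on this input; the difference the answer describes is in how much work the engine does, not in the result.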

#2


3  

If you are using Python, then you do not need regex; you can use the standard csv library directly, and doubled double quotes inside a single row are handled automatically. Example (for the CSV you posted above, saved as a.csv):

>>> import csv
>>> with open('a.csv', 'r', newline='') as f:
...     # newline='' leaves line breaks inside quoted fields to the csv module
...     reader = csv.reader(f)
...     for row in reader:
...         print(row)
...
['Lorem ipsum dolor sit amet, \n consectetur adipisicing elit, sed do eiusmod\n tempor incididunt ut labore et dolore magna \n aliqua. Ut "enim ad" minim veniam,\n quis nostrud exercitation ullamco laboris nisi \n ut aliquip ex ea commodo\n consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse\n cillum dolore eu fugiat "nulla pariatu"']
['ex ea commodo\n consequat. Duis aute irure "dolor in" reprehenderit \n in voluptate velit esse\n cillum dolore eu fugiat nulla pariatur. \n Excepteur sint occaecat cupidatat non\n proident, sunt in culpa qui officia deserunt \n mollit anim id est laborum.']

The csv module handles this correctly because " is the default quotechar, so anything between two " characters is considered part of a single field, even if it contains \n, spaces, etc.
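The same behaviour can be checked without a file on disk. The following is a small sketch that feeds an in-memory string to csv.reader; the `text` value is a shortened, assumed stand-in for a.csv, not the real file.

```python
import csv
import io

# Shortened stand-in for a.csv (assumed content): two records, each a
# single quoted field that spans two physical lines.
text = (
    '"Lorem ipsum,\n'
    ' Ut ""enim ad"" minim"\n'
    '"ex ea commodo\n'
    ' est laborum."\n'
)

physical_lines = text.splitlines()

# newline="" keeps the embedded line breaks exactly as written.
records = list(csv.reader(io.StringIO(text, newline="")))

print(len(physical_lines), "physical lines ->", len(records), "records")
for record in records:
    print(repr(record[0]))
```

The four physical lines collapse into two records because the reader keeps consuming lines until the quoted field is closed.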

Also, the csv module has another argument called doublequote, which the documentation describes as follows:

Controls how instances of quotechar appearing inside a field should themselves be quoted. When True, the character is doubled. When False, the escapechar is used as a prefix to the quotechar. It defaults to True.
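As a rough illustration of the two conventions, the sketch below reads the same logical value written both ways. The input strings and the backslash escapechar are assumptions chosen for the example.

```python
import csv
import io

# The same logical value, a "b" c, written with the two quoting conventions.
doubled = '"a ""b"" c"\n'      # quote doubled (doublequote=True, the default)
escaped = '"a \\"b\\" c"\n'    # quote prefixed with a backslash escapechar

rows_doubled = list(csv.reader(io.StringIO(doubled, newline="")))

rows_escaped = list(
    csv.reader(
        io.StringIO(escaped, newline=""),
        doublequote=False,
        escapechar="\\",
    )
)

print(rows_doubled)
print(rows_escaped)
```

For the CSV in the question, which doubles its quotes, the defaults are already the right choice, so no extra arguments are needed.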
