Converting a user input string to a raw string literal to construct a regular expression

Time: 2022-03-09 12:34:33

I know there are some posts about converting a string to a raw string literal, but none of them helps in my situation.

My problem is:

Say, for example, I want to know whether the pattern "\section" is in the text "abcd\sectiondefghi". Of course, I can do this:

import re

motif = r"\\section"
txt = r"abcd\sectiondefghi"
pattern = re.compile(motif)
print(pattern.findall(txt))

That gives me what I want. However, each time I want to find a new pattern in a new text, I have to edit the code, which is painful. Therefore, I want to write something more flexible, like this (test.py):

import re
import sys

motif = sys.argv[1]
txt = sys.argv[2]
pattern = re.compile(motif)
print(pattern.findall(txt))

Then, I want to run it in terminal like this:

python test.py \\section abcd\sectiondefghi

However, that does not work (and I would rather not have to type \\\\section).

So, is there any way of converting my user input (either from the terminal or from a file) to a Python raw string? Or is there a better way of compiling a regular expression pattern from user input?

Thank you very much.

3 Solutions

#1


25  

Use re.escape() to make sure input text is treated as literal text in a regular expression:

pattern = re.compile(re.escape(motif))

Demo:

>>> import re
>>> motif = r"\section"
>>> txt = r"abcd\sectiondefghi"
>>> pattern = re.compile(re.escape(motif))
>>> print(pattern.findall(txt))
['\\section']

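re.escape() puts a backslash in front of every non-alphanumeric character. (That is the behavior in Python 2 and in Python 3 before 3.7, which the output below reflects; from Python 3.7 on, only characters that can have special meaning in a regular expression are escaped, so the '!' would be left alone.)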

>>> re.escape(motif)
'\\\\section'
>>> re.escape('\n [hello world!]')
'\\\n\\ \\[hello\\ world\\!\\]'
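Applied to the test.py script from the question, this could look roughly like the following. It is only a minimal sketch: the helper name find_literal is made up for illustration, and it assumes the search term should always be matched literally rather than as a regular expression.

```python
import re
import sys


def find_literal(motif, txt):
    """Return every literal occurrence of motif in txt."""
    # re.escape() neutralizes regex metacharacters, including the backslash.
    pattern = re.compile(re.escape(motif))
    return pattern.findall(txt)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Intended usage: python test.py '\section' 'abcd\sectiondefghi'
    print(find_literal(sys.argv[1], sys.argv[2]))
```

With this, the pattern typed on the command line needs no regex-level escaping; only the shell's own quoting still applies.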

#2


3  

One way to do this is to use an argument parser, such as optparse or argparse.

Your code would look something like this:

import re
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-s", "--string", dest="string",
                  help="The string to parse")
parser.add_option("-r", "--regexp", dest="regexp",
                  help="The regular expression")
parser.add_option("-a", "--action", dest="action", default='findall',
                  help="The action to perform with the regexp")

(options, args) = parser.parse_args()

print(getattr(re, options.action)(re.escape(options.regexp), options.string))

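An example of using it (note that this result assumes the re.escape() call is removed, so that the pattern is interpreted as a regular expression):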

> code.py -s "this is a string" -r "this is a (\S+)"
['string']

Using your example (here re.escape() is what makes the backslash match literally):

> code.py -s "abcd\sectiondefghi" -r "\section"
['\\section']
# Remember: this is the repr of a Python list containing a string, so the extra \ is expected.
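Since the answer mentions argparse as well, here is a sketch of the same idea with that module. The --literal flag and the function names are assumptions introduced for this example; the flag lets the caller choose between regex matching and literal matching.

```python
import argparse
import re


def build_parser():
    parser = argparse.ArgumentParser(description="Search a string with a pattern")
    parser.add_argument("-s", "--string", required=True, help="The string to search")
    parser.add_argument("-r", "--regexp", required=True, help="The pattern to look for")
    # Hypothetical flag: treat the pattern as literal text instead of a regex.
    parser.add_argument("-l", "--literal", action="store_true",
                        help="Escape the pattern with re.escape() first")
    return parser


def run(argv):
    options = build_parser().parse_args(argv)
    pattern = re.escape(options.regexp) if options.literal else options.regexp
    return re.findall(pattern, options.string)


# Regex mode: the parentheses and \S+ keep their regex meaning.
print(run(["-s", "this is a string", "-r", r"this is a (\S+)"]))
# Literal mode: the backslash is matched as an ordinary character.
print(run(["-s", r"abcd\sectiondefghi", "-r", r"\section", "--literal"]))
```

In a real script, run(sys.argv[1:]) would take the place of the hard-coded argument lists.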

#3


0  

So just to be clear, is the thing you search for ("\section" in your example) supposed to be a regular expression or a literal string? If the latter, the re module isn't really the right tool for the task; given a search string needle and a target string haystack, you can do:

# is it in there
needle in haystack

# how many copies are there
n = haystack.count(needle)

# where is it
ix = haystack.find(needle)

All of these are more efficient than the regexp-based version.

re.escape is still useful if you need to insert a literal fragment into a larger regexp at runtime, but if you end up doing re.compile(re.escape(needle)), there are, in most cases, better tools for the task.
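To illustrate the first case, here is a sketch that splices an escaped literal fragment into a larger pattern. The helper name and the LaTeX-style pattern are assumptions made up for this example.

```python
import re


def command_with_argument(name):
    """Match a LaTeX-style command call and capture its braced argument."""
    # Only the user-supplied name is escaped; the rest is real regex syntax.
    return re.compile(r"\\" + re.escape(name) + r"\{([^}]*)\}")


pattern = command_with_argument("section")
print(pattern.findall(r"\section{Intro} text \section{Methods}"))
```

Here plain string methods would not be enough, because the surrounding pattern still has to capture the text between the braces.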

EDIT: I'm beginning to suspect that the real issue here is the shell's escaping rules, which have nothing to do with Python or raw strings. That is, if you type:

python test.py \\section abcd\sectiondefghi

into a Unix-style shell, the "\\section" part is converted to "\section" by the shell before Python sees it (and "abcd\sectiondefghi" loses its backslash altogether). The simplest way to fix that is to tell the shell to skip unescaping, which you can do by putting each argument inside single quotes:

python test.py '\\section' 'abcd\sectiondefghi'

Compare and contrast:

$ python -c "import sys; print(','.join(sys.argv))" test.py \\section abcd\sectiondefghi
-c,test.py,\section,abcdsectiondefghi

$ python -c "import sys; print(','.join(sys.argv))" test.py '\\section' 'abcd\sectiondefghi'
-c,test.py,\\section,abcd\sectiondefghi
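The same effect can be reproduced from within Python with shlex.split, which in its default POSIX mode follows rules close to those of a Unix shell. This is only an approximation of what a real shell does, offered as a sketch; the variable names are made up for the example.

```python
import shlex

unquoted = r"test.py \\section abcd\sectiondefghi"
quoted = r"test.py '\\section' 'abcd\sectiondefghi'"

# Outside quotes, a backslash escapes the next character and is then dropped.
print(",".join(shlex.split(unquoted)))
# Inside single quotes, every character is kept exactly as typed.
print(",".join(shlex.split(quoted)))
```

The two printed lines should mirror the argument lists shown above, minus the leading -c that the interpreter adds.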

(I am explicitly printing a joined string here so that repr does not add even more confusion.)
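For reference, a small sketch of the difference between printing a string and printing its repr; the variable name arg is arbitrary.

```python
arg = r"\section"   # what Python receives: a single backslash plus "section"

print(arg)          # the string itself
print(repr(arg))    # a Python literal, in which the backslash is doubled
print([arg])        # lists display their items with repr, hence the doubling
print(len(arg))     # 8 characters, so there really is only one backslash
```

This is why the list outputs earlier in the thread show \\section even though the matched text contains a single backslash.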
