How can I use regex in Python to find strings of characters in alphabetical order?

Time: 2021-03-31 19:24:59

So I have a challenge I'm working on - find the longest string of alphabetical characters in a string. For example, "abcghiijkyxz" should result in "ghiijk" (Yes the i is doubled).

I've been doing quite a bit with loops to solve the problem - iterating over the entire string, then for each character, starting a second loop using lower and ord. No help needed writing that loop.

However, it was suggested to me that Regex would be great for this sort of thing. My regex is weak (I know how to grab a static set, my look-forwards knowledge extends to knowing they exist). How would I write a Regex to look forward, and check future characters for being next in alphabetical order? Or is the suggestion to use Regex not practical for this type of thing?

Edit: The general consensus seems to be that Regex is indeed terrible for this type of thing.

6 solutions

#1


9  

Just to demonstrate why regex is not practical for this sort of thing, here is a regex that matches ghiijk in your example abcghiijkyxz. Note that it also matches abc, y, x and z, since each is technically a candidate for the longest run of characters in alphabetical order. Unfortunately, you can't determine which match is the longest with regex alone, but this does give you all the candidates. Please note that this regex works in PCRE and will not work with Python's re module! Also, note that Python's regex library does not currently support (*ACCEPT). Although I haven't tested it, the pyre2 package (a Cython-based Python wrapper for Google's RE2) claims to support the (*ACCEPT) control verb, so this may currently be possible in Python.

((?:a+(?(?!b)(*ACCEPT))|b+(?(?!c)(*ACCEPT))|c+(?(?!d)(*ACCEPT))|d+(?(?!e)(*ACCEPT))|e+(?(?!f)(*ACCEPT))|f+(?(?!g)(*ACCEPT))|g+(?(?!h)(*ACCEPT))|h+(?(?!i)(*ACCEPT))|i+(?(?!j)(*ACCEPT))|j+(?(?!k)(*ACCEPT))|k+(?(?!l)(*ACCEPT))|l+(?(?!m)(*ACCEPT))|m+(?(?!n)(*ACCEPT))|n+(?(?!o)(*ACCEPT))|o+(?(?!p)(*ACCEPT))|p+(?(?!q)(*ACCEPT))|q+(?(?!r)(*ACCEPT))|r+(?(?!s)(*ACCEPT))|s+(?(?!t)(*ACCEPT))|t+(?(?!u)(*ACCEPT))|u+(?(?!v)(*ACCEPT))|v+(?(?!w)(*ACCEPT))|w+(?(?!x)(*ACCEPT))|x+(?(?!y)(*ACCEPT))|y+(?(?!z)(*ACCEPT))|z+(?(?!$)(*ACCEPT)))+)

Results in:

abc
ghiijk
y
x
z

Explanation of a single option, i.e. a+(?(?!b)(*ACCEPT)):

  • a+ matches the literal character a one or more times. This catches runs of the same character, such as aa.
  • (?(?!b)(*ACCEPT)) is a conditional (an if clause) that evaluates the condition below.
    • (?!b) is the condition of the if clause: a negative lookahead ensuring that what follows is not b. If it is not b, we want the following control verb to take effect.
    • (*ACCEPT) If the condition above is met, we accept the current match. This control verb causes the regex to end successfully, skipping the rest of the pattern. Since this token is inside a capturing group, only that capturing group ends successfully at that position, while the parent pattern continues to execute.

So what happens if the condition is not met? Well, that means that (?!b) evaluated to false, so the following character is, in fact, b, and we allow the matching (or rather, capturing in this instance) to continue. Note that the entire pattern is wrapped in (?:)+, which allows us to match consecutive options until the (*ACCEPT) control verb or the end of the line is reached.

The only exception in this whole regular expression is z. Since it's the last letter of the English alphabet (which I presume is the target of this question), we don't care what comes after it, so we can simply use z+(?(?!$)(*ACCEPT)), which ensures nothing matches after z. If you instead want to match za (circular alphabetical order matching; I don't know if this is the proper terminology, but it sounds right to me), you can use z+(?(?!a)(*ACCEPT)) as the last alternative.
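
For comparison, the same list of candidate runs can probably be obtained with Python's built-in re module by replacing the conditional and (*ACCEPT) with plain lookaheads. The sketch below is not the pattern from this answer; the function names and the pattern layout are my own assumptions, and it only handles lowercase a-z.

```python
import re
import string

def consecutive_run_pattern():
    # Hypothetical helper: consume a letter only while the next character
    # is the same letter or its successor ('z' may only repeat).
    letters = string.ascii_lowercase
    parts = []
    for i, ch in enumerate(letters):
        allowed = ch + letters[i + 1] if i + 1 < len(letters) else ch
        parts.append("%s(?=[%s])" % (ch, allowed))
    # The trailing [a-z] consumes the last letter of each run.
    return re.compile("(?:" + "|".join(parts) + ")*[a-z]")

RUNS = consecutive_run_pattern()

def all_runs(text):
    # Every maximal run, in order of appearance.
    return RUNS.findall(text)

def longest_run(text):
    # The regex only lists candidates; picking the longest is plain Python.
    runs = all_runs(text)
    return max(runs, key=len) if runs else ""

print(all_runs("abcghiijkyxz"))     # ['abc', 'ghiijk', 'y', 'x', 'z']
print(longest_run("abcghiijkyxz"))  # ghiijk
```

As with the PCRE version, the regex itself cannot say which run is longest, so the final max() call is still needed.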

#2


2  

As mentioned, regex is not the best tool for this. Since you are interested in a continuous sequence, you can do this with a single for loop:

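def LNDS(s):
    start = 0      # start index of the best run found so far
    cur_len = 1    # length of the run ending at the current character
    max_len = 1    # length of the best run found so far
    for i in range(1, len(s)):
        # the run continues if this character repeats or directly follows the previous one
        if ord(s[i]) in (ord(s[i-1]), ord(s[i-1])+1):
            cur_len += 1
        else:
            if cur_len > max_len:
                max_len = cur_len
                start = i - cur_len
            cur_len = 1
    # the final run is never closed inside the loop, so check it here
    if cur_len > max_len:
        max_len = cur_len
        start = len(s) - cur_len
    return s[start:start+max_len]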

>>> LNDS('abcghiijkyxz')
'ghiijk'

We keep a running count of how many characters in a row have each been equal to, or one letter after, the character before them. When such a run ends, we compare it to the longest run we saw previously, updating our "best seen so far" if it is longer.
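
If you also want to see every run rather than just the winner, the same single pass can be written as a generator. This is only a sketch of that idea; the names runs and longest_run are mine, not part of the answer above.

```python
def runs(s):
    """Yield each maximal run in which every character repeats or directly follows the previous one."""
    start = 0
    for i in range(1, len(s) + 1):
        # A run ends at the end of the string or when the step is neither 0 nor +1.
        if i == len(s) or ord(s[i]) - ord(s[i - 1]) not in (0, 1):
            yield s[start:i]
            start = i

def longest_run(s):
    # max() returns the first run of maximal length; default covers the empty string.
    return max(runs(s), key=len, default="")

print(list(runs("abcghiijkyxz")))   # ['abc', 'ghiijk', 'y', 'x', 'z']
print(longest_run("abcghiijkyxz"))  # ghiijk
```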

#3


2  

Generate all the regexes of the form ^a+b+c+$ from the substrings of the alphabet (longest to shortest). Then match each of those regexes against all the substrings (longest to shortest) of "abcghiijkyxz" and stop at the first match.

import re

def all_substrings(s):
    # yield every substring of s, longest first
    n = len(s)
    for i in range(n, 0, -1):
        for j in range(n - i + 1):
            yield s[j:j + i]

def longest_alphabetical_substring(s):
    for t in all_substrings("abcdefghijklmnopqrstuvwxyz"):
        # e.g. t = "abc" gives the pattern ^a+b+c+$
        r = re.compile("^" + "".join(map(lambda x: x + "+", t)) + "$")
        for u in all_substrings(s):
            if r.match(u):
                return u

print(longest_alphabetical_substring("abcghiijkyxz"))

That prints "ghiijk".

#4


1  

Regex: char+ meaning a+b+c+...

Details:

  • + Matches between one and unlimited times

Python code:

import re

def LNDS(text):
    array = []

    for y in range(97, 122):  # starting letter: a - y
        st = r"%s+" % chr(y)
        for x in range(y+1, 123):  # following letters, up to z
            st += r"%s+" % chr(x)
            match = re.findall(st, text)

            if match:
                array.append(max(match, key=len))
            else:
                break

        if array:
            array = [max(array, key=len)]

    return array

Output:

>>> print(LNDS('abababababab abc'))
['abc']
>>> print(LNDS('abcghiijkyxz'))
['ghiijk']

For the string abcghiijkyxz, the regex patterns tried are:

a+b+                    i+j+k+l+
a+b+c+                  j+k+
a+b+c+d+                j+k+l+
b+c+                    k+l+
b+c+d+                  l+m+
c+d+                    m+n+
d+e+                    n+o+
e+f+                    o+p+
f+g+                    p+q+
g+h+                    q+r+
g+h+i+                  r+s+
g+h+i+j+                s+t+
g+h+i+j+k+              t+u+
g+h+i+j+k+l+            u+v+
h+i+                    v+w+
h+i+j+                  w+x+
h+i+j+k+                x+y+
h+i+j+k+l+              y+z+
i+j+
i+j+k+

#5


0  

To actually "solve" the problem, you could use

string = 'abcxyzghiijkl'

def sort_longest(string):
    stack = []   # (char, code point) pairs of the run being built
    result = []  # all completed runs

    for char in string:
        c = ord(char)
        if stack and c in (stack[-1][1], stack[-1][1] + 1):
            # same letter as the item before (a tuple) or the next one:
            # extend the current run
            stack.append((char, c))
        else:
            # append the stack to the overall result (if it holds anything)
            # and reinitialize the stack
            if stack:
                result.append(stack)
            stack = [(char, c)]

    # the last run is still on the stack when the loop ends
    if stack:
        result.append(stack)

    return ["".join(item[0] for item in sublst)
            for sublst in sorted(result, key=len, reverse=True)]

print(sort_longest(string))

Which yields

['ghiijkl', 'abc', 'xyz']

in this example.

The idea is to loop over the string and keep track of a stack variable, which grows as long as each character's ord() value equals, or is one more than, that of the character before it.

#6


0  

It's really easy with regexps!

(Using trailing contexts, i.e. lookaheads, here)

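import re

# For each letter x, optionally consume x, but only if the character after it
# is the next letter; the final [a-z] consumes the last letter of the run.
rexp=re.compile(
    "".join(['(?:(?=.' + chr(ord(x)+1) + ')'+ x +')?'
            for x in "abcdefghijklmnopqrstuvwxyz"])
    +'[a-z]')

a = 'bcabhhjabjjbckjkjabckkjdefghiklmn90'

re.findall(rexp, a)

#Answer: ['bc', 'ab', 'h', 'h', 'j', 'ab', 'j', 'j', 'bc', 'k', 'jk', 'j', 'abc', 'k', 'k', 'j', 'defghi', 'klmn']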
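
findall only lists the runs; to answer the original question you still have to pick the longest one. A minimal sketch, assuming a slightly trimmed variant of the pattern above (no optional group for z, since z has no successor) and a helper name, longest_match, that I made up:

```python
import re
import string

# Variant of the pattern above, rebuilt so this snippet runs on its own.
# Assumption: the optional group for 'z' is left out because nothing follows z.
rexp = re.compile(
    "".join("(?:(?=." + chr(ord(x) + 1) + ")" + x + ")?"
            for x in string.ascii_lowercase[:-1])
    + "[a-z]")

def longest_match(text):
    # Hypothetical helper: first match of maximal length, or "" if none.
    matches = rexp.findall(text)
    return max(matches, key=len) if matches else ""

a = 'bcabhhjabjjbckjkjabckkjdefghiklmn90'
print(longest_match(a))  # defghi
```

Note that this pattern treats a doubled letter as the end of a run, so for the question's abcghiijkyxz it splits ghiijk into ghi and ijk and reports abc as the first longest match.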
