What is a good regex to match words with optional spaces?

Time: 2021-08-17 23:41:36

Digits are optional, and are only allowed at the end of a word.

Spaces are optional, and are only allowed in the middle of a word.

I am just trying to match the possible month names in a few languages, say English and Vietnamese.

For example, the following are valid matches:

'June' 'tháng 6'

But the following are not valid, because of the trailing or leading space: 'June ' ' June'

These are my test cases: https://regex101.com/r/pZ0mN3/2.

As you can see, I came up with ^\S[\S ]+\S$, which kind of works, but I wonder if there's a better way to do it.
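
The four examples above can be checked locally as well. The following is only a minimal sketch in Python rather than in the regex101 demo; the helper name is_valid_month is made up for illustration. It uses re.fullmatch so that the whole string has to be consumed, which keeps Python's $ (which can also match just before a trailing newline) from letting 'June\n' through with this pattern.

```python
import re

# The pattern from the question. The anchors are kept as written;
# re.fullmatch additionally requires the match to span the whole string.
QUESTION_PATTERN = re.compile(r"^\S[\S ]+\S$")


def is_valid_month(text):
    """Hypothetical helper: True if the question's pattern accepts text."""
    return QUESTION_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["June", "tháng 6", "June ", " June"]:
        print(repr(sample), is_valid_month(sample))
```

Running it prints True for 'June' and 'tháng 6' and False for the two padded strings.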

1 solution

#1


2  

To match a string with no leading and trailing spaces in the JavaScript regex flavor, you can use several options:

  • Require both the first and the last character to be non-whitespace with \S (= [^\s]). This can be done with, say, ^\S[\S\s]*\S$. This regex requires at least 2 characters in the string. Your regex requires at least 3 characters since you used +, and its [\S ] class allows only a regular space in the middle, not other whitespace such as tabs or Unicode spaces.

  • You may use grouping combined with quantifiers that allow zero-length matches. See ^\S(?:\s*\S+)*$ (the multiline demo replaces \s so that matches cannot span lines). The \S at the beginning matches one non-whitespace char. The non-capturing group that follows is quantified with * (zero or more occurrences), and each occurrence matches 0+ whitespaces followed by 1+ non-whitespace chars. This is a good expression for flavors like RE2 that do not support lookarounds but do support quantified groups.

  • You may use a lookahead to require the first and last characters to be non-whitespace: ^(?=[\S\s]*\S$)\S[\S\s]*$. Here (?=[\S\s]*\S$) requires the last char to be non-whitespace, and the \S after the lookahead requires the first char to be non-whitespace. [\S\s]* matches 0+ of any characters. This matches 1-char strings, but not empty strings.

  • If your regex should also match an empty string, use 2 negative lookaheads: ^(?!\s)(?![\S\s]*\s$)[\S\s]*$. The (?!\s) lookahead fails the match if there is leading whitespace, (?![\S\s]*\s$) does the same for trailing whitespace, and [\S\s]* matches 0+ of any characters. If lookarounds are not supported, use ^(?:\S(?: *\S+)*)?$, although it is much less efficient.
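
To see how the options above differ on short and empty inputs, here is a rough comparison sketch in Python (not JavaScript, so it is an approximation: the two flavors disagree on a few rare whitespace characters). The dictionary keys, the sample list, and the helper names are assumptions made up for this illustration. Because Python's $ can also match just before a trailing newline, the sketch swaps each $ for \Z, Python's strict end-of-string anchor.

```python
import re

# Patterns quoted in the list above, written with the original $ anchors.
PATTERNS = {
    "two_char_min": r"^\S[\S\s]*\S$",
    "quantified_group": r"^\S(?:\s*\S+)*$",
    "positive_lookahead": r"^(?=[\S\s]*\S$)\S[\S\s]*$",
    "negative_lookaheads": r"^(?!\s)(?![\S\s]*\s$)[\S\s]*$",
    "no_lookaround_empty_ok": r"^(?:\S(?: *\S+)*)?$",
}

# Illustrative inputs: empty, one char, the question's examples, padded ones.
SAMPLES = ["", "J", "June", "tháng 6", "June ", " June"]


def to_python(js_pattern):
    """Replace $ with \\Z so the end anchor is strict in Python."""
    return js_pattern.replace("$", r"\Z")


def matching_samples(js_pattern):
    """Return the samples accepted by the given pattern."""
    compiled = re.compile(to_python(js_pattern))
    return [s for s in SAMPLES if compiled.search(s)]


if __name__ == "__main__":
    for name, pattern in PATTERNS.items():
        print(name, matching_samples(pattern))
```

Only the last two patterns accept the empty string, and only the first one rejects the one-character string; none of them accept the padded inputs.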

If you do not need to match arbitrary chars (including line breaks) between the non-whitespace chars, you may revert [\s\S] to your [\S ]. In PCRE, horizontal whitespace can be matched with \h. In .NET and other flavors that support Unicode properties, you can use [\t\p{Zs}] to match any horizontal whitespace. In JS, [^\S\r\n\f\v\u2028\u2029] can be used for that purpose.
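
As an illustration of the horizontal-whitespace idea, the sketch below reuses the JS character class in Python, whose re module has neither \h nor \p{Zs}. The combined pattern and the helper name are assumptions for this example, not something taken from the answer, and Python's \s covers a slightly different set of rare characters than JavaScript's.

```python
import re

# Character class quoted above for JS: any whitespace except line breaks,
# form feed, and vertical tab.
HORIZONTAL_WS = r"[^\S\r\n\f\v\u2028\u2029]"

# Hypothetical combination: the quantified-group pattern, allowing only
# horizontal whitespace between the runs of non-whitespace characters.
# \Z is Python's strict end-of-string anchor.
NO_OUTER_WS = re.compile(r"^\S(?:" + HORIZONTAL_WS + r"*\S+)*\Z")


def is_trimmed_single_line(text):
    """True if text has no leading/trailing whitespace and no line breaks."""
    return NO_OUTER_WS.search(text) is not None


if __name__ == "__main__":
    for sample in ["tháng 6", "tháng\t6", "tháng\u00a06", "tháng\n6", "June "]:
        print(repr(sample), is_trimmed_single_line(sample))
```

A tab or a no-break space between the words is accepted, while a line break in the middle or a trailing space is rejected.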

Note that some regex flavors do not support non-capturing groups. In that case, replace every (?: with ( in the above patterns.
