How can I use a regex in Ruby to split a string into an array of the words it contains?

Time: 2023-02-10 19:28:57

I am trying to create a regex pattern that will split a string into an array of words based on many different patterns and conventions. The rules are as follows:

  1. It must split the string on all dashes, spaces, underscores, and periods.
  2. When several of these characters appear together, it must split only once (so 'the--.quick' must split to ['the', 'quick'] and not ['the', '', '', 'quick']).
  3. It must split the string on new capital letters, while keeping that letter with its corresponding word ('theQuickBrown' splits to ['the', 'quick', 'brown']).
  4. It must group multiple uppercase letters in a row together ('LETS_GO' must split to ['lets', 'go'], not ['l', 'e', 't', 's', 'g', 'o']).
  5. It must use only lowercase letters in the split array.

If it is working properly, the following should be true

"theQuick--brown_fox JumpsOver___the.lazy  DOG".split_words == 
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

So far I have gotten almost there. The only issue is that it splits on every capital letter, so "DOG".split_words is ["d", "o", "g"] and not ["dog"].

I also use a combination of regex and maps/filters on the split array to get to the solution. Bonus points if you can tell me how to get rid of that and use only regex.

Here's what I have so far:

class String
  def split_words 
    split(/[_,\-, ,.]|(?=[A-Z]+)/).
    map(&:downcase).
    reject(&:empty?)
  end 
end

When called on the string from the test above, it returns:

["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "d", "o", "g"]

How can I update this method to meet all of the above specs?

3 answers

#1


4  

You may use a matching approach instead of splitting: extract either chunks of 2 or more uppercase letters, or a single letter followed by zero or more lowercase letters:

s.scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)

See the Ruby demo and the Rubular demo.

The regex matches:

  • \p{Lu}{2,} - 2 or more uppercase letters
  • | - or
  • \p{L} - any letter
  • \p{Ll}* - 0 or more lowercase letters.

With map(&:downcase), the items you get with .scan() are turned to lower case.

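As a minimal sketch, the scan-based pattern above can be wrapped in a helper and checked against the string from the question. The name split_words is used here as a plain method (an assumption for illustration) rather than a monkey-patch on String:

```ruby
# Hypothetical helper wrapping the scan-based approach from this answer.
def split_words(str)
  # Either a run of 2+ uppercase letters, or one letter followed by
  # zero or more lowercase letters; everything else is skipped.
  str.scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)
end

p split_words("theQuick--brown_fox JumpsOver___the.lazy  DOG")
# => ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
```

Two caveats with this sketch: digits and other non-letters are dropped entirely, and an uppercase run directly followed by a capitalized word is grouped greedily, so "HTMLParser" yields ["htmlp", "arser"].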

#2


5  

You can change the regex slightly so that it splits not before every capital letter, but before every sequence of letters that starts with a capital. This just involves putting [a-z]+ after the [A-Z]+ in the lookahead:

string = "theQuick--brown_fox JumpsOver___the.lazy  DOG"
regex = /[_,\-, ,.]|(?=[A-Z]+[a-z]+)/
string.split(regex).reject(&:empty?)
# => ["the", "Quick", "brown", "fox", "Jumps", "Over", "the", "lazy", "DOG"]
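
The result above still contains capitals, while the question asks for lowercase words. A possible variant, offered as a sketch rather than as part of the original answer, keeps the same lookahead and adds the downcase step. It also swaps the character class for [-_ .]+, since the commas inside [_,\-, ,.] are literal and make the original split on commas as well. The method name is hypothetical:

```ruby
# Hypothetical helper: the lookahead from this answer plus lowercasing.
def split_words_lookahead(str)
  str.split(/[-_ .]+|(?=[A-Z]+[a-z]+)/) # delimiter runs, or before a capitalized word
     .reject(&:empty?)                  # guard against empty pieces
     .map(&:downcase)
end

p split_words_lookahead("theQuick--brown_fox JumpsOver___the.lazy  DOG")
# => ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
```

Because the lookahead requires at least one lowercase letter after the capitals, an all-uppercase word such as "DOG" is left in one piece.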

#3


2  

r = /
    [- _.]+      # match one or more combinations of dashes, spaces,
                 # underscores and periods
    |            # or
    (?<=\p{Ll})  # match a lower case letter in a positive lookbehind
    (?=\p{Lu})   # match an upper case letter in a positive lookahead
    /x           # free-spacing regex definition mode

str = "theQuick--brown_dog, JumpsOver___the.--lazy   FOX for $5"

str.split(r).map(&:downcase)
  #=> ["the", "quick", "brown", "dog,", "jumps", "over", "the", "lazy",
  #    "fox", "for", "$5"]

If the string should be split on spaces and all punctuation characters, replace [- _.]+ with [ [:punct:]]+. Search for "[[:punct:]]" in the Regexp documentation for the reference.
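
The [ [:punct:]]+ variant could be sketched as follows. The regex is written on one line instead of in free-spacing mode so that the literal space in the character class is unambiguous. The constant and method names are assumptions for illustration:

```ruby
# Hypothetical names; splits on runs of spaces/punctuation, or at a
# lowercase-to-uppercase boundary (zero-width).
WORD_BREAK = /[ [:punct:]]+|(?<=\p{Ll})(?=\p{Lu})/

def split_on_punct(str)
  str.split(WORD_BREAK).map(&:downcase)
end

p split_on_punct("theQuick--brown_dog, JumpsOver___the.--lazy   FOX")
# => ["the", "quick", "brown", "dog", "jumps", "over", "the", "lazy", "fox"]
```

Unlike the [- _.]+ version, the comma after "dog" is now consumed as a delimiter. A string that begins with punctuation would produce a leading empty string, so a reject(&:empty?) step may be needed in that case.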
