How can I use a regex in Ruby to split a string into an array of the words it contains?

I am trying to create a regex pattern that will split a string into an array of words based on many different patterns and conventions. The rules are as follows:

It must split the string on all dashes, spaces, underscores, and periods.
When several of these characters appear together, it must split only once (so 'the--.quick' must split to ['the', 'quick'] and not ['the', '', '', 'quick']).
It must split the string on new capital letters, while keeping that letter with its corresponding word ('theQuickBrown' splits to ['the', 'quick', 'brown']).
It must group multiple uppercase letters in a row together ('LETS_GO' must split to ['lets', 'go'], not ['l', 'e', 't', 's', 'g', 'o']).
It must use only lowercase letters in the split array.

If it is working properly, the following should be true:

"theQuick--brown_fox JumpsOver___the.lazy  DOG".split_words == 
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

So far I have gotten almost there. The only issue is that it splits on every capital, so "DOG".split_words is ["d", "o", "g"] and not ["dog"].

I also use a combination of regex and maps/filters on the split array to get to the solution. Bonus points if you can tell me how to get rid of that and use only regex.

Here's what I have so far:

class String
  def split_words 
    split(/[_,\-, ,.]|(?=[A-Z]+)/).
    map(&:downcase).
    reject(&:empty?)
  end 
end

When called on the string from the test above, this returns:

["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "d", "o", "g"]

How can I update this method to meet all of the above specs?

3 Solutions

#1

You may use a matching approach to extract chunks of 2 or more uppercase letters, or a single letter followed by 0 or more lowercase letters:

s.scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)

See the Ruby demo and the Rubular demo.

The regex matches:

\p{Lu}{2,} - 2 or more uppercase letters
| - or
\p{L} - any letter
\p{Ll}* - 0 or more lowercase letters.

With map(&:downcase), the items you get with .scan() are turned to lower case.
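
To tie this back to the question, here is a minimal sketch of how the scan expression above could be wrapped as the `split_words` method the question asks for. Reopening `String` simply mirrors the question's own code; the sample strings are the ones from the question.

```ruby
# Sketch: the scan-based approach wrapped as String#split_words,
# following the method name and monkey-patching style used in the question.
class String
  def split_words
    # Either a run of 2+ uppercase letters, or one letter plus any
    # lowercase letters that follow it; everything else is skipped.
    scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)
  end
end

words = "theQuick--brown_fox JumpsOver___the.lazy  DOG".split_words
p words
# => ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
```

Because scan only collects letters, no separate reject(&:empty?) step is needed, and any character that is not a letter (digits, punctuation) is dropped from the result.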

#2

You can slightly change the regex so that it doesn't split on every capital, but only before a sequence of letters that starts with a capital and continues in lowercase. This just involves putting a [a-z]+ after the [A-Z]+ in the lookahead:

string = "theQuick--brown_fox JumpsOver___the.lazy  DOG"
regex = /[_,\-, ,.]|(?=[A-Z]+[a-z]+)/
string.split(regex).reject(&:empty?)
# => ["the", "Quick", "brown", "fox", "Jumps", "Over", "the", "lazy", "DOG"]
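
The result above still contains capitals, so the lowercase requirement from the question needs one more step. A minimal sketch, assuming the same regex and simply appending the map(&:downcase) call from the question's original method:

```ruby
# Sketch: same regex as above, with the downcase step from the
# question's original method added back in.
string = "theQuick--brown_fox JumpsOver___the.lazy  DOG"
regex = /[_,\-, ,.]|(?=[A-Z]+[a-z]+)/

words = string.split(regex).reject(&:empty?).map(&:downcase)
p words
# => ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
```

The reject(&:empty?) call is still needed here, because each separator character is matched on its own and a run such as "--" leaves empty strings behind.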

#3

r = /
    [- _.]+      # match one or more combinations of dashes, spaces,
                 # underscores and periods
    |            # or
    (?<=\p{Ll})  # match a lower case letter in a positive lookbehind
    (?=\p{Lu})   # match an upper case letter in a positive lookahead
    /x           # free-spacing regex definition mode

str = "theQuick--brown_dog, JumpsOver___the.--lazy   FOX for $5"

str.split(r).map(&:downcase)
  #=> ["the", "quick", "brown", "dog,", "jumps", "over", "the", "lazy",
  #    "fox", "for", "$5"]

If the string is to be broken on spaces and all punctuation characters, replace [- _.]+ with [ [:punct:]]+. Search for "[[:punct:]]" at Regexp for the reference.
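
A minimal sketch of that substitution, written on one line without the free-spacing flag. The sample string is a shortened variant of the one above (an assumption made for illustration), chosen so that it contains only letters, spaces, and punctuation:

```ruby
# Sketch: split on runs of spaces/punctuation, or between a lowercase
# letter and the uppercase letter that follows it.
r = /[ [:punct:]]+|(?<=\p{Ll})(?=\p{Lu})/

str = "theQuick--brown_dog, JumpsOver___the.--lazy   FOX"

words = str.split(r).map(&:downcase)
p words
# => ["the", "quick", "brown", "dog", "jumps", "over", "the", "lazy", "fox"]
```

Unlike the first version, the comma after "dog" is now treated as a separator instead of staying attached to the word.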
