How to split a String in Scala but keep the parts matching the regular expression?

Date: 2021-12-25 02:08:14

My question is the same as "Split string including regular expression match", but for Scala. Unfortunately, the JavaScript solution doesn't work in Scala.

I am parsing some text. Let's say I have the following string:

"hello wold <1> this is some random text <3> foo <12>"

I would like to get the following Seq: "hello wold" :: "<1>" :: "this is some random text" :: "<3>" :: "foo" :: "<12>".

Note that I am splitting the string whenever I encounter a <number> sequence, such as <1> or <12>.

2 Answers

#1


5  

val s = "hello wold <1> this is some random text <3> foo <12>"
s: java.lang.String = hello wold <1> this is some random text <3> foo <12>

s.split("""((?=<\d{1,3}>)|(?<=<\d{1,3}>))""")
res0: Array[java.lang.String] = Array(hello wold , <1>,  this is some random text , <3>,  foo , <12>)

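Did you actually try out your edit? Using \d+ doesn't work here: as the exception below shows, the look-behind group needs an obvious maximum length, which is why the pattern above uses \d{1,3}. See this question.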

s.split("""((?=<\d+>)|(?<=<\d+>))""")
java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 19
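
If the number of digits is not bounded, one way to sidestep the look-behind limitation is to skip split entirely and walk over the matches, collecting the text between them. The following is only a minimal sketch; the object name SplitKeepingMatches and the helper splitKeeping are hypothetical and not part of the original answer.

```scala
import scala.util.matching.Regex

object SplitKeepingMatches {
  // Hypothetical helper: returns the text between matches and the
  // matches themselves, in the order they appear in the input.
  def splitKeeping(input: String, delimiter: Regex): List[String] = {
    val parts = List.newBuilder[String]
    var last = 0 // end index of the previous match
    for (m <- delimiter.findAllMatchIn(input)) {
      // Keep the non-empty text that precedes this match.
      if (m.start > last) parts += input.substring(last, m.start)
      parts += m.matched
      last = m.end
    }
    // Keep any text left after the final match.
    if (last < input.length) parts += input.substring(last)
    parts.result()
  }
}
```

Calling SplitKeepingMatches.splitKeeping(s, "<\\d+>".r) keeps the surrounding whitespace, just like the split-based version. Appending .map(_.trim).filter(_.nonEmpty) would give the trimmed parts asked for in the question.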

#2


1  

Here's a quick but slightly hacky solution:

scala> val str = "hello wold <1> this is some random text <3> foo <12>"
str: String = hello wold <1> this is some random text <3> foo <12>

scala> str.replaceAll("<\\d+>", "_$0_").split("_")
res0: Array[String] = Array("hello wold ", <1>, " this is some random text ", <3>, " foo ", <12>)

Of course, the problem with this solution is that it gives the underscore character a special meaning. If an underscore occurs naturally in the original string, you'll get wrong results. So you either have to choose another magic character sequence that you are sure won't occur in the original string, or add some escaping and unescaping.
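
To illustrate the "other magic character sequence" idea, here is a rough sketch that uses the NUL character as the marker and refuses input that already contains it. The object name SentinelSplit and the choice of NUL are assumptions made for illustration, not part of the original answer.

```scala
object SentinelSplit {
  // Assumption: NUL never occurs in the text being parsed.
  private val Sentinel = "\u0000"

  // Hypothetical helper: same replaceAll/split trick as above, but with
  // a marker that is far less likely to appear in real input than "_".
  def splitKeeping(input: String): List[String] = {
    require(!input.contains(Sentinel), "input must not contain the sentinel")
    input
      .replaceAll("<\\d+>", Sentinel + "$0" + Sentinel)
      .split(Sentinel)
      .filter(_.nonEmpty) // drop empty pieces between adjacent markers
      .toList
  }
}
```

With this version an input such as "a_b <1> c" keeps its underscore, whereas the underscore-based split would break it into "a" and "b ".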

Another solution uses lookahead and lookbehind patterns, as described in this question.
