How do I split a string in Perl, keeping the delimiters, and split between the delimiters?

My question is a little wordy so I'll try to explain with an example.

I have a file that's somewhat similar to XML that I need to parse, though not exactly. Elements in the file generally show up similar to XML format, like

<person><greeting>hello</greeting><goodbye>bye</goodbye></person>

I wanted to split up the file into individual sets of tags, so that one element would be

<greeting>hello</greeting>

and another would be

<goodbye>bye</goodbye>

Naturally, for an empty element, <person> and </person> will end up being their own elements. I'm completely OK with that because of how I want to parse the file as a whole.

The issue I'm running into is how best to split the whole file into an array, because there are no newlines at all in the file; it's written out as you see it. I tried doing it like this

my @array = split(/(><)/, $file);

but the issue is that it doesn't preserve the angle brackets as part of the associated tag; it splits them off instead. Is there a way for me to split the file between the > and < characters?

1 Solution

#1

I am not sure if this is the best solution, but to answer your question directly, you can split between the angle brackets using lookbehind and lookahead assertions.

my @array = split(/(?<=>)(?=<)/, $file);

The difference is that the assertions do not consume the >< part; they match only the zero-width position in between.
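To make the difference concrete, here is a minimal sketch that runs both patterns on the sample line from the question. The variable names `@captured` and `@pieces` are only illustrative, and the input is assumed to be exactly that one line.

```perl
use strict;
use warnings;

# Sample input copied from the question.
my $file = '<person><greeting>hello</greeting><goodbye>bye</goodbye></person>';

# Capturing group: "><" is consumed and returned as a separate field,
# so the brackets are stripped from the neighbouring tags.
my @captured = split /(><)/, $file;

# Lookbehind + lookahead: a zero-width match between ">" and "<",
# so nothing is consumed and every tag keeps its brackets.
my @pieces = split /(?<=>)(?=<)/, $file;

print "$_\n" for @pieces;
```

With the capturing group the first field is `<person` without its closing bracket, whereas the lookaround version yields `<person>`, `<greeting>hello</greeting>`, `<goodbye>bye</goodbye>` and `</person>`.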

Another idea would be to use a backreference to match the corresponding closing tag. Note that it matches the first closing tag with the same name, which is wrong when identical tags are nested. Something like this:

<([^>]*)>(.*?)</\1>

See it here on Regexr

You have two capturing groups in this regex. The first captures the tag name, which \1 then reuses to match the closing tag; in the second you will find the content of the tag.

Of course it will first match the "person" tag, but you will find the other tags in $2. You would have to apply the regex recursively to $2 until it no longer finds any matches.
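The recursion described above could be sketched roughly as follows. The helper name `leaf_elements` is hypothetical, and the sketch assumes well-formed input without attributes and without identically named nested tags (the caveat mentioned earlier).

```perl
use strict;
use warnings;

# Hypothetical helper: returns the innermost <tag>content</tag> elements.
sub leaf_elements {
    my ($text) = @_;
    my @leaves;
    # $1 is the tag name, $2 the content between the opening and closing tag.
    while ($text =~ m{<([^>]*)>(.*?)</\1>}g) {
        my ($name, $content) = ($1, $2);   # copy before recursing
        my @inner = leaf_elements($content);
        # No nested match means this element is a leaf; keep it whole.
        push @leaves, @inner ? @inner : "<$name>$content</$name>";
    }
    return @leaves;
}

my $file = '<person><greeting>hello</greeting><goodbye>bye</goodbye></person>';
my @leaves = leaf_elements($file);
print "$_\n" for @leaves;
```

Unlike the split approach, this drops the outer `<person>` and `</person>` wrappers and keeps only the innermost elements, so it may or may not suit how the file is parsed as a whole.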
