Limit grep regex multiline search results to the first match

I've got some text files partly containing XML data. E.g.:

<soap:Envelope xmlns:soap="..."><soap:Body><Data><SpecificTag>Some
multiline data
that I need to
extract.
</SpecificTag></Data></soap:Body></soap:Envelope>

I need to do a multi-line search and extract only the data within specific tags. I tried a few solutions found here, and the best result I got was with grep in Perl-regexp mode:

grep -Pzo '(?s)<SpecificTag>\K.*?(?=</SpecificTag>)' filename

But sometimes a file may contain two or more identical blocks that match the pattern. How can I change this regular expression to limit the grep output to the first occurrence? The -m argument does not work in Perl regex mode.

P.S.: Other working solutions are okay, but using XML-specific tools is not an option. The files are actually memory dumps filtered with the strings utility; they contain only fragments of SOAP transactions among other data. I have to use regex in this case.

2 solutions

#1

Here's something for sed:

/<SpecificTag>/,/<\/SpecificTag>/ {
  /<SpecificTag>/ {
    s/.*<SpecificTag>//
  }
  /<\/SpecificTag>/ {
    s/<\/SpecificTag>.*//
    p
    q
  }
  p
}

Put that in a file, say foo.sed, and use sed -n -f foo.sed filename.xml.

The way this works is as follows:

/<SpecificTag>/,/<\/SpecificTag>/ {

means that all this only happens for lines between <SpecificTag> and </SpecificTag>.

  /<SpecificTag>/ {
    s/.*<SpecificTag>//
  }

means that, within that range, the line containing <SpecificTag> has the tag and everything before it removed.

  /<\/SpecificTag>/ {
    s/<\/SpecificTag>.*//
    p
    q
  }

means that the line containing </SpecificTag> has the tag and everything after it removed, is printed, and then sed quits. This is how only the first match is extracted.

  p
}

means that all other lines within the first constraint (between the tags) are printed. This includes the rest of the first line after the substitution.
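
To see the walk-through above in action, here is a minimal sketch that runs the same script against a small test file. The sample data, the temporary directory, and the file name sample.txt are made up for illustration; only the sed script itself comes from the answer.

```shell
# Hypothetical sample input with two <SpecificTag> blocks (illustrative data only).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.txt" <<'EOF'
junk before <Data><SpecificTag>Some
multiline first data
</SpecificTag></Data> junk after
<SpecificTag>second block
</SpecificTag>
EOF

# The sed script from the answer, unchanged.
cat > "$tmpdir/foo.sed" <<'EOF'
/<SpecificTag>/,/<\/SpecificTag>/ {
  /<SpecificTag>/ {
    s/.*<SpecificTag>//
  }
  /<\/SpecificTag>/ {
    s/<\/SpecificTag>.*//
    p
    q
  }
  p
}
EOF

# Only the first block is extracted; sed quits before it reaches the second one.
# Note that $(...) strips the trailing empty line that the script prints.
result=$(sed -n -f "$tmpdir/foo.sed" "$tmpdir/sample.txt")
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

The text surrounding the tags on the first and last lines is dropped by the two substitutions, and the second block never shows up because of the q command.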

If you prefer to have it in one long command:

sed -n -e '/<SpecificTag>/,/<\/SpecificTag>/ { /<SpecificTag>/ { s/.*<SpecificTag>// }; /<\/SpecificTag>/ { s/<\/SpecificTag>.*//; p; q }; p }' filename.xml

...but of course that makes it harder to see what is happening, and sed scripts are already notoriously difficult to read.

Addendum: An addition you may want to consider is to make

  /<\/SpecificTag>/ {
    s/<\/SpecificTag>.*//
    p
    q
  }

into

  /<\/SpecificTag>/ {
    s/<\/SpecificTag>.*//
    /^$/ !p
    q
  }

or perhaps even with

    /^ *$/ !p

...in which case the remainder of the line containing </SpecificTag> will only be printed if it is not empty (first version) or contains more than spaces (second version). This prevents (possibly) superfluous line breaks at the end of the extracted text.
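
A quick way to check the effect of that guard is to count the output lines with and without it. This is only a sketch: the sample file and variable names are invented for the test, and the guard is written as /^ *$/!p (no space before the !), which should behave the same as the form shown above.

```shell
# Hypothetical sample where the closing tag starts its own line.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.txt" <<'EOF'
<Data><SpecificTag>Some
multiline data
</SpecificTag></Data>
EOF

# Closing block as in the original script: the emptied last line is printed too.
lines_plain=$(sed -n -e '/<SpecificTag>/,/<\/SpecificTag>/ {
  /<SpecificTag>/ {
    s/.*<SpecificTag>//
  }
  /<\/SpecificTag>/ {
    s/<\/SpecificTag>.*//
    p
    q
  }
  p
}' "$tmpdir/sample.txt" | wc -l | tr -d ' ')

# Closing block with the guard: a last line holding only spaces (or nothing) is skipped.
lines_guarded=$(sed -n -e '/<SpecificTag>/,/<\/SpecificTag>/ {
  /<SpecificTag>/ {
    s/.*<SpecificTag>//
  }
  /<\/SpecificTag>/ {
    s/<\/SpecificTag>.*//
    /^ *$/!p
    q
  }
  p
}' "$tmpdir/sample.txt" | wc -l | tr -d ' ')

echo "without guard: $lines_plain lines, with guard: $lines_guarded lines"

rm -rf "$tmpdir"
```

With this input the unguarded script emits three lines (the last one empty) and the guarded one emits two.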

#2

You need to use the \A anchor, which matches only at the very start of the input (with -z, the whole file is read as a single record).

grep -Pzo '(?s)\A.*?<SpecificTag>\K.*?(?=</SpecificTag>)' file

Example:

$ cat file
<soap:Envelope xmlns:soap="..."><soap:Body><Data><SpecificTag>Some
multiline first data
that I need to
extract.
</SpecificTag></Data></soap:Body></soap:Envelope>
<SpecificTag>Some
multiline second data
that I need to
extract.

$ grep -Pzo '(?s)\A.*?<SpecificTag>\K.*?(?=</SpecificTag>)' file
Some
multiline first data
that I need to
extract.

Or:

grep -Pzo '(?s)\A.*?<SpecificTag>\K(?:(?!</?SpecificTag>).)*(?=</SpecificTag>)' file
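
The second form replaces the lazy .*? with (?:(?!</?SpecificTag>).)*, which consumes one character at a time only while no opening or closing SpecificTag starts at that position. The sketch below shows where this seems to matter: a truncated fragment (an opening tag with no close) followed by a complete block. It assumes GNU grep built with -P support; the sample data and variable names are invented, and tr -d '\0' just strips the NUL terminator that -z adds to each match.

```shell
# Hypothetical dump: an unclosed fragment followed by a complete block.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.txt" <<'EOF'
<SpecificTag>truncated fragment
<SpecificTag>complete
data
</SpecificTag>
EOF

# Lazy version: starts at the first opening tag and runs on to the first closing tag,
# so the stray second opening tag ends up inside the extracted text.
lazy=$(grep -Pzo '(?s)\A.*?<SpecificTag>\K.*?(?=</SpecificTag>)' "$tmpdir/sample.txt" | tr -d '\0')

# Tempered version: the body may not cross another tag, so the match
# starts after the last opening tag before the closing one.
tempered=$(grep -Pzo '(?s)\A.*?<SpecificTag>\K(?:(?!</?SpecificTag>).)*(?=</SpecificTag>)' "$tmpdir/sample.txt" | tr -d '\0')

printf 'lazy:\n%s\n\ntempered:\n%s\n' "$lazy" "$tempered"

rm -rf "$tmpdir"
```

For memory dumps that contain cut-off fragments, the tempered form is probably the safer choice, at the cost of a slower match on large files.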
