Removing XML comments in bash using regex

Date: 2023-02-01 00:25:42

I want to remove XML comments in bash using regex (awk, sed, grep...). I have looked at other questions about this, but they are missing something. Here's my XML:

<Table>
    <!--
   to be removed bla bla bla bla bla bl............

    removeee

    to be removeddddd
    -->

<row>
        <column name="example"  value="1" ></column>
    </row>
</Table>

I'm comparing two XML files, but I don't want the comparison to take the comments into account. This is what I do:

diff file1.xml file2.xml | sed '/<!--/,/-->/d'

but that only removes the line that starts with <!-- and the last line. It does not remove all the lines in between.

4 Answers

#1


5  

In the end, you're going to have to recommend to your client/friend/instructor that they need to install some kind of XML processor. xmlstarlet is a good command line tool, but there are any number (or at least some number greater than 2) of implementations of XSLT which can be compiled for any standard Unix, and in most cases also for Windows. You really cannot do much XML processing with regex-based tools, and whatever you do will be hard to read, harder to maintain, and likely to fail on corner cases, sometimes with disastrous consequences.

I haven't spent a lot of time polishing or reviewing the following little awk program. I think it will remove comments from compliant xml documents. Note that the following comment is not compliant:

<!-- XML comments cannot include -- so this comment is illegal -->

and it will not be treated correctly by my script.

The following is also illegal, but since I've seen it in the wild and it wasn't hard to deal with, I did so:

<!-------------- This comment is ill-formed but... -------------->

Here it is. No guarantees. I know that it's hard to read, and I wouldn't want to maintain it. It may well fail on arbitrary corner cases.

awk 'in_comment&&/-->/{sub(/([^-]|-[^-])*--+>/,"");in_comment=0}
     in_comment{next}
     {gsub(/<!--+([^-]|-[^-])*--+>/,"");
      in_comment=sub(/<!--+.*/,"");
      print}'
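
To make the program above easier to reuse when diffing, it can be wrapped in a shell function. This is only a sketch: the function name strip_xml_comments_awk is made up for illustration, and the awk body is the same program as above, unchanged. Note that a multi-line comment is replaced by blank (or whitespace-only) lines rather than removed entirely.

```shell
# Hypothetical helper name; the awk body is the program from this answer, unchanged.
# Reads the files given as arguments, or standard input if there are none.
strip_xml_comments_awk() {
  awk 'in_comment&&/-->/{sub(/([^-]|-[^-])*--+>/,"");in_comment=0}
       in_comment{next}
       {gsub(/<!--+([^-]|-[^-])*--+>/,"");
        in_comment=sub(/<!--+.*/,"");
        print}' "$@"
}

# Usage in bash (file names are placeholders):
#   diff <(strip_xml_comments_awk file1.xml) <(strip_xml_comments_awk file2.xml)
```

Stripping both files before diff, rather than filtering the diff output, avoids having to deal with the "<" and ">" prefixes that diff adds to each line.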

#2


2  

xmlstarlet ed -d '//comment()' file.xml
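
Applied to the original diff problem, this could look roughly like the sketch below. It assumes xmlstarlet is installed; the function name strip_xml_comments_xs is hypothetical.

```shell
# Hypothetical helper name; requires xmlstarlet to be installed.
strip_xml_comments_xs() {
  # "ed -d" deletes every node matched by the XPath expression
  # and writes the edited document to standard output.
  xmlstarlet ed -d '//comment()' "$1"
}

# Usage in bash (file names are placeholders):
#   diff <(strip_xml_comments_xs file1.xml) <(strip_xml_comments_xs file2.xml)
```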

#3


2  

The simplest solution I could come up with to remove all comments from a text file is:

sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' | grep -zv '^<!--' | tr -d '\0'

To explain:

The sed will put in a null char like this:

<Table>
    \0<!--
   to be removed bla bla bla bla bla bl............

    removeee

    to be removeddddd
    -->\0

<row>
        <column name="example"  value="1" ></column>
    </row>
</Table>

Then grep -z treats that character as the "line separator" and drops every record that starts with <!--, and finally tr -d removes the \0 characters again.
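
The record-splitting step can be tried in isolation. The following is a small sketch that assumes GNU grep (for the -z option); the function name drop_comment_records is made up for illustration.

```shell
# Assumes GNU grep: -z makes NUL the record separator instead of newline.
# Hypothetical helper: drop records that begin with "<!--", then remove the NULs.
drop_comment_records() {
  grep -zv '^<!--' | tr -d '\0'
}

# Three NUL-separated records; the middle one is a comment and is dropped.
printf 'before\0<!-- gone -->\0after\n' | drop_comment_records
# prints "beforeafter"
```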

In this case, it should be applied to both files before comparing, e.g.:

diff <(sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' file1.xml | grep -zv '^<!--' | tr -d '\0') <(sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' file2.xml | grep -zv '^<!--' | tr -d '\0')

Or, more readably, with a function:

stripcomments() { sed 's/<!--/\x0<!--/g;s/-->/-->\x0/g' | grep -zv '^<!--' | tr -d '\0'; }

diff <(stripcomments < file1.xml) <(stripcomments < file2.xml)

CDATA blocks can cause problems: they can contain unbalanced comment delimiters, and they are more likely to contain significant null characters. But for most valid XML files this should work.

#4


0  

You can use perl together with xmllint to get this job done:

cat yourFile.xml | perl -e 'while (<>) { next if (/Start.*End/ );if (/Start/) { while (<>) {last if (/End/) }}else {print "$_"; }} ' | xmllint --format -

Here Start is your comment opener (in our case <!--) and End is your comment closer (in our case -->).

I tried grep -vP without good results, because I could not find out how to make grep's dot match newlines (the s modifier).
