Is there a Linux command-line utility for removing sections (not sure if that is the correct term) from an XML file?

Time: 2023-01-24 12:14:39

I am trying to do some manipulation of an XMLTV format file that contains TV schedule information. Within the file are sections that look like this:

  <programme start="20141215220000 -0500" stop="20141216060000 -0500" channel="someid.someaddress.com">
    <title lang="en">Local Programming</title>
    <length units="hours">1</length>
    <episode-num system="common">S00E00</episode-num>
    <episode-num system="dd_progid">SH00019112.0000</episode-num>
    <previously-shown />
  </programme>

As you can see, the second line contains this:

    <title lang="en">Local Programming</title>

What I would like to find is some kind of command line utility that runs in Linux, that can look for that specific line and if it exists, remove everything between and including the programme tags. I am not very familiar with XML files so I don't know if there is a specific name for a block of data such as this, but I just want to remove that entire section whenever the title is "Local Programming".

It would actually work better for my purposes if I could remove the block only when the title is "Local Programming" AND the channel value in the first line is a certain specific value, since I only need to remove these for a specific channel, but it would not hurt anything to remove all of the "Local Programming" blocks on any channel, and to look for two values would probably make this a much more difficult problem. It has to be a command line utility because it will be called from a short shell script.

Basically I'm just trying to identify the best tool for the job. I'm not a programmer (unless you count writing a bash shell script of a few lines that just runs several things sequentially as programming), so I'd like to stick with an existing command line tool if possible, but I'm not averse to pulling in something new with apt-get either. Any suggestions?

EDIT: What worked was the xmlstarlet tool suggested by Charles Duffy, but only if I did not attempt to use the --var option and instead specified the values directly. For example, this removed all blocks with the title "Local Programming" from a file xmltv.xml:

xmlstarlet ed --delete "//programme[title='Local Programming']" <xmltv.xml >newfile.xml

And if I want to remove the block only when the title is "Local Programming" AND the channel value in the first line is a certain specific value, then it appears that this works:

xmlstarlet ed --delete "//programme[title='Local Programming'][@channel='someid.someaddress.com']" <xmltv.xml >newfile.xml

This is exactly what I was looking for, so I consider the problem solved. Thank you to all who replied.

2 Answers

#1


5  

To delete any program having both the English-language title Local Programming and the channel someid.someaddress.com:

xmlstarlet ed \
  --var chan "'someid.someaddress.com'" \
  --var name "'Local Programming'" \
  --delete '//programme[title[@lang="en"]=$name][@channel=$chan]' \
  <in.xml >out.xml && mv out.xml in.xml

If you're targeting an older XMLStarlet release, you may need to do the substitutions yourself -- using "Local Programming" in place of $name and "someid.someaddress.com" in place of $chan -- but the above is known to work against the 1.5.0 release.
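
For older releases, one way to do that substitution from a script is to let the shell paste the values into the XPath expression. The following is only a sketch: `remove_programmes` is a hypothetical helper name, not something XMLStarlet provides, and it assumes that neither value contains a single quote, because the values end up inside single-quoted XPath string literals.

```shell
#!/bin/sh
# Sketch: delete <programme> blocks by title and channel without --var.
# remove_programmes is a hypothetical helper name, not part of XMLStarlet.
# Assumption: neither $name nor $chan contains a single quote.
remove_programmes() {
  file=$1 name=$2 chan=$3
  # The double-quoted XPath lets the shell expand $name and $chan before
  # xmlstarlet sees the expression. Output goes to a scratch file first so
  # the original is replaced only if xmlstarlet exits successfully.
  xmlstarlet ed \
    --delete "//programme[title[@lang='en']='$name'][@channel='$chan']" \
    "$file" >"$file.new" && mv "$file.new" "$file"
}

# Example call:
# remove_programmes xmltv.xml 'Local Programming' 'someid.someaddress.com'
```

Writing to `$file.new` and renaming afterwards mirrors the `>out.xml && mv out.xml in.xml` pattern above; the scratch file name is an arbitrary choice.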

This requires the tool XMLStarlet, which should be available for installation in your distribution vendor's repository.

Note that you didn't show your document's namespace declarations -- if xmlns='...' has been specified in a parent, some adjustment may be called for.
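
If the file does declare a default namespace, the usual adjustment is to bind a prefix to it with `-N` and use that prefix on every element name in the expression. This is a sketch under assumptions: `remove_programmes_ns` is a hypothetical helper name, the prefix `t` is arbitrary, and the namespace URI in the example call is a made-up placeholder to be replaced by whatever your file's `xmlns='...'` actually says.

```shell
#!/bin/sh
# Sketch: same deletion for a document that declares a default namespace.
# remove_programmes_ns is a hypothetical helper name; the prefix "t" is
# arbitrary and only has to match between -N and the XPath expression.
remove_programmes_ns() {
  file=$1 ns=$2
  # Element names take the prefix; the unqualified channel attribute does not.
  xmlstarlet ed -N "t=$ns" \
    --delete "//t:programme[t:title='Local Programming'][@channel='someid.someaddress.com']" \
    "$file"
}

# Example call (the URI is a placeholder, not a real XMLTV namespace):
# remove_programmes_ns in.xml 'http://example.com/xmltv-ns' >out.xml
```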

#2


2  

In addition to the proper XML handling, as exemplified in the other answer, one can always resort to the old-fashioned way: by handling the XML as plain text. In Perl:

cat fancy.xml |
perl -ne 'BEGIN{$/=undef;} print grep { /^<programme/ ? !m{<title\s+lang="en">Local\s+Programming</title>} : 1 } split qr{(<programme.*?</programme>)}s'

That reads the whole input XML at once (by undefining the input record separator), cuts it into a flat list of programme blocks and the text between them (the split()), and then filters out the programme blocks that contain the sought string (the grep()).
