Preserving formatting when updating XML files with Groovy

I have a large number of XML files that contain URLs. I'm writing a Groovy utility to find each URL and replace it with an updated version.

Given example.xml:

<?xml version="1.0" encoding="UTF-8"?>
<page>
    <content>
        <section>
            <link>
                <url>/some/old/url</url>
            </link>
            <link>
                <url>/some/old/url</url>
            </link>
        </section>
        <section>
            <link>
                <url>
                    /a/different/old/url?with=specialChars&amp;escaped=true
                </url>
            </link>
        </section>
    </content>
</page>

Once the script has run, example.xml should contain:

<?xml version="1.0" encoding="UTF-8"?>
<page>
    <content>
        <section>
            <link>
                <url>/a/new/and/improved/url</url>
            </link>
            <link>
                <url>/a/new/and/improved/url</url>
            </link>
        </section>
        <section>
            <link>
                <url>
                    /a/different/new/and/improved/url?with=specialChars&amp;stillEscaped=true
                </url>
            </link>
        </section>
    </content>
</page>

This is easy to do with Groovy's excellent XML support, except that I want to change the URLs and nothing else about the file.

By that I mean:

whitespace must not change (files might contain spaces, tabs, or both)
comments must be preserved
Windows vs. Unix-style line separators must be preserved
the XML declaration at the top must not be added or removed
attributes in tags must retain their order

So far, after trying many combinations of XmlParser, DOMBuilder, XmlNodePrinter, XmlUtil.serialize(), and so on, I've landed on reading each file line by line and applying an ugly hybrid of the XML utilities and regular expressions.

Reading and writing each file:

files.each { File file ->
    def text = file.text // read the content once
    // Detect the line-ending style and whether the file ends with a newline
    def lineEnding = text.contains('\r\n') ? '\r\n' : '\n'
    def newLineAtEof = text.endsWith(lineEnding)
    def lines = file.readLines()
    file.withWriter { w ->
        lines.eachWithIndex { line, index ->
            line = update(line)
            w.write(line)
            // size() is called as a method on the List
            if (index < lines.size() - 1) w.write(lineEnding)
            else if (newLineAtEof) w.write(lineEnding)
        }
    }
}

Searching for and updating URLs within a line:

def matcher = (line =~ urlTagRegexp) //matches a <url> element and its contents
matcher.each { groups ->
    def urlNode = new XmlParser().parseText(line)
    def url = urlNode.text()
    def newUrl = translate(url)
    if (newUrl) {
        urlNode.value = newUrl
        def replacement = nodeToString(urlNode)
        line = matcher.replaceAll(replacement)
    }
}

def nodeToString(node) {
    def writer = new StringWriter()
    writer.withPrintWriter { printWriter ->
        def printer = new XmlNodePrinter(printWriter)
        printer.preserveWhitespace = true
        printer.print(node)
    }
    writer.toString().replaceAll(/[\r\n]/, '')
}

This mostly works, except it can't handle a tag split over multiple lines, and messing with newlines when writing the files back out is cumbersome.
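
One possible way around the multi-line limitation, sketched here under a few assumptions, is to drop the line-by-line loop and run a single regular expression over the whole file text. Everything outside the matched URL text is copied through unchanged, so whitespace, comments, line endings and the declaration are not touched. The translate method below is a hypothetical stand-in for the one referenced above (its real behaviour is not shown), and updateUrls is a name made up for this sketch. It assumes <url> tags carry no attributes, it works on the escaped text (so &amp; stays &amp;), and it would also rewrite a <url> element that happens to sit inside a comment or CDATA section.

```groovy
// Hypothetical stand-in for translate(): returns null when no mapping is known.
String translate(String url) {
    ['/some/old/url': '/a/new/and/improved/url'][url]
}

// Rewrites the text of every <url> element in the raw XML string.
// Group 1 is the opening tag plus leading whitespace, group 2 the URL text,
// group 3 the trailing whitespace plus closing tag. (?s) lets '.' match newlines.
String updateUrls(String xml) {
    xml.replaceAll(/(?s)(<url>\s*)(.*?)(\s*<\/url>)/) { all, open, url, close ->
        def newUrl = translate(url)
        // Leave the match untouched when there is no translation
        newUrl ? open + newUrl + close : all
    }
}
```

Applied to a file, this could look like file.setText(updateUrls(file.getText('UTF-8')), 'UTF-8'), assuming the files are UTF-8 encoded.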

I'm new to groovy, but I feel like there must be a groovier way of doing this.

2 solutions

#1

I just created a gist at https://gist.github.com/akhikhl/8070808 to demonstrate how such a transformation can be done with Groovy and JDOM2.

Important notes:

Groovy can use any Java library. If something cannot be done with the Groovy JDK, it can be done with another library.
The jaxen library (which implements XPath) should be included explicitly (via @Grab or via Maven/Gradle), since it is an optional dependency of JDOM2.
The sequence of @Grab/@GrabExclude instructions works around jaxen's quirky dependency on JDOM-1.0.
XPathFactory.compile also supports variable binding and filters (see the online javadoc).
XPathExpression (which is returned by compile) supports two major functions: evaluate and evaluateFirst. evaluate always returns a list of all XML nodes matching the expression, while evaluateFirst returns just the first matching XML node.
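
The notes above might translate into something like the following sketch. It is not the code from the gist: the library versions in the @Grab lines, the sample document and the translate helper are all assumptions made for illustration.

```groovy
@Grapes([
    @Grab('org.jdom:jdom2:2.0.6'),   // version is an assumption
    @Grab('jaxen:jaxen:1.1.6'),      // version is an assumption
    @GrabExclude('jdom:jdom')        // keep the old JDOM 1.x artifact out
])
import org.jdom2.filter.Filters
import org.jdom2.input.SAXBuilder
import org.jdom2.output.Format
import org.jdom2.output.LineSeparator
import org.jdom2.output.XMLOutputter
import org.jdom2.xpath.XPathFactory

// Hypothetical URL mapping; returns null when no translation is known.
String translate(String url) {
    ['/some/old/url': '/a/new/and/improved/url'][url]
}

def xml = '''<?xml version="1.0" encoding="UTF-8"?>
<page>
  <!-- a comment -->
  <link><url>/some/old/url</url></link>
  <link>
    <url>
      /some/old/url
    </url>
  </link>
</page>'''

def doc = new SAXBuilder().build(new StringReader(xml))

// Compile the XPath once; Filters.element() restricts results to elements.
def xpath = XPathFactory.instance().compile('//url', Filters.element())
def urls = xpath.evaluate(doc)        // all matching elements
def first = xpath.evaluateFirst(doc)  // only the first match

urls.each { el ->
    def old = el.getText()
    def updated = translate(old.trim())
    // Swap only the URL itself so whitespace inside the element is kept
    if (updated) el.setText(old.replace(old.trim(), updated))
}

def format = Format.getRawFormat()
format.setLineSeparator(LineSeparator.NONE)
def result = new XMLOutputter(format).outputString(doc)
```

One thing to watch: XMLOutputter writes an XML declaration by default, so a file that had none would gain one unless Format.setOmitDeclaration(true) is used for it.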

Update

The following code:

new XMLOutputter().with {
  format = Format.getRawFormat()
  format.setLineSeparator(LineSeparator.NONE)
  output(doc, System.out)
}

solves the problem of preserving whitespace and line separators. getRawFormat constructs a format object that preserves whitespace. LineSeparator.NONE tells the format object not to convert line separators.

The gist mentioned above contains this new code as well.

#2

There is a solution that doesn't require any third-party library.

def xml = file.text
def document = groovy.xml.DOMBuilder.parse(new StringReader(xml))
def root = document.documentElement
use(groovy.xml.dom.DOMCategory) {
    // manipulate the XML here, i.e. root.someElement?.each { it.value = 'new value'}
}

def result = groovy.xml.dom.DOMUtil.serialize(root)

file.withWriter { w ->
    w.write(result)
}

Taken from http://jonathan-whywecanthavenicethings.blogspot.de/2011/07/keep-your-hands-off-of-my-whitespace.html
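
As a rough illustration of what could replace the placeholder comment in the snippet above, the parsed document can be updated with plain W3C DOM calls. The translate helper and the sample XML are assumptions made for this sketch, and it only covers the manipulation step, not serialization.

```groovy
import groovy.xml.DOMBuilder

// Hypothetical URL mapping; returns null when no translation is known.
String translate(String url) {
    ['/some/old/url': '/a/new/and/improved/url'][url]
}

def xml = '''<page>
  <url>
    /some/old/url
  </url>
</page>'''

def document = DOMBuilder.parse(new StringReader(xml))
def urlNodes = document.getElementsByTagName('url')

// NodeList is index-based, so walk it by position
(0..<urlNodes.length).each { i ->
    def node = urlNodes.item(i)
    def old = node.textContent
    def updated = translate(old.trim())
    // Replace only the URL so the whitespace inside the element survives
    if (updated) node.textContent = old.replace(old.trim(), updated)
}
```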
