I have an XML document that has HTML tags included:

<chapter>
      <h1>title of content</h1>
      <p> my paragraph ... </p>
 </chapter>

I need to get the content of the <chapter> tag, so my output should be:

      <h1>title of content</h1>
      <p> my paragraph ... </p>

My question is similar to this post: How parse XML to get one tag and save another tag inside

But I need to implement it in Java. Should I use SAX, DOM, or ...?

I found a solution using SAX in this post: SAX Parser : Retrieving HTML tags from XML, but it is very buggy and doesn't work with large amounts of XML data.

Updated:

My SAX implementation is below. In some situations it throws an exception: java.lang.StringIndexOutOfBoundsException: String index out of range: -4029

public class MyXMLHandler extends DefaultHandler {

private boolean tagFlag = false;

private char[] temp;
String insideTag;
private int startPosition;
private int endPosition;
private String tag;

public void startElement(String uri, String localName, String qName,
        Attributes attributes) throws SAXException {


    if (qName.equalsIgnoreCase(tag)) {
        tagFlag = true;
    }

}

public void endElement(String uri, String localName, String qName)
        throws SAXException {

    if (qName.equalsIgnoreCase(tag)) {

        insideTag = new String(temp, startPosition, endPosition - startPosition);
        tagFlag = false;
    }

}

public void characters(char ch[], int start, int length)
        throws SAXException {
    temp = ch;
    if (tagFlag) {
        startPosition = start;
        tagFlag = false;
    }
    endPosition = start + length;
}

public String getInsideTag(String tag) {
    this.tag = tag;
    return insideTag;
}

}

Update 2: (Using StringBuilder)

I now accumulate the characters in a StringBuilder like this:

public class MyXMLHandler extends DefaultHandler {

private boolean tagFlag = false;

private char[] temp;
String insideTag;
private String tag;
private StringBuilder builder;

public void startElement(String uri, String localName, String qName,
        Attributes attributes) throws SAXException {

    if (qName.equalsIgnoreCase(tag)) {
        builder = new StringBuilder();
        tagFlag = true;
    }

}

public void endElement(String uri, String localName, String qName)
        throws SAXException {

    if (qName.equalsIgnoreCase(tag)) {
        insideTag = builder.toString();
        tagFlag = false;
    }
}

public void characters(char ch[], int start, int length)
        throws SAXException {
    if (tagFlag) {
        builder.append(ch, start, length);
    }
}

public String getInsideTag(String tag) {
    this.tag = tag;
    return insideTag;
}

}

But builder.append(ch, start, length); only appends text, so start and end tags such as <EmbeddedTag atr="..."> and </EmbeddedTag> never reach the buffer. This code prints:

      title of content
      my paragraph ...

Instead of expected output:

      <h1>title of content</h1>
      <p> my paragraph ... </p>

Update 3:

Finally I have implemented the parser handler:

 public class MyXMLHandler extends DefaultHandler {

private boolean tagFlag = false;
private String insideTag;
private String tag;
private StringBuilder builder;

public void startElement(String uri, String localName, String qName,
        Attributes attributes) throws SAXException {

    if (qName.equalsIgnoreCase(tag)) {
        builder = new StringBuilder();
        tagFlag = true;
    }

    if (tagFlag) {
        builder.append("<" + qName);
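        for (int i = 0; i < attributes.getLength(); i++) {
            // getQName is used because getLocalName can be empty when the
            // parser is not namespace-aware (the SAXParserFactory default)
            builder.append(" " + attributes.getQName(i) + "=\"" +
                    attributes.getValue(i) + "\"");
        }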
         builder.append(">");
    }
}

public void endElement(String uri, String localName, String qName)
        throws SAXException {

    if (tagFlag) {
        builder.append("</" + qName + ">");
    }

    if (qName.equalsIgnoreCase(tag)) {
        insideTag = builder.toString();                     
        tagFlag = false;
    }
    System.out.println("End Element :" + qName);

}

public void characters(char ch[], int start, int length)
        throws SAXException {
    if (tagFlag) {
        builder.append(ch, start, length);
    }
}

public String getInsideTag(String tag) {
    this.tag = tag;
    return insideTag;
}

}

2 Solutions

#1

The problem with your code is that you try to remember the start and end positions of the string passed to you via the characters method. The exception you see is the result of an inside tag that starts near the end of one character buffer and ends near the beginning of the next one: the remembered start position is then larger than the end position, so the computed length is negative.

With SAX you need to copy the characters when they are offered; otherwise the temporary buffer they occupy may already have been overwritten by the time you need them.
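
To make the buffer issue concrete, here is a small simulation (not a real parse). It assumes, purely for illustration, that the parser delivers one text node in two chunks through the same reused char array. The class and method names (BufferReuseDemo, OffsetHandler, CopyingHandler, deliverInTwoChunks) are made up for this sketch; only DefaultHandler.characters comes from the SAX API.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class BufferReuseDemo {

    /** Remembers offsets into the parser's buffer, like the first handler in the question. */
    static class OffsetHandler extends DefaultHandler {
        private char[] temp;
        private int startPosition = -1;
        private int endPosition;

        @Override
        public void characters(char[] ch, int start, int length) {
            temp = ch;
            if (startPosition < 0) {
                startPosition = start;      // offset taken from the first chunk
            }
            endPosition = start + length;   // offset taken from the latest chunk
        }

        String result() {
            return new String(temp, startPosition, endPosition - startPosition);
        }
    }

    /** Copies every chunk as soon as it is offered. */
    static class CopyingHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        String result() {
            return text.toString();
        }
    }

    /** Hypothetical stand-in for a parser: one text node, two chunks, one reused buffer. */
    static void deliverInTwoChunks(DefaultHandler handler) {
        char[] buffer = new char[20];
        try {
            "title of ".getChars(0, 9, buffer, 11);   // first chunk fills the end of the buffer
            handler.characters(buffer, 11, 9);
            "content".getChars(0, 7, buffer, 0);      // buffer refilled from the start
            handler.characters(buffer, 0, 7);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Text collected by the copying handler. */
    static String copied() {
        CopyingHandler handler = new CopyingHandler();
        deliverInTwoChunks(handler);
        return handler.result();
    }

    /** True if the offset-based handler fails with an index error (length 7 - 11 = -4). */
    static boolean offsetsFail() {
        OffsetHandler handler = new OffsetHandler();
        deliverInTwoChunks(handler);
        try {
            handler.result();
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }
}
```

In this simulation the copying handler ends up with the full text, while the offset-based handler computes a negative length, which is the same kind of failure as the -4029 in the question.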

Your best bet is not to remember the positions in the buffers, but to create a new StringBuilder in startElement and add the characters to that, then get the complete string out of the builder in endElement.
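
Building on that advice, the following is one possible sketch of a handler that also re-emits the nested tags, so the inner markup of the target element is returned without the wrapping element itself. It is not the code from the question: the class name InnerMarkupHandler, the depth counter, the escape helper and the extract method are all assumptions of this sketch. It relies on the default, non-namespace-aware SAXParserFactory, and it writes empty elements as an open/close pair.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class InnerMarkupHandler extends DefaultHandler {

    private final String tag;                                  // element whose content is wanted
    private final StringBuilder builder = new StringBuilder();
    private int depth = 0;                                     // 0 means outside the target element

    InnerMarkupHandler(String tag) {
        this.tag = tag;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (depth > 0) {
            // Nested element: rebuild its start tag, including attributes
            builder.append('<').append(qName);
            for (int i = 0; i < attributes.getLength(); i++) {
                builder.append(' ').append(attributes.getQName(i))
                       .append("=\"").append(escape(attributes.getValue(i))).append('"');
            }
            builder.append('>');
            depth++;
        } else if (qName.equals(tag)) {
            depth = 1;                                         // entered the target; do not emit it
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 1) {
            builder.append("</").append(qName).append('>');    // closing tag of a nested element
        }
        if (depth > 0) {
            depth--;                                           // reaches 0 when the target closes
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            // Copy the chunk right away; the parser may reuse ch afterwards
            builder.append(escape(new String(ch, start, length)));
        }
    }

    String getContent() {
        return builder.toString();
    }

    /** Re-escapes characters that the parser has already unescaped. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Convenience helper: parses an XML string and returns the inner markup of the given tag. */
    static String extract(String xml, String tag) {
        try {
            InnerMarkupHandler handler = new InnerMarkupHandler(tag);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

For example, InnerMarkupHandler.extract(xml, "chapter") on the sample document should return the h1 and p elements with their tags intact. If the target element occurs more than once, this sketch simply concatenates the contents of all occurrences.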

#2

Try Digester. I used it years ago (version 1.5), and it made it simple to create a mapping for XML like yours. There is a simple article on how to use Digester, but it covers version 1.5; the current version is 3.0, I think, and it probably contains a lot of new features ...
