Java SAX parser splits text across multiple calls to characters()

Date: 2021-11-12 00:50:43

I am working on a project that parses some data from XML.

For example, the XML is:

<abc>abcdefghijklmno</abc>

I need to extract the text "abcdefghijklmno".

But while testing my parser, I discovered a big problem:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class parser extends DefaultHandler {
    // true while the parser is inside an <abc> element
    private boolean hasABC = false;

    // Constructor HERE
    // ......................
    // ......................

    @Override
    public void startDocument() throws SAXException {
    }

    @Override
    public void endDocument() throws SAXException {
    }

    @Override
    public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException {
        if ("abc".equalsIgnoreCase(localName)) {
            this.hasABC = true;
        }
    }

    @Override
    public void endElement(String namespaceURI, String localName, String qName) throws SAXException {
        if ("abc".equalsIgnoreCase(localName)) {
            this.hasABC = false;
        }
    }

    @Override
    public void characters(char ch[], int start, int length) {
        // Prints once per callback, so each chunk shows up on its own line
        String content = new String(ch, start, length).trim();
        if (this.hasABC) {
            System.out.println("ABC = " + content);
        }
    }
}

I found that the parser called characters() twice for this one element. The printed output is:

ABC = abcdefghi

ABC = jklmno <<============ the message is split

Why does the parser call characters() twice?

Does the XML contain a "\n" or "\r"?

4 solutions

#1


33  

The parser calls the characters method more than once because the spec allows it. This lets parsers stay fast and keep their memory footprint low. If you want a single string, create a new StringBuilder in startElement, append to it in characters, and process it in endElement.
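The StringBuilder approach described above can be sketched as follows. This is a minimal sketch, not the asker's actual class: the name AbcTextHandler and the helpers parse and getResult are made up for illustration. It matches on qName rather than localName on the assumption that the parser comes from the default SAXParserFactory, which is not namespace-aware unless configured to be.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of an <abc> element.
class AbcTextHandler extends DefaultHandler {
    private StringBuilder text;   // non-null only while inside <abc>
    private String result;        // complete text of the last <abc> seen

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("abc".equalsIgnoreCase(qName)) {
            text = new StringBuilder();   // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);   // may run several times per element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("abc".equalsIgnoreCase(qName) && text != null) {
            result = text.toString().trim();  // trim once, on the whole text
            text = null;
        }
    }

    String getResult() {
        return result;
    }

    // Convenience helper (an assumption, not part of the original post).
    static String parse(String xml) throws Exception {
        AbcTextHandler handler = new AbcTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getResult();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("ABC = " + parse("<abc>abcdefghijklmno</abc>"));
    }
}
```

Trimming the whole text in endElement, rather than each chunk in characters, also avoids losing whitespace that happens to fall on a chunk boundary.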

#2


7  

You may be surprised, but this is documented behavior: you can't assume that the parser will read and return all the text data of an element in a single callback. I ran into the same thing earlier. You need to write code to handle this situation, or you can switch to a StAX parser. You can use a CharArrayWriter to accumulate the data across multiple callbacks.
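For the StAX route, a rough sketch might look like the following. The class and method names (StaxAbcReader, readAbc) are invented for illustration. It relies on XMLStreamReader.getElementText(), which returns the complete text of a text-only element, so the caller does not have to stitch chunks together.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: pulls the text of the first <abc> element with StAX.
class StaxAbcReader {
    static String readAbc(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "abc".equalsIgnoreCase(reader.getLocalName())) {
                    // Reads up to the matching end tag and returns all the text
                    return reader.getElementText().trim();
                }
            }
            return null;   // no <abc> element found
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("ABC = " + readAbc("<abc>abcdefghijklmno</abc>"));
    }
}
```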

See the following excerpt from the JavaDoc of ContentHandler.characters(...):

The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.

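The CharArrayWriter suggestion from this answer could be sketched like this. The class name CharArrayWriterHandler and its helper methods are assumptions made for the example, and the handler matches on qName because the default SAXParserFactory is not namespace-aware.

```java
import java.io.CharArrayWriter;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: accumulates <abc> text in a reusable CharArrayWriter.
class CharArrayWriterHandler extends DefaultHandler {
    private final CharArrayWriter buffer = new CharArrayWriter();
    private boolean insideAbc = false;
    private String result;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("abc".equalsIgnoreCase(qName)) {
            buffer.reset();        // discard anything left from a previous element
            insideAbc = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (insideAbc) {
            buffer.write(ch, start, length);   // append this chunk
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("abc".equalsIgnoreCase(qName)) {
            result = buffer.toString().trim(); // all chunks joined
            insideAbc = false;
        }
    }

    String getResult() {
        return result;
    }

    static String parse(String xml) throws Exception {
        CharArrayWriterHandler handler = new CharArrayWriterHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getResult();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("ABC = " + parse("<abc>abcdefghijklmno</abc>"));
    }
}
```

Compared with allocating a new StringBuilder per element, the single writer is reset and reused, which may matter for documents with many elements.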

#3


4  

You can change the start, end, and characters methods like this:

  • add a "global" (field-level) content variable
  • set it to null in the start method (content = null)
  • in the end method you can println it or add the content string to some object
  • in the characters method you can use an if/else:

    if (content == null)
    {
        content = new String(ch, start, length);
    } else {
        content += new String(ch, start, length);
    }

    This is a crude approach (a StringBuilder is better), but it works and the string is no longer split.

#4


3  

This is a feature of SAX. The parser can split text segments and call your characters method as many times as it likes.

The reason for this is performance, which SAX prioritises over ease of use. The parser may have filled its internal buffer, so to avoid copying, it passes the data it has so far through to your code.
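To see the splitting on your own parser, a small experiment like the one below can help. It is only a sketch: the class name ChunkCountingHandler, the payload helper, and the 100,000-character payload size are arbitrary choices. How many times characters() fires depends on the parser implementation and its buffer sizes, so the count is printed rather than assumed.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts characters() callbacks and joins the chunks.
class ChunkCountingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int calls = 0;

    @Override
    public void characters(char[] ch, int start, int length) {
        calls++;                          // one increment per chunk delivered
        text.append(ch, start, length);   // chunks arrive in document order
    }

    int getCalls() {
        return calls;
    }

    String getText() {
        return text.toString();
    }

    static ChunkCountingHandler parse(String xml) throws Exception {
        ChunkCountingHandler handler = new ChunkCountingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    // Builds a string of n 'x' characters to use as element text.
    static String payload(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'x');
        return new String(chars);
    }

    public static void main(String[] args) throws Exception {
        String payload = payload(100_000);
        ChunkCountingHandler handler = parse("<abc>" + payload + "</abc>");
        System.out.println("characters() calls: " + handler.getCalls());
        System.out.println("text intact: " + payload.equals(handler.getText()));
    }
}
```

Whatever the call count turns out to be, the concatenation of the chunks should equal the original text, which is the only property a handler can safely rely on.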
