How do I parse XML correctly using SAX?

Time: 2022-10-26 20:56:46

I am receiving an XML document from a REST service, and it has to be parsed using SAX. Please see the example below, which was generated from the XSD.

Setting up the parser is not a problem. My main problem is the actual processing in startElement(), endElement() and the other callback methods. I don't understand how to extract the items I need and store them, because they are somewhat "nested".

Example

The ConnectionList can occur once or twice and may contain any number of Connection elements, which in turn hold the details of a connection. Basically, I need a list of all connections with their Date, Transfers and Time. Do I have to create one class per element?

As far as I understand it, I somehow need to do the following. If the parser comes across a...

  • ConnectionList: Create new ConnectionList object and put it into a list of ConnectionLists
  • Connection: Create a new Connection object and put it into a list of Connections
  • Date, Transfers, Time (only if parent is Duration): Store the node value in the current Connection object

I'd really appreciate any help, hint, idea or snippet showing how I can achieve this.

Thanks :-)

Robert

<?xml version="1.0" encoding="UTF-8"?>
<ResC xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Err code="r5E5a1Wm" text="tk-gWYbw" level="E"/>
    <Err code="takVDd34" text="XtvyjmjPuscK" level="E"/>
    <Err code="hQ1-:aDQ" text="YWc5qtY.gkwCeJW2S" level="E"/>
    <ConRes dir="R">
        <Err code="ZfwPC:tj" text="RKKFuLXoM0oOfp3a" level="E"/>
        <Err code="bhDjSJPa" text="BJoHuOMdwzhcddW" level="E"/>
        <Err code="CX-NhK9r" text="j55qy-WiNPXu" level="E"/>
        <ConResCtxt b="1" f="1">0815</ConResCtxt>
        <ConnectionList type="IV">
            <Err code="WI3WX.jo" text="rK3H5jwa-Zfen3" level="E"/>
            <Connection id="ID000">
                <Overview>
                    <Date>b3lcM_Yiyq7dqL9</Date>
                    <Departure>
                        <BasicStop type="NORMAL" index="-1086549314">
                            <Address externalId="t.EdKe93xkqFqLwPzgd-4vHSJemy8"
                                externalStationNr="1332105793" name="fdREYJPu83WV503V8szdCX"
                                x="951177990" y="-1579782776" z="1807457957" type="WGS84"/>
                        </BasicStop>
                    </Departure>
                    <Arrival>
                        <BasicStop type="NORMAL" index="1897526979">
                            <Address externalId="l7h_GTUit6fv" externalStationNr="-1670310329"
                                name="WJznDTzkTvyET51pfr7X" x="-1738098662" y="-170353174"
                                z="-475585957" type="WGS84"/>
                        </BasicStop>
                    </Arrival>
                    <Transfers>dZbgZfDH8j1hb1i</Transfers>
                    <Duration>
                        <Time>00d00:18:00</Time>
                    </Duration>
                    <ServiceDays> </ServiceDays>
                    <Products>
                        <Product cat="qmrN2dShHJp"/>
                        <Product cat="Hg"/>
                        <Product cat="nurxhdl3w.P0x7FRv2J3UoF"/>
                    </Products>
                    <ContextURL url="http://FzgEqiVC/"/>
                </Overview>
            </Connection>
            <Connection id="ID004">
                <Overview>
                    <Date>W5a47DRkc7XDZjhwq_s5Un.</Date>
                    <Departure>
                        <BasicStop type="NORMAL" index="-1014429844">
                            <Address externalId="RMnzjEFOTTdM1oaAUw" externalStationNr="1429101638"
                                name="HF-1" x="1005198487" y="570832676" z="975615566" type="WGS84"
                            />
                        </BasicStop>
                    </Departure>
                    <Arrival>
                        <BasicStop type="NORMAL" index="-58308182">
                            <Address externalId="rVdwdQvAukfj2QcA7b3OSdGOyW"
                                externalStationNr="1142334006" name="g" x="-1791416159"
                                y="-541300941" z="478129823" type="WGS84"/>
                        </BasicStop>
                    </Arrival>
                    <Transfers>GG56XN6zgiJF804mE_N4o</Transfers>
                    <Duration> </Duration>
                    <ServiceDays> </ServiceDays>
                    <Products>
                        <Product cat="fs_Oyoy9NYBai-qaxbty6j9Y7r1St"/>
                        <Product cat="P2CbaSGpC"/>
                        <Product cat="CGZrqSIDM6M4kUlb8_xZ8jRlH4c"/>
                    </Products>
                    <ContextURL url="http://JkRhuXtu/"/>
                </Overview>
            </Connection>
        </ConnectionList>
        <ConnectionList type="IV">
            <Err code="0lFWRY2X" text="KLmdczFRhV" level="E"/>
            <Connection id="ID012">
                <Overview>
                    <Date>t8mn634zjCZsRPyxj_e_-UYMH</Date>
                    <Departure>
                        <BasicStop type="NORMAL" index="-2095085423">
                            <Address externalId="ftKAFG-Uk7x" externalStationNr="1390920810"
                                name="JQrQXOQbm.FLaCMeSiTYjT" x="1970142849" y="-655980297"
                                z="2102464970" type="WGS84"/>
                        </BasicStop>
                    </Departure>
                    <Arrival>
                        <BasicStop type="NORMAL" index="1552118247">
                            <Address externalId="qcBpeuPDRzvSt1o" externalStationNr="-1133118359"
                                name="AJiJOB1t" x="-1422533132" y="-1158953133" z="484831466"
                                type="WGS84"/>
                        </BasicStop>
                    </Arrival>
                    <Transfers>D0MiUwW9nuuM_uykvawg2C07pwHL</Transfers>
                    <Duration> </Duration>
                    <ServiceDays> </ServiceDays>
                    <Products>
                        <Product cat="LpGOZbLDbJm"/>
                        <Product cat="JIv-szQVX2icPb"/>
                        <Product cat="Q7-pthWoOT"/>
                    </Products>
                    <ContextURL url="http://zGWgivvi/"/>
                </Overview>
                <IList>
                    <I header="ze4Wt3hVD-DvjujY6QKae" text="lVwB4RxAHcYq3.F"
                        uriCustom="iVjQJCoU1MVOv2Z9lwarP"/>
                    <I header="z-i.au59soMzXLZCbV" text="PoTP" uriCustom="ksrbwEH6scNR"/>
                    <I header="N" text="jHDA4" uriCustom="ub95811lMIa_495ZbPOuNWL0rRWh"/>
                </IList>
                <CommentList>
                    <Comment id="ID013">
                        <Text lang="EN"> </Text>
                        <Text lang="FR"> </Text>
                        <Text lang="PL"> </Text>
                    </Comment>
                    <Comment id="ID014">
                        <Text lang="DK"> </Text>
                        <Text lang="IT"> </Text>
                        <Text lang="IT"> </Text>
                    </Comment>
                    <Comment id="ID015">
                        <Text lang="MACRO"> </Text>
                        <Text lang="IT"> </Text>
                        <Text lang="EN"> </Text>
                    </Comment>
                </CommentList>
            </Connection>
        </ConnectionList>
    </ConRes>
</ResC>

6 Answers

#1


7  

The best way I've found (so far) of parsing XML using SAX is to use a stack and conditional statements in the relevant callbacks. Here's an article describing it, and my summary of it:

The basic premise is that as you parse the document, you create objects to store the parsed data, pushing them onto the stack as you go, peeking at the top of the stack to add data to the current element, and at the end of each element popping it off the stack and storing it in the parent.

The effect is that you parse the tree of elements depth first, and at the end of each branch you roll it back into the parent until you're left with a single object (such as your ConnectionList) that contains all of the parsed data, ready to be used. Essentially, you end up with a series of objects that mirror the structure of the original XML.

That means you need some data objects that can store the data in the same structure as the XML. Complex elements will normally become classes, while simple elements will normally be attributes within classes. The root element is often represented by a list of some kind.

To start with, you create a stack object to hold the data as you parse it.

Then, at the start of each element, you identify its type using the localName.equals() method, create an instance of the appropriate class, and push it onto the stack. If the element is a simple element, you will probably model it as an attribute of the class representing the parent element, and you will need a series of flags that tell the handler whether such an element has been encountered and which element it is, so that it can be processed in the characters() method.

The actual data is read using the characters() method, and again you use conditional logic to determine what to do with the data, based on the value of the flag. Essentially, you peek at the top of the stack and use the appropriate method to write the data into the object, converting from text where necessary.

At the end of each element, you pop the top of the stack and use localName.equals() again to determine how to store it in the object below it (e.g. which setter method needs to be called).

When you reach the end of the document you should have captured all the data in the document.
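
The steps above could be sketched roughly as follows. This is only a minimal sketch, not the code from the article: the class names StackSaxSketch, Handler, ConnectionList and Connection and their fields are assumptions made for illustration. It collects the element text in a buffer and assigns it in endElement() instead of using flags, and it does not check that Time sits inside Duration. The factory is made namespace-aware so that localName is populated.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class StackSaxSketch {
    // Hypothetical data classes mirroring the parts of the XML we care about.
    static class Connection {
        String id, date, transfers, time;
    }
    static class ConnectionList {
        final List<Connection> connections = new ArrayList<>();
    }

    static class Handler extends DefaultHandler {
        final List<ConnectionList> result = new ArrayList<>();
        private final Deque<Object> stack = new ArrayDeque<>();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0); // start collecting text for this element
            if (localName.equals("ConnectionList")) {
                stack.push(new ConnectionList());
            } else if (localName.equals("Connection")) {
                Connection c = new Connection();
                c.id = atts.getValue("id");
                stack.push(c);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            if (localName.equals("ConnectionList")) {
                // Finished list: pop it and store it in the overall result.
                result.add((ConnectionList) stack.pop());
            } else if (localName.equals("Connection")) {
                // Finished connection: pop it and store it in its parent list.
                Connection c = (Connection) stack.pop();
                ((ConnectionList) stack.peek()).connections.add(c);
            } else if (stack.peek() instanceof Connection) {
                // Simple element: write its text into the object on top of the stack.
                Connection c = (Connection) stack.peek();
                if (localName.equals("Date")) c.date = value;
                else if (localName.equals("Transfers")) c.transfers = value;
                else if (localName.equals("Time")) c.time = value;
            }
            text.setLength(0);
        }
    }

    static List<ConnectionList> parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that localName is populated
        Handler handler = new Handler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ConRes><ConnectionList><Connection id='ID000'><Overview>"
                + "<Date>20221026</Date><Transfers>1</Transfers>"
                + "<Duration><Time>00d00:18:00</Time></Duration>"
                + "</Overview></Connection></ConnectionList></ConRes>";
        Connection c = parse(xml).get(0).connections.get(0);
        System.out.println(c.id + " " + c.date + " " + c.transfers + " " + c.time);
    }
}
```

Elements that are not modelled (Err, Departure, Products and so on) simply fall through the conditionals, so only the objects you actually need ever reach the stack.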

#2


6  

Your SAX event handler should act as a state machine. Your structure is pretty deep, so the state machine will be a bit complex; but this is the basic approach:

All variables are member variables.

When you encounter a startElement event, you instantiate an object representing that element then put the object on a stack (or set a flag indicating what value you are working with).

When you encounter a text event, read the text and set the appropriate value based on the flag you set in the previous step.

When you encounter an endElement event, you pull the current object off the stack and call the setter on the object that is now on the top of the stack.

When you exhaust the document, you should only have one object left on the stack which represents everything you've read.

#3


1  

If it's a reasonably small XML document and the memory/throughput constraints don't rule out an in-memory solution, you could use JAXB instead. You can generate the required classes from the XSD and simply unmarshal the XML into Java objects. If you must use a streaming parser, consider using StAX instead; I generally find it more intuitive.
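
To give an idea of what the StAX variant might look like, here is a minimal sketch using the cursor API (XMLStreamReader) from javax.xml.stream. The names StaxSketch and Connection and the fields are assumptions for illustration. It relies on Date, Transfers and Time being text-only elements, because getElementText() fails on elements with child elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxSketch {
    // Hypothetical holder for the three values the question asks for.
    static class Connection {
        String date, transfers, time;
    }

    static List<Connection> parse(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<Connection> out = new ArrayList<>();
        Connection current = null;
        boolean inDuration = false;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("Connection")) {
                    current = new Connection();
                } else if (name.equals("Duration")) {
                    inDuration = true;
                } else if (current != null && name.equals("Date")) {
                    current.date = reader.getElementText();
                } else if (current != null && name.equals("Transfers")) {
                    current.transfers = reader.getElementText();
                } else if (current != null && inDuration && name.equals("Time")) {
                    // Only accept Time when its parent is Duration.
                    current.time = reader.getElementText();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("Duration")) {
                    inDuration = false;
                } else if (name.equals("Connection")) {
                    out.add(current);
                    current = null;
                }
            }
        }
        reader.close();
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ConnectionList><Connection><Overview><Date>20221026</Date>"
                + "<Transfers>1</Transfers><Duration><Time>00d00:18:00</Time></Duration>"
                + "</Overview></Connection></ConnectionList>";
        for (Connection c : parse(xml)) {
            System.out.println(c.date + " " + c.transfers + " " + c.time);
        }
    }
}
```

Because the application pulls the events itself, the loop reads top to bottom and the text of a simple element can be fetched in one call, which is probably what makes it feel more intuitive than the SAX callbacks.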

#4


1  

Generally speaking, you have a couple of choices:

  1. Use custom objects to map the XML to; these objects will encapsulate further objects, much as the XML elements nest.
  2. Do a generic parse and traverse the DOM via relative elements.

To my knowledge, there are tools such as JAXB that will generate your classes from XSDs, but they can come at a price, as generated code often does.

If you go with option 1 and "roll your own", you'll need to provide methods for unmarshaling and marshaling that convert to and from XML (most likely Strings). Something like:

<Foo>
  <Bar>
    <Baz></Baz>
  </Bar>
  <Thing></Thing>
</Foo>

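import org.w3c.dom.Element; // sketch only; needed in both files
// In Foo.java
class Foo {
    Bar bar;
    String thing;

    void unmarshal(Element element) {
        unmarshalBar(element);
        unmarshalThing(element);
    }

    void unmarshalBar(Element element) {
        // Verify that a <Bar> child exists before delegating to it
        Element child = (Element) element.getElementsByTagName("Bar").item(0);
        if (child == null) return;
        bar = new Bar();
        bar.unmarshal(child);
    }

    void unmarshalThing(Element element) {
        Element child = (Element) element.getElementsByTagName("Thing").item(0);
        thing = (child == null) ? null : child.getTextContent();
    }
}

// In Bar.java
class Bar {
    String baz;

    void unmarshal(Element element) {
        Element child = (Element) element.getElementsByTagName("Baz").item(0);
        baz = (child == null) ? null : child.getTextContent();
    }
}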

Hope this helps.

#5


1  

I usually put objects on a stack, and push/pop them while parsing the XML file (particularly useful if objects are nested, but that's not your case).

If you want a simpler approach, you need a pointer to the current ConnectionList and one to the current Connection. Since you already know the structure of your file, this could be easier than using a stack-based parser.
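
A minimal sketch of that simpler approach might look like this. The class names CurrentPointerSaxSketch, Handler, ConnectionList and Connection are assumptions for illustration, and the sketch uses qName because the default SAXParserFactory is not namespace-aware.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class CurrentPointerSaxSketch {
    // Hypothetical data classes.
    static class Connection {
        String date, transfers, time;
    }
    static class ConnectionList {
        final List<Connection> connections = new ArrayList<>();
    }

    static class Handler extends DefaultHandler {
        final List<ConnectionList> lists = new ArrayList<>();
        private ConnectionList currentList;    // list currently being filled, or null
        private Connection currentConnection;  // connection currently being filled, or null
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0);
            if (qName.equals("ConnectionList")) {
                currentList = new ConnectionList();
                lists.add(currentList);
            } else if (qName.equals("Connection") && currentList != null) {
                currentConnection = new Connection();
                currentList.connections.add(currentConnection);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("Connection")) {
                currentConnection = null;
            } else if (qName.equals("ConnectionList")) {
                currentList = null;
            } else if (currentConnection != null) {
                String value = text.toString().trim();
                switch (qName) {
                    case "Date":      currentConnection.date = value; break;
                    case "Transfers": currentConnection.transfers = value; break;
                    case "Time":      currentConnection.time = value; break;
                    default:          break; // element we are not interested in
                }
            }
        }
    }

    static List<ConnectionList> parse(String xml) throws Exception {
        Handler handler = new Handler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.lists;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ConRes><ConnectionList><Connection><Overview><Date>20221026</Date>"
                + "<Transfers>1</Transfers><Duration><Time>00d00:18:00</Time></Duration>"
                + "</Overview></Connection></ConnectionList></ConRes>";
        Connection c = parse(xml).get(0).connections.get(0);
        System.out.println(c.date + " " + c.transfers + " " + c.time);
    }
}
```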

#6


1  

SAX parsers are a bit like looking at a large picture through a tiny spy hole.

The callbacks present you with a single piece of the XML structure at a time: an element name, an attribute name/value, or the text content. They won't give you any clue as to where you are in the document.

Your program needs to track where you are in the document. If you are parsing on the fly, a simple stack structure will do: you push the element name onto the stack when you get a "beginelement" and you pop the stack on an "endelement".
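
As a small illustration of tracking the position with a stack of names, the following sketch only accepts a Time element when its parent is Duration, which is the condition from the question. The names PathStackSaxSketch, Handler and durations are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PathStackSaxSketch {
    static class Handler extends DefaultHandler {
        final List<String> durations = new ArrayList<>();
        private final Deque<String> path = new ArrayDeque<>(); // names of the open elements
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            path.push(qName);
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            path.pop(); // path.peek() is now the parent of the element that just ended
            if (qName.equals("Time") && "Duration".equals(path.peek())) {
                durations.add(text.toString().trim());
            }
        }
    }

    static List<String> parse(String xml) throws Exception {
        Handler handler = new Handler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.durations;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<R><Time>ignored</Time><Duration><Time>00d00:18:00</Time></Duration></R>";
        System.out.println(parse(xml)); // prints [00d00:18:00]
    }
}
```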

If you find yourself building a tree structure, I would switch to a DOM parser, as whatever you write will be a pale and buggy shadow of something like Xerces.
