Parsing an XML file using DOM (Java)

Time: 2022-12-01 13:30:10

I want to parse the following url: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=nucleotide&id=224589801

As a result I came up with the following method:

public void parseXml2(String URL) {
    DOMParser parser = new DOMParser();

    try {
        parser.parse(new InputSource(new URL(URL).openStream()));
        Document doc = parser.getDocument();

        NodeList nodeList = doc.getElementsByTagName("Item");
        for (int i = 0; i < nodeList.getLength(); i++) {
            Node n = nodeList.item(i);
            Node actualNode = n.getFirstChild();
            if (actualNode != null) {
                System.out.println(actualNode.getNodeValue());
            }
        }

    } catch (SAXException ex) {
        Logger.getLogger(TaxMapperXml.class.getName()).log(Level.SEVERE, null, ex);
    } catch (IOException ex) {
        Logger.getLogger(TaxMapperXml.class.getName()).log(Level.SEVERE, null, ex);
    }
}

With this method I can read the values of the Item nodes, but I can't read any of their attributes. I tried experimenting with getAttribute() and NamedNodeMap, but to no avail.

  1. Why do I have to call n.getFirstChild().getNodeValue() to get the actual value, when n.getNodeValue() returns just null? Isn't this counter-intuitive? In my case the nodes obviously don't have subnodes.

  2. Is there a more robust and widely accepted way of parsing XML files using DOM? My files aren't going to be big, 15-20 lines at most, so SAX isn't necessary (or is it?).

3 solutions

#1


5  

import java.io.InputStream;
import java.net.URL;

import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XMLParser {

    public static void main(String[] args) {
        parseXml2("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=nucleotide&id=224589801");
    }

    public static void parseXml2(String url) {
        DOMParser parser = new DOMParser();

        // try-with-resources closes the network stream once parsing is done
        try (InputStream in = new URL(url).openStream()) {
            parser.parse(new InputSource(in));
            Document doc = parser.getDocument();

            NodeList nodeList = doc.getElementsByTagName("Item");
            for (int i = 0; i < nodeList.getLength(); i++) {
                System.out.print("Item " + (i + 1));
                Node n = nodeList.item(i);
                // Attributes are not child nodes; they live in a separate NamedNodeMap
                NamedNodeMap m = n.getAttributes();
                System.out.print(" Name: " + m.getNamedItem("Name").getTextContent());
                System.out.print(" Type: " + m.getNamedItem("Type").getTextContent());
                // The element's text is held in its first child, a text node
                Node actualNode = n.getFirstChild();
                if (actualNode != null) {
                    System.out.println(" " + actualNode.getNodeValue());
                } else {
                    System.out.println(" ");
                }
            }

        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

Completed the sample code and added a few lines to get the attributes.

This should get you started, although I feel that you need to get up to date with the basic notions of DOM. This site (and many others) can help you with that. Most important is understanding the different kinds of nodes there are.

#2


6  

  1. A text value surrounded by XML tags is also considered a Node in DOM. That's why you have to get the text Node before getting the value. If you count the child nodes of an <Item>, you will see that wherever there is text, there is a node.

  2. XOM has a more intuitive interface, but it doesn't implement the org.w3c.dom.* interfaces.
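The point in item 1 can be checked directly. The following is a minimal sketch, assuming the JDK's built-in JAXP parser and a hypothetical one-element document (not the real NCBI response); the class name TextNodeDemo and the helper parse are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextNodeDemo {

    // Hypothetical helper: parses an XML string with the JDK's built-in parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample element, shaped like the Item elements in the question.
        Element item = parse("<Item Name=\"Caption\" Type=\"String\">NC_000001</Item>")
                .getDocumentElement();

        // An element node has no value of its own.
        System.out.println(item.getNodeValue());                  // null

        // The text is held in a child node of type TEXT_NODE.
        Node text = item.getFirstChild();
        System.out.println(text.getNodeType() == Node.TEXT_NODE); // true
        System.out.println(text.getNodeValue());                  // NC_000001

        // getTextContent() concatenates the text of all descendants.
        System.out.println(item.getTextContent());                // NC_000001
    }
}
```

For elements that contain only text, getTextContent() is usually the more convenient call, since it also returns an empty string rather than failing when the element has no children.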

If you want to use the built-in parser, you should look at http://www.java-samples.com/showtutorial.php?tutorialid=152

The DOMParser you tried to use is implementation-specific (it is a Xerces class, not part of the standard JAXP API), so it is not portable.
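As a rough sketch of the portable route, the same loop can be written against the standard javax.xml.parsers API. The class name JaxpItemReader, the method describeItems, and the sample XML are assumptions for illustration; the real service response may be structured differently.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class JaxpItemReader {

    // Returns one "Name (Type): text" line per <Item> element in the stream.
    static List<String> describeItems(InputStream in) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(in);
            NodeList items = doc.getElementsByTagName("Item");
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                // getElementsByTagName only returns elements, so the cast is safe.
                Element item = (Element) items.item(i);
                // getAttribute returns "" when the attribute is absent.
                lines.add(item.getAttribute("Name") + " (" + item.getAttribute("Type") + "): "
                        + item.getTextContent());
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input; not the actual response from the URL in the question.
        String xml = "<DocSum>"
                + "<Item Name=\"Caption\" Type=\"String\">NC_000001</Item>"
                + "<Item Name=\"Length\" Type=\"Integer\">1000</Item>"
                + "</DocSum>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        describeItems(in).forEach(System.out::println);
    }
}
```

To read from a URL instead, the stream from new URL(url).openStream() could be passed to describeItems. Casting each node to Element gives direct access to getAttribute, which avoids going through NamedNodeMap.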

#3


1  

Text inside XML elements is held in text nodes because subelements can be mixed with text. For instance:

...
<A>blah<B/>blah</A>
...

Element A has three children: a text node, element B, another text node.

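The three-children claim can be verified with a short sketch like the one below, assuming the JDK's built-in parser; the class name MixedContentDemo and the method childKinds are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MixedContentDemo {

    // Lists the direct children of the root element as "text:..." or "element:...".
    static List<String> childKinds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> kinds = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.TEXT_NODE) {
                    kinds.add("text:" + child.getNodeValue());
                } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                    kinds.add("element:" + child.getNodeName());
                } else {
                    kinds.add("other:" + child.getNodeName());
                }
            }
            return kinds;
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid XML", e);
        }
    }

    public static void main(String[] args) {
        // Prints [text:blah, element:B, text:blah]
        System.out.println(childKinds("<A>blah<B/>blah</A>"));
    }
}
```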
