"Failed to load external entity" error when using Python lxml

Date: 2022-10-10 12:34:32

I'm trying to parse an XML document that I retrieve from the web, but the parse call fails with this error:

': failed to load external entity "<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>

That is the second line in the XML that is downloaded. Is there a way to prevent the parser from trying to load the external entity, or another way to solve this? This is the code I have so far:

import urllib2
import lxml.etree as etree

file = urllib2.urlopen("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
data = file.read()
file.close()

tree = etree.parse(data)

4 Answers

#1


20  

In concert with what mzjn said, if you do want to pass a string to etree.parse(), just wrap it in a StringIO object.

Example:

from lxml import etree
from StringIO import StringIO

myString = "<html><p>blah blah blah</p></html>"

tree = etree.parse(StringIO(myString))

This method is used in the lxml documentation.
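
The example above is Python 2 (the `StringIO` module no longer exists in Python 3). A minimal Python 3 sketch of the same idea follows, assuming the download yields bytes, as `urllib.request.urlopen(...).read()` does. The XML content is a hypothetical stand-in for the real feed, and the import falls back to the standard library's `xml.etree.ElementTree`, which offers the same `parse()` call, only so the snippet runs where lxml is not installed.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib module, which shares parse().
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the downloaded document (not the real feed).
xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?><feed><entry>blah</entry></feed>'

# BytesIO makes the in-memory bytes look like a file object,
# which is one of the source types parse() accepts.
tree = etree.parse(BytesIO(xml_bytes))
root = tree.getroot()
print(root.tag)
```

Wrapping bytes rather than text sidesteps decoding questions: the parser reads the encoding from the XML declaration itself.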

#2


9  

etree.parse(source) expects source to be one of:

  • a file name/path
  • a file object
  • a file-like object
  • a URL using the HTTP or FTP protocol

The problem is that you are supplying the XML content as a string.
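
To see why a string triggers the error, here is a small sketch that reproduces it offline. It assumes Python 3; the XML text is a hypothetical stand-in, and the import falls back to the standard library's `xml.etree.ElementTree` (which treats a string argument the same way) only so the snippet runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib module, which also treats a str as a path.
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the downloaded XML content.
xml_text = "<feed><entry>blah</entry></feed>"

try:
    # parse() interprets a string as a file name or URL, not as XML content,
    # so it tries to open a file literally named "<feed><entry>...".
    etree.parse(xml_text)
    error = None
except OSError as exc:
    error = exc

print(type(error).__name__, error)
```

With lxml, the message names the whole XML text as the "external entity" it could not load, which matches the error in the question.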

You can also do without urllib2.urlopen(). Just use:

tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")

Demonstration (using lxml 2.3.4):

>>> from lxml import etree
>>> tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
>>> tree.getroot()
<Element {http://www.w3.org/2005/Atom}feed at 0xedaa08>
>>>   

In a competing answer, it is suggested that lxml fails because of the stylesheet referenced by the processing instruction in the document. But that is not the problem here. lxml does not try to load the stylesheet, and the XML document is parsed just fine if you do as described above.
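
The claim that the stylesheet is not loaded can be checked offline with a rough sketch like the one below. The document and the stylesheet name `NoSuchStyleSheet.xslt` are hypothetical, chosen so that the referenced file certainly does not exist. The import falls back to the standard library's `xml.etree.ElementTree` only so the snippet runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib module, which shares fromstring().
    import xml.etree.ElementTree as etree

# Hypothetical document whose processing instruction points at a
# stylesheet that does not exist anywhere.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<?xml-stylesheet type="text/xsl" href="NoSuchStyleSheet.xslt"?>\n'
    b'<feed xmlns="http://www.w3.org/2005/Atom"><title>demo</title></feed>'
)

# Parsing succeeds: the processing instruction is only recorded,
# nothing tries to fetch the stylesheet it names.
root = etree.fromstring(xml_bytes)
print(root.tag)
```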

If you want to actually load the stylesheet, you have to be explicit about it. Something like this is needed:

from lxml import etree

tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")

# Create an _XSLTProcessingInstruction object
pi = tree.xpath("//processing-instruction()")[0] 

# Parse the stylesheet and return an ElementTree
xsl = pi.parseXSL()   

#3


1  

The lxml docs for parse() say: "To parse from a string, use the fromstring() function instead."

parse(...)
    parse(source, parser=None, base_url=None)

    Return an ElementTree object loaded with source elements.  If no parser
    is provided as second argument, the default parser is used.

    The ``source`` can be any of the following:

    - a file name/path
    - a file object
    - a file-like object
    - a URL using the HTTP or FTP protocol

    To parse from a string, use the ``fromstring()`` function instead.

    Note that it is generally faster to parse from a file path or URL
    than from an open file object or file-like object.  Transparent
    decompression from gzip compressed sources is supported (unless
    explicitly disabled in libxml2).
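
As a minimal sketch of that advice, assuming Python 3 and a hypothetical stand-in for the downloaded content: `fromstring()` returns the root element rather than a tree, so wrap it in `ElementTree` if the rest of the code expects what `parse()` would have returned. Passing bytes is the safer choice with lxml, which rejects a `str` that carries an encoding declaration. The import falls back to the standard library's `xml.etree.ElementTree` only so the snippet runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib module, which shares these calls.
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the bytes read from the HTTP response.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<feed><entry>one</entry><entry>two</entry></feed>'
)

# fromstring() parses in-memory content and returns the root element.
root = etree.fromstring(xml_bytes)

# Wrap the root if an ElementTree object is needed, as parse() would return.
tree = etree.ElementTree(root)
entries = [entry.text for entry in tree.getroot().findall("entry")]
print(entries)
```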

#4


0  

You're getting that error because the XML you're loading references an external resource:

<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>

lxml doesn't know how to resolve GreenButtonDataStyleSheet.xslt. You and I probably realize that it's going to be available relative to your original URL, http://www.greenbuttondata.org/data/15MinLP_15Days.xml. The trick is to tell lxml how to go about loading it.

The lxml documentation includes a section titled "Document loading and URL resolving", which has just about all the information you need.
