XML parsing in Python using ElementTree

Time: 2023-01-19 21:47:56

I'm very new to Python and I need to parse some dirty XML files that need sanitising first.

I have the following Python code:

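import arff  # third-party; not used in this excerpt
import re
import xml.etree.ElementTree

totstring = ""

# Strip every character that is not on the whitelist, line by line.
with open('input.sgm', 'r') as inF:
    for line in inF:
        string = re.sub(r"[^0-9a-zA-Z<>/\s=!-\"\"]+", "", line)
        # Append inside the loop; otherwise only the last line is kept.
        totstring += string

data = xml.etree.ElementTree.fromstring(totstring)

print(data)

# No explicit close is needed: the with block closes the file.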

which parses:

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> 
&#5;&#5;&#5;C T
&#22;&#22;&#1;f0704&#31;reute
u f BC-BAHIA-COCOA-REVIEW   02-26 0105</UNKNOWN>
<TEXT>&#2;
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE>    SALVADOR, Feb 26 - </DATELINE><BODY>Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,
although normal humidity levels have not been restored,
Comissaria Smith said in its weekly review.
    The dry period means the temporao will be late this year.
    Arrivals for the week ended February 22 were 155,221 bags
of 60 kilos making a cumulative total for the season of 5.93
mln against 5.81 at the same stage last year. Again it seems
that cocoa delivered earlier on consignment was included in the
arrivals figures.
    Comissaria Smith said there is still some doubt as to how
much old crop cocoa is still available as harvesting has
practically come to an end. With total Bahia crop estimates
around 6.4 mln bags and sales standing at almost 6.2 mln there
are a few hundred thousand bags still in the hands of farmers,
middlemen, exporters and processors.
    There are doubts as to how much of this cocoa would be fit
for export as shippers are now experiencing dificulties in
obtaining +Bahia superior+ certificates.
    In view of the lower quality over recent weeks farmers have
sold a good part of their cocoa held on consignment.
    Comissaria Smith said spot bean prices rose to 340 to 350
cruzados per arroba of 15 kilos.
    Bean shippers were reluctant to offer nearby shipment and
only limited sales were booked for March shipment at 1,750 to
1,780 dlrs per tonne to ports to be named.
    New crop sales were also light and all to open ports with
June/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs
under New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs
per tonne FOB.
    Routine sales of butter were made. March/April sold at
4,340, 4,345 and 4,350 dlrs.
    April/May butter went at 2.27 times New York May, June/July
at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at
2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and
2.27 times New York Dec, Comissaria Smith said.
    Destinations were the U.S., Covertible currency areas,
Uruguay and open ports.
    Cake sales were registered at 785 to 995 dlrs for
March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times
New York Dec for Oct/Dec.
    Buyers were the U.S., Argentina, Uruguay and convertible
currency areas.
    Liquor sales were limited with March/April selling at 2,325
and 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New
York July, Aug/Sept at 2,400 dlrs and at 1.25 times New York
Sept and Oct/Dec at 1.25 times New York Dec, Comissaria Smith
said.
    Total Bahia sales are currently estimated at 6.13 mln bags
against the 1986/87 crop and 1.06 mln bags against the 1987/88
crop.
    Final figures for the period to February 28 are expected to
be published by the Brazilian Cocoa Trade Commission after
carnival which ends midday on February 27.
 Reuter
&#3;</BODY></TEXT>
</REUTERS>

How can I now go about getting just the text from inside the body tag?

All the tutorials I have seen read the XML directly from a file, so that ElementTree.parse works. Since I am parsing from a string, that does not work, which rules out most of the tutorials I have read.

Thanks very much

3 answers

#1


5  

If you don't care about the particular structure of a (potentially messy) XML document and just want to quickly get the contents of a given tag/element, you may want to try the BeautifulSoup module.

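# Updated to BeautifulSoup 4 (the bs4 package); the original used BeautifulSoup 3.
from bs4 import BeautifulSoup

soup = BeautifulSoup(totstring, "html.parser")

# html.parser lowercases tag names, so <BODY> is found as "body".
body = soup.find("body")

bodytext = body.text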

#2


4  

Your first clue could be when you get messages like this...

>>> from xml.etree import ElementTree
>>> parse = ElementTree.parse('foo.xml')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 862, in parse
    tree.parse(source, parser)
  File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 586, in parse
    parser.feed(data)
  File "/usr/lib/python2.6/xml/etree/ElementTree.py", line 1245, in feed
    self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: reference to invalid character number: line 11, column 0
>>>

This error comes from invalid characters in the XML source. You need to clean up the invalid characters (see fix_xml.py at bottom of my answer).
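
As a rough sketch of that clean-up step (written for Python 3; the names CHAR_REF, is_valid_xml_char and strip_invalid_char_refs are made up for this illustration and are not part of fix_xml.py), the invalid character references could be dropped from the string before it reaches the parser:

```python
import re

# Matches decimal (&#5;) and hexadecimal (&#x5;) character references.
CHAR_REF = re.compile(r"&#([0-9]+);|&#x([0-9a-fA-F]+);")


def is_valid_xml_char(num):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
            or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF)


def strip_invalid_char_refs(text):
    """Return text with references to invalid code points removed."""
    def replace(m):
        num = int(m.group(1)) if m.group(1) else int(m.group(2), 16)
        # Keep valid references untouched, drop the invalid ones.
        return m.group() if is_valid_xml_char(num) else ""
    return CHAR_REF.sub(replace, text)


dirty = "<UNKNOWN>&#5;&#5;C T &#22;f0704&#31;reute &#65; &#x41;</UNKNOWN>"
print(strip_invalid_char_refs(dirty))
```

Unlike a character whitelist applied to the whole file, this only touches numeric references, so punctuation in the article body is left alone.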

After you have clean XML, it's pretty easy. You should use StringIO to treat strings as files:

>>> from xml.etree import ElementTree
>>> from StringIO import StringIO
>>> text = open('foo.xml', 'r').read()
>>> tree = ElementTree.parse(StringIO(text))
>>> tree.find('//BODY')
<Element BODY at b723cf2c>
>>> tree.find('//BODY').text
'Showers continued throughout the week in\nthe Bahia cocoa zone, alleviating the drought since early\nJanuary and improving prospects for the coming temporao,\nalthough normal humidity levels have not been restored,\nComissaria Smith said in its weekly review.\n    The dry period means the temporao will be late this year.\n    Arrivals for the week ended February 22 were 155,221 bags\nof 60 kilos making a cumulative total for the season of 5.93\nmln against 5.81 at the same stage last year. Again it seems\nthat cocoa delivered earlier on consignment was included in the\narrivals figures.\n    Comissaria Smith said there is still some doubt as to how\nmuch old crop cocoa is still available as harvesting has\npractically come to an end. With total Bahia crop estimates\naround 6.4 mln bags and sales standing at almost 6.2 mln there\nare a few hundred thousand bags still in the hands of farmers,\nmiddlemen, exporters and processors.\n    There are doubts as to how much of this cocoa would be fit\nfor export as shippers are now experiencing dificulties in\nobtaining +Bahia superior+ certificates.\n    In view of the lower quality over recent weeks farmers have\nsold a good part of their cocoa held on consignment.\n    Comissaria Smith said spot bean prices rose to 340 to 350\ncruzados per arroba of 15 kilos.\n    Bean shippers were reluctant to offer nearby shipment and\nonly limited sales were booked for March shipment at 1,750 to\n1,780 dlrs per tonne to ports to be named.\n    New crop sales were also light and all to open ports with\nJune/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs\nunder New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs\nper tonne FOB.\n    Routine sales of butter were made. March/April sold at\n4,340, 4,345 and 4,350 dlrs.\n    April/May butter went at 2.27 times New York May, June/July\nat 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at\n2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and\n2.27 times New York Dec, Comissaria Smith said.\n    Destinations were the U.S., Covertible currency areas,\nUruguay and open ports.\n    Cake sales were registered at 785 to 995 dlrs for\nMarch/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times\nNew York Dec for Oct/Dec.\n    Buyers were the U.S., Argentina, Uruguay and convertible\ncurrency areas.\n    Liquor sales were limited with March/April selling at 2,325\nand 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New\nYork July, Aug/Sept at 2,400 dlrs and at 1.25 times New York\nSept and Oct/Dec at 1.25 times New York Dec, Comissaria Smith\nsaid.\n    Total Bahia sales are currently estimated at 6.13 mln bags\nagainst the 1986/87 crop and 1.06 mln bags against the 1987/88\ncrop.\n    Final figures for the period to February 28 are expected to\nbe published by the Brazilian Cocoa Trade Commission after\ncarnival which ends midday on February 27.\n Reuter\n'
>>>
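
For readers on Python 3, where the StringIO module has moved to io, the same lookup can be sketched without any file-like wrapper, because ElementTree.fromstring accepts a string directly. The clean_xml value below is a shortened stand-in for the cleaned record, not the full document, and the variable names are only illustrative:

```python
import xml.etree.ElementTree as ET

# A shortened, already-cleaned stand-in for the Reuters record.
clean_xml = """<REUTERS TOPICS="YES" NEWID="1">
<TEXT>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<BODY>Showers continued throughout the week in
the Bahia cocoa zone.
 Reuter
</BODY></TEXT>
</REUTERS>"""

# fromstring() parses a str and returns the root Element,
# so no file (and no StringIO wrapper) is needed.
root = ET.fromstring(clean_xml)

# ".//BODY" searches all descendants of the root for a BODY element.
body_text = root.findtext(".//BODY")
print(body_text)
```

Note that fromstring returns an Element rather than an ElementTree, so the search path is written relative to the root (".//BODY") instead of "//BODY".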

I removed the following characters from the XML source to clean it up...

(py26_default)[mpenning@Bucksnort ~]$ python fix_xml.py foo.xml
bar.xml
343 &#5;
347 &#5;
351 &#5;
359 &#22;
364 &#22;
369 &#1;
378 &#31;
444 &#2;
3393 &#3;
(py26_default)[mpenning@Bucksnort ~]$

Keep in mind that there are other ways of doing this. lxml's soupparser (lxml.html.soupparser) also cleans up bad XML. Sample lxml.soupparser usage:

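from io import StringIO

# Assumption: XMLParser and ET in the original snippet referred to lxml.etree.
from lxml import etree
from lxml.html import soupparser

try:
    # recover=True asks lxml to keep going past malformed input.
    parser = etree.XMLParser(ns_clean=True, recover=True)
    tree = etree.parse(StringIO(text), parser)
except UnicodeDecodeError:
    # Fall back to the BeautifulSoup-based parser.
    tree = soupparser.parse(StringIO(text))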

Cleaned up xml

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
C T
f0704reute
u f BC-BAHIA-COCOA-REVIEW   02-26 0105</UNKNOWN>
<TEXT>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE>    SALVADOR, Feb 26 - </DATELINE><BODY>Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,
although normal humidity levels have not been restored,
Comissaria Smith said in its weekly review.
    The dry period means the temporao will be late this year.
    Arrivals for the week ended February 22 were 155,221 bags
of 60 kilos making a cumulative total for the season of 5.93
mln against 5.81 at the same stage last year. Again it seems
that cocoa delivered earlier on consignment was included in the
arrivals figures.
    Comissaria Smith said there is still some doubt as to how
much old crop cocoa is still available as harvesting has
practically come to an end. With total Bahia crop estimates
around 6.4 mln bags and sales standing at almost 6.2 mln there
are a few hundred thousand bags still in the hands of farmers,
middlemen, exporters and processors.
    There are doubts as to how much of this cocoa would be fit
for export as shippers are now experiencing dificulties in
obtaining +Bahia superior+ certificates.
    In view of the lower quality over recent weeks farmers have
sold a good part of their cocoa held on consignment.
    Comissaria Smith said spot bean prices rose to 340 to 350
cruzados per arroba of 15 kilos.
    Bean shippers were reluctant to offer nearby shipment and
only limited sales were booked for March shipment at 1,750 to
1,780 dlrs per tonne to ports to be named.
    New crop sales were also light and all to open ports with
June/July going at 1,850 and 1,880 dlrs and at 35 and 45 dlrs
under New York july, Aug/Sept at 1,870, 1,875 and 1,880 dlrs
per tonne FOB.
    Routine sales of butter were made. March/April sold at
4,340, 4,345 and 4,350 dlrs.
    April/May butter went at 2.27 times New York May, June/July
at 4,400 and 4,415 dlrs, Aug/Sept at 4,351 to 4,450 dlrs and at
2.27 and 2.28 times New York Sept and Oct/Dec at 4,480 dlrs and
2.27 times New York Dec, Comissaria Smith said.
    Destinations were the U.S., Covertible currency areas,
Uruguay and open ports.
    Cake sales were registered at 785 to 995 dlrs for
March/April, 785 dlrs for May, 753 dlrs for Aug and 0.39 times
New York Dec for Oct/Dec.
    Buyers were the U.S., Argentina, Uruguay and convertible
currency areas.
    Liquor sales were limited with March/April selling at 2,325
and 2,380 dlrs, June/July at 2,375 dlrs and at 1.25 times New
York July, Aug/Sept at 2,400 dlrs and at 1.25 times New York
Sept and Oct/Dec at 1.25 times New York Dec, Comissaria Smith
said.
    Total Bahia sales are currently estimated at 6.13 mln bags
against the 1986/87 crop and 1.06 mln bags against the 1987/88
crop.
    Final figures for the period to February 28 are expected to
be published by the Brazilian Cocoa Trade Commission after
carnival which ends midday on February 27.
 Reuter
</BODY></TEXT>
</REUTERS>

John Machin's bad XML finder (fix_xml.py)

As John Machin mentions in this answer, some characters are not valid XML; this is a script he wrote to help find invalid XML characters.

# coding: ascii
# Find numeric character references that refer to Unicode code points
# that are not valid in XML.
# Get byte offsets for seeking etc in undecoded file bytestreams.
# Get unicode offsets for checking against ElementTree error message,
# **IF** your input file is small enough. 

BYTE_OFFSETS = True
import sys, re, codecs
fname = sys.argv[1]
print fname
if BYTE_OFFSETS:
    text = open(fname, "rb").read()
else:
    # Assumes file is encoded in UTF-8.
    text = codecs.open(fname, "rb", "utf8").read()
rx = re.compile("&#([0-9]+);|&#x([0-9a-fA-F]+);")
endpos = len(text)
pos = 0
while pos < endpos:
    m = rx.search(text, pos)
    if not m: break
    mstart, mend = m.span()
    target = m.group(1)
    if target:
        num = int(target)
    else:
        num = int(m.group(2), 16)
    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    if not(num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
    or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF):
        print mstart, m.group()
    pos = mend
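
The script above uses Python 2 print statements. A possible Python 3 take on the same check, wrapped in a function so it can be reused, might look like the sketch below. The function name find_invalid_char_refs and the sample bytes are invented for this example; only the byte-offset mode of the original is reproduced.

```python
import re

# Bytes pattern, so match offsets are byte offsets into the undecoded data.
CHAR_REF = re.compile(rb"&#([0-9]+);|&#x([0-9a-fA-F]+);")


def find_invalid_char_refs(data):
    """Return (byte_offset, reference) pairs for references invalid in XML 1.0."""
    found = []
    for m in CHAR_REF.finditer(data):
        num = int(m.group(1)) if m.group(1) else int(m.group(2), 16)
        # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        if not (num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
                or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF):
            found.append((m.start(), m.group().decode("ascii")))
    return found


sample = b"<TEXT>&#2;\n<TITLE>BAHIA</TITLE>&#65;&#x1F;</TEXT>"
for offset, ref in find_invalid_char_refs(sample):
    print(offset, ref)
```

To run it against a file, the bytes could be read with open(fname, "rb").read() and passed in as data.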

#3


0  

I don't know if this will still help you, but I was running into a similar problem and needed to get my xml data into an element tree, not BeautifulSoup or lxml's soupparser. I didn't want to have to make two passes through my xml file either. So, I found out how to build a custom XMLParser for ElementTree (not cElementTree, though). Using some of Mike's code I created an XMLParser class that can intercept the character data and filter out invalid characters before going through the parser.

Here you go:

import re
import sys
import xml.etree.ElementTree as ET


class MyXMLParser(ET.XMLParser):
    # In Python 3, ET.parse() opens a filename in binary mode and feeds
    # raw bytes, so this is a bytes pattern.
    rx = re.compile(rb"&#([0-9]+);|&#x([0-9a-fA-F]+);")

    @staticmethod
    def _filter(m):
        num = int(m.group(1)) if m.group(1) else int(m.group(2), 16)
        if (num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
                or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF):
            return m.group()  # valid reference, keep it
        # is invalid xml character, cut it out of the stream
        print('removing %s' % m.group().decode('ascii'))
        return b''

    def feed(self, data):
        # Check every reference in the chunk, not only the first one.
        # Caveat: a reference split across two chunks is not caught.
        super().feed(self.rx.sub(self._filter, data))


parser = MyXMLParser(encoding='utf-8')
xml_filename = sys.argv[1]
xml_etree = ET.parse(xml_filename, parser=parser)
