How do I parse a plain-text file that contains occasional XML tags, using Java and SAX?

Time: 2022-10-30 00:27:50

I have a rather large log file from a server which contains plain text. The server logs everything it does, and occasionally it prints XML tags which I am interested in parsing. To give you an example:

-----------log file-------------
bla bla bla random text
<logMessage>test Message</logMessage>
some more random server output
<logMessage>some other message</logMessage>
bla bla bla
end of log file

I just want to extract the data from the <logMessage> tags and ignore the rest. I am using Java and SAX, but the SAX parser expects the content of the file to be strictly XML formatted, and it cannot handle this type of file. Is there a way to tell SAX to overlook the fact that the file is not well-formed XML? What's the alternative? Read the file line by line and look for the tags? :(

2 solutions

#1


1  

For simplicity's sake, I would read the file line by line and look for the <logMessage> and </logMessage> tokens. Note that you can write a generic parser of this kind that takes a delegate parser and feeds it SAX-like events. (This may be useful depending on how much work it would otherwise take to rewrite your parsers, now that your SAX-based solution turns out not to work.)
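
A minimal sketch of the line-by-line token search, assuming each <logMessage> element opens and closes on the same line. The class name LogMessageScanner and the method name extract are made up for illustration. Text is returned raw, so entities such as &amp; are not decoded.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: scans each line for <logMessage>...</logMessage> pairs.
final class LogMessageScanner {
    private static final String OPEN = "<logMessage>";
    private static final String CLOSE = "</logMessage>";

    /** Returns the text of every <logMessage> element that starts and ends on one line. */
    static List<String> extract(BufferedReader reader) {
        List<String> messages = new ArrayList<>();
        // lines() reads lazily, so a large log is never held in memory at once.
        reader.lines().forEach(line -> collect(line, messages));
        return messages;
    }

    private static void collect(String line, List<String> messages) {
        int from = 0;
        while (true) {
            int start = line.indexOf(OPEN, from);
            if (start < 0) {
                return; // no further opening token on this line
            }
            int end = line.indexOf(CLOSE, start + OPEN.length());
            if (end < 0) {
                return; // opening token without a closing one: ignored in this sketch
            }
            messages.add(line.substring(start + OPEN.length(), end));
            from = end + CLOSE.length();
        }
    }

    public static void main(String[] args) {
        String log = "bla bla bla random text\n"
                + "<logMessage>test Message</logMessage>\n"
                + "some more random server output\n"
                + "<logMessage>some other message</logMessage>\n"
                + "bla bla bla\n";
        // prints [test Message, some other message]
        System.out.println(extract(new BufferedReader(new StringReader(log))));
    }
}
```

For a real log file, the reader could come from java.nio.file.Files.newBufferedReader(path) instead of a StringReader.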

EDIT: The delegate approach is also useful if you are interested in more than one kind of element. If these happen to have complex (nested) XML hierarchies, you could even collect all the characters between the opening and closing tokens into a buffer, then feed that buffer to a real SAX parser. This would be overkill in most cases, but if your logs essentially contain XML dumps, it might be more suitable than trying to parse everything yourself.
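
One way the buffering idea could look, as a rough sketch rather than a finished parser. The names EmbeddedXmlFeeder, feed, and Collector are hypothetical, as is the <event> element in the demo. It assumes the opening tag has no attributes and that the element is never nested inside itself. Each buffered fragment is handed to a standard javax.xml.parsers SAX parser together with the delegate handler.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: buffers each <tag>...</tag> fragment and SAX-parses it.
final class EmbeddedXmlFeeder {

    /** Feeds every fragment (it may span several lines) to the delegate handler. */
    static void feed(BufferedReader reader, String tag, DefaultHandler delegate) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        SAXParserFactory factory = SAXParserFactory.newInstance();
        StringBuilder buffer = null; // non-null while inside a fragment
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                int pos = 0;
                while (true) {
                    if (buffer == null) {
                        int start = line.indexOf(open, pos);
                        if (start < 0) break;      // plain text: skip the rest of the line
                        buffer = new StringBuilder();
                        pos = start;
                    }
                    int end = line.indexOf(close, pos);
                    if (end < 0) {                 // fragment continues on the next line
                        buffer.append(line, pos, line.length()).append('\n');
                        break;
                    }
                    buffer.append(line, pos, end + close.length());
                    // A fresh parser per fragment keeps the sketch simple.
                    factory.newSAXParser().parse(
                            new InputSource(new StringReader(buffer.toString())), delegate);
                    buffer = null;
                    pos = end + close.length();
                }
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse embedded XML", e);
        }
    }

    /** Example delegate: records element names and accumulates character data. */
    static final class Collector extends DefaultHandler {
        final List<String> elements = new ArrayList<>();
        final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements.add(qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may arrive in several chunks
        }
    }

    public static void main(String[] args) {
        String log = "noise\n<event>\n  <user>bob</user>\n</event>\nmore noise\n";
        Collector collector = new Collector();
        feed(new BufferedReader(new StringReader(log)), "event", collector);
        System.out.println(collector.elements);
    }
}
```

Because the delegate is an ordinary DefaultHandler, a handler written for the original SAX-based solution could be reused here largely unchanged.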

#2


0  

I don't think straight XML parsing is appropriate for this sort of file. If every XML snippet is contained in a single line (the opening and closing tags are on the same line), then reading the file line by line, checking for the presence of tags, and skipping non-XML lines would be the simplest way to do it. After skipping the non-XML lines, you could pass the remaining stream to a SAX parser, or just use a regexp on a line-by-line basis.
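
The regexp variant might look roughly like this. The class name LogMessageRegex and the pattern are illustrative assumptions, and the sketch only works when the opening and closing tags share a line, as described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical helper: applies a regular expression to each line in turn.
final class LogMessageRegex {
    // Reluctant (.*?) so that two elements on one line are matched separately.
    private static final Pattern LOG_MESSAGE =
            Pattern.compile("<logMessage>(.*?)</logMessage>");

    /** Returns the captured text of every match, in order of appearance. */
    static List<String> extract(Stream<String> lines) {
        List<String> messages = new ArrayList<>();
        lines.forEach(line -> {
            Matcher matcher = LOG_MESSAGE.matcher(line);
            while (matcher.find()) {
                messages.add(matcher.group(1));
            }
        });
        return messages;
    }

    public static void main(String[] args) {
        Stream<String> lines = Stream.of(
                "bla bla bla random text",
                "<logMessage>test Message</logMessage>",
                "some more random server output");
        // prints [test Message]
        System.out.println(extract(lines));
    }
}
```

For a file on disk, java.nio.file.Files.lines(path) supplies a lazily read Stream<String> that can be passed to extract.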

Essentially, the above approach is identical to grepping the file first to leave only the XML tags, then wrapping the result in a root element to make it well-formed XML and parsing it.
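
A sketch of the filter-then-wrap idea, under two assumptions: the filtered lines fit in memory, and each kept line is a complete <logMessage> element. The wrapper element name root and the class name WrappedLogParser are arbitrary choices for illustration. Since a real SAX parser does the work, entities such as &amp; are decoded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: keeps only the XML lines, wraps them in a root, then SAX-parses.
final class WrappedLogParser {

    static List<String> parse(BufferedReader reader) {
        // "Grep" step: keep lines that consist of a single <logMessage> element.
        StringBuilder xml = new StringBuilder("<root>\n");
        reader.lines()
                .map(String::trim)
                .filter(l -> l.startsWith("<logMessage>") && l.endsWith("</logMessage>"))
                .forEach(l -> xml.append(l).append('\n'));
        xml.append("</root>");

        List<String> messages = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null while inside <logMessage>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("logMessage".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("logMessage".equals(qName)) {
                    messages.add(text.toString());
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml.toString())), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("filtered log is not well-formed XML", e);
        }
        return messages;
    }

    public static void main(String[] args) {
        String log = "bla bla bla\n<logMessage>test Message</logMessage>\nend of log file\n";
        // prints [test Message]
        System.out.println(parse(new BufferedReader(new StringReader(log))));
    }
}
```

For logs too large to filter in memory, the filtered lines could be written to a temporary file first and that file parsed instead.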

