Parsing non-XML files in Java

Date: 2022-12-01 13:01:49

I want to parse a document that is not pure XML. For example:

my name is <j> <b> mike</b>  </j>

Example 2:

 my name is  <mytag1 attribute="val" >mike</mytag1> and yours is <mytag2> john</mytag2>

In other words, my input is not pure XML. It is similar to HTML, but the tags are not HTML tags. How can I parse it in Java?

2 answers

#1


5  

Your examples are well-formed XML apart from the lack of a document (root) element. If you know this will always be the case, you could wrap a pair of dummy tags around the whole thing and use a standard parser (SAX, DOM, ...).
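
As a minimal sketch of the wrapping approach, assuming the input really is well-formed apart from the missing root: the class name FragmentParser, the method parseFragment and the wrapper tag name "root" are made up for illustration; the parser itself is the standard JAXP DOM API that ships with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FragmentParser {

    // Wraps the fragment in a dummy root element (the name "root" is arbitrary)
    // so that a standard DOM parser accepts it. Assumes the fragment has no
    // XML declaration of its own and uses no undeclared entities such as &nbsp;.
    static Element parseFragment(String fragment) throws Exception {
        String wrapped = "<root>" + fragment + "</root>";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(wrapped)));
        return doc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element root = parseFragment(
            "my name is <mytag1 attribute=\"val\" >mike</mytag1> and yours is <mytag2> john</mytag2>");

        // Walk the direct children of the dummy root: plain text and custom tags.
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                Element e = (Element) n;
                System.out.println("element " + e.getTagName() + ": " + e.getTextContent().trim());
            } else if (n.getNodeType() == Node.TEXT_NODE) {
                System.out.println("text: " + n.getNodeValue().trim());
            }
        }
    }
}
```

If the input is not well-formed, builder.parse throws a SAXParseException, which is a reasonable signal to fall back to something more lenient.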

On the other hand, if you get something uglier (e.g. tags that don't match up, or tags that overlap), you'll have to write something custom. That means deciding on a number of rules that will be unique to your application, for example: How do I handle an opening tag that has no close? What do I do if the closing tag is outside the parent?
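
To give an idea of what such custom handling might start from, here is a rough sketch that only detects the two problems mentioned above (unclosed and overlapping tags) using a regex and a stack. The class TagBalanceChecker, the method findProblems and the message wording are all hypothetical, and the regex is deliberately simplistic: it ignores comments, CDATA sections and '>' characters inside attribute values. What to do about each problem is still up to your application.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagBalanceChecker {

    // Matches <name ...>, </name> and <name ... />.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([A-Za-z][\\w.-]*)[^<>]*?(/?)>");

    // Returns a human-readable description of every mismatch found.
    static List<String> findProblems(String input) {
        List<String> problems = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>(); // currently open tags, innermost first
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            String name = m.group(2);
            if (selfClosing) {
                continue; // <name/> opens and closes at once
            }
            if (!closing) {
                open.push(name);
            } else if (name.equals(open.peek())) {
                open.pop(); // properly nested close
            } else if (open.contains(name)) {
                // The close skips over inner tags that are still open (overlap).
                while (!open.peek().equals(name)) {
                    problems.add("<" + open.pop() + "> was not closed before </" + name + ">");
                }
                open.pop();
            } else {
                problems.add("closing tag </" + name + "> has no opening tag");
            }
        }
        while (!open.isEmpty()) {
            problems.add("<" + open.pop() + "> was never closed");
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(findProblems("my name is <j> <b> mike</b>  </j>"));
        System.out.println(findProblems("<a><b>overlap</a></b>"));
    }
}
```

An empty list means the tags nest cleanly, in which case the input can go to a standard parser instead.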

#2


0  

There are a few parsers that take malformed HTML and turn it into well-formed XML. Here is a comparison with examples that covers the most popular ones, except perhaps HTMLParser. One of them is probably what you need.
