A question about parsing HTML with regex and Java

Time: 2021-12-11 01:43:16

I have a question about finding HTML tags using Java and regex.

I am using the code below to find all the tags in an HTML document; documentURL is the HTML content.

The find() method returns true, meaning that it can find something in the HTML, but the matches() method always returns false, and I am completely and utterly puzzled by this.

I referred to the Java documentation too but could not find my answer.

What is the correct way of using Matcher?

    Pattern keyLineContents = Pattern.compile("(<.*?>)");
    Matcher keyLineMatcher = keyLineContents.matcher(documentURL);
    boolean result = keyLineMatcher.find();
    boolean matchFound = keyLineMatcher.matches();

Doing something like this throws an exception:

     String abc = keyLineMatcher.group(0);

Thanks.

3 answers

#1


7  

The correct way to loop through matches is:

Pattern p = Pattern.compile("<.*?>");
Matcher m = p.matcher(htmlString);
while (m.find()) {
  System.out.println(m.group());
}
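
To illustrate why find() and matches() disagree in the question, here is a minimal sketch. The class name, method names, and sample HTML string are hypothetical and chosen only for this illustration. find() looks for the next substring that fits the pattern, while matches() succeeds only if the entire input fits it, and calling group() after a failed matches() throws IllegalStateException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only Pattern and Matcher come from the original post.
class FindVersusMatches {
    private static final Pattern TAG = Pattern.compile("<.*?>");

    // Collects every substring that find() locates, in document order.
    static List<String> allTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    // True only if the whole input, from first to last character, fits the pattern.
    static boolean wholeInputIsOneMatch(String html) {
        return TAG.matcher(html).matches();
    }

    // Repeats the question's call sequence: find(), then matches(), then group(0).
    static boolean groupThrowsAfterFailedMatches(String html) {
        Matcher m = TAG.matcher(html);
        m.find();
        m.matches(); // tries to match the entire input; a failure discards the earlier match
        try {
            m.group(0);
            return false;
        } catch (IllegalStateException e) {
            return true; // "No match found"
        }
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p> trailing text"; // sample input, an assumption
        System.out.println(allTags(html));                       // [<p>, <b>, </b>, </p>]
        System.out.println(wholeInputIsOneMatch(html));          // false
        System.out.println(groupThrowsAfterFailedMatches(html)); // true
    }
}
```

In short, group() is only safe to call right after a find(), matches(), or lookingAt() call that returned true.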

That being said, regular expressions are an extremely poor method of parsing HTML. The reason comes down to this: regular expressions work well for parsing regular languages, but HTML is a context-free language. Regular expressions fall down on things like nested tags, > inside attribute values, and so on.

Use a dedicated HTML parser instead, such as HTML Parser.

#2


2  

Why don't you try looking at the source code of some open-source HTML parsers, such as HtmlCleaner, TagSoup, etc.?

The general strategy seems to be to attempt to parse and clean the HTML and return an XML tree.

Personally, I would read through the HTML, adding opening tags to a LIFO queue (a stack) and, when a closing tag is encountered, removing the matching opening tag from the start of the queue, shifting the queue as needed to allow for tag mismatches.
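
The stack idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name, the method name, and the simplified tag-matching regex are all hypothetical, and the tokenizer deliberately ignores comments, void elements such as a bare `<br>`, and > inside attribute values. It reports the tags that were opened but never properly closed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the "LIFO queue" strategy; not taken from any real parser.
class TagStackSketch {
    // Simplified tokenizer (an assumption): group 1 is "/" for closing tags,
    // group 2 is the tag name, group 3 is "/" for self-closing tags.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>");

    // Returns the names of tags that were opened but never matched by a closing tag.
    static List<String> unclosedTags(String html) {
        Deque<String> open = new ArrayDeque<>(); // push/pop work on the head, i.e. LIFO
        List<String> unclosed = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (closing) {
                if (open.contains(name)) {
                    // Pop down to the matching opening tag; anything popped on
                    // the way was left unclosed (the "tag mismatch" case).
                    String top;
                    while (!(top = open.pop()).equals(name)) {
                        unclosed.add(top);
                    }
                }
                // A closing tag with no matching opening tag is simply ignored.
            } else if (!selfClosing) {
                open.push(name);
            }
        }
        // Whatever is still on the stack was never closed.
        while (!open.isEmpty()) {
            unclosed.add(open.pop());
        }
        return unclosed;
    }

    public static void main(String[] args) {
        // Sample input (an assumption): <b> and <span> are never closed.
        System.out.println(unclosedTags("<div><p>one<br/><b>two</p></div><span>")); // [b, span]
    }
}
```

A real parser such as the ones named above handles far more cases; the sketch only shows how the stack absorbs mismatched tags.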

#3


1  

I want to get the keywords content from an HTML tag. I wrote:

Pattern keyLineContents = Pattern.compile("<(.[^<]*)(keywords)(.[^<]*)>");
Matcher keyLineMatcher = keyLineContents.matcher(documentURL);
boolean result = keyLineMatcher.find();
while(result)
{
  String metaTagContent = keyLineMatcher.group(1) + " " + keyLineMatcher.group(3);
  Pattern kcontent = Pattern.compile("(.*?content=\")(.[^<]*?)(\".[^<]*?)");
  Matcher keyLineMatcher2 = kcontent.matcher(metaTagContent);
  boolean result2 = keyLineMatcher.find();
  while (result2)
  {
    String metaTagContent2 = keyLineMatcher.group(1);
    result2 = keyLineMatcher.find();
  }
}

But I don't understand why my result2 is false. The first result is fine and gives all the content of the keywords tag.

thanks
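
One likely cause, visible in the code above: the inner loop calls keyLineMatcher.find() and keyLineMatcher.group(1) rather than keyLineMatcher2, so the second pattern is never actually run, and the outer result is never updated inside its loop. A minimal sketch of the two-matcher structure follows. The class name, method name, and both simplified patterns are assumptions made for this illustration, not the poster's originals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; patterns are simplified stand-ins for the ones in the post.
class KeywordsMetaSketch {
    // Any tag whose text mentions "keywords" (assumed pattern).
    private static final Pattern KEYWORDS_TAG =
            Pattern.compile("<([^<>]*keywords[^<>]*)>");
    // The value of a double-quoted content attribute (assumed pattern).
    private static final Pattern CONTENT =
            Pattern.compile("content=\"([^\"]*)\"");

    static List<String> keywordContents(String html) {
        List<String> contents = new ArrayList<>();
        Matcher tagMatcher = KEYWORDS_TAG.matcher(html);
        // Calling find() in the loop condition advances the outer matcher each pass.
        while (tagMatcher.find()) {
            // The inner loop must use its own matcher, built on the tag text.
            Matcher contentMatcher = CONTENT.matcher(tagMatcher.group(1));
            while (contentMatcher.find()) {
                contents.add(contentMatcher.group(1));
            }
        }
        return contents;
    }

    public static void main(String[] args) {
        // Sample input, an assumption for the demo.
        String html = "<html><head><meta name=\"keywords\" content=\"java, regex\"></head></html>";
        System.out.println(keywordContents(html)); // [java, regex]
    }
}
```

As the first answer notes, a dedicated HTML parser would be a more robust way to read attribute values than either pattern.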
