Regex for parsing an img tag with src, width, and height

Date: 2022-11-27 18:23:27

You may object that parsing HTML with a regex is a thoroughly bad idea, and you would be right.

But in my case, the following HTML node is generated by our own server, so we know it will always look like this. And since the regex will live in a mobile Android library, I don't want to pull in a library like Jsoup.

What I want to parse: <img src="myurl.jpg" width="12" height="32">

What should be parsed:

  • match a regular img tag and capture the src attribute value in a group: <img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>
  • match the width and height attribute values: (width|height)\s*=\s*['"]([^'"]*)['"]*

So the first regex yields the img URL in group 1, and the second regex yields two matches, each with the attribute name and its value in subgroups.
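
For reference, the two-pattern approach described above can be sketched in Java as follows. The class and method names (TwoPassImgParser, parse) are hypothetical and only serve the illustration; the two patterns are the ones from the question, written as Java string literals.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library: runs the two patterns one after the other.
final class TwoPassImgParser {
    // First pattern from the question: group 1 holds the src value.
    private static final Pattern SRC =
            Pattern.compile("<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>");
    // Second pattern from the question: group 1 is the attribute name, group 2 its value.
    private static final Pattern SIZE =
            Pattern.compile("(width|height)\\s*=\\s*['\"]([^'\"]*)['\"]*");

    static Map<String, String> parse(String tag) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher src = SRC.matcher(tag);
        if (src.find()) {
            result.put("src", src.group(1));
        }
        Matcher size = SIZE.matcher(tag);
        while (size.find()) { // one match per width/height attribute
            result.put(size.group(1), size.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("<img src=\"myurl.jpg\" width=\"12\" height=\"32\">"));
    }
}
```

This needs two passes over the tag, which is what the question is trying to avoid.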

How can I merge both?

Desired output:

  • img url
  • width value
  • height value

3 Answers

#1


2  

To match any img tag with src, height and width attributes that can come in any order and that are in fact optional, you can use

"(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3"

See the regex demo and an IDEONE Java demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\"><link src=\"/test/test.css\"/><img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
        // Same pattern as above: group 2 is the attribute name, group 4 its value
        Pattern pattern = Pattern.compile("(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");
        Matcher matcher = pattern.matcher(s);
        while (matcher.find()) {
            if (!matcher.group(1).isEmpty()) { // Group 1 is "<img": a new IMG tag starts here
                System.out.println("\n--- NEW MATCH ---");
            }
            System.out.println(matcher.group(2) + ": " + matcher.group(4));
        }
    }
}

The regex details:

  • (<img\\b|(?!^)\\G) - the initial boundary: either the start of an <img tag or the end of the previous successful match
  • [^>]*? - any attributes we are not interested in (0+ characters other than >, as few as possible, so as to stay inside the tag)
  • \\b(src|width|height)= - a whole word src=, width= or height= (Group 2 holds the attribute name)
  • ([\"']?) - a technical Group 3 that captures the attribute value delimiter, if there is one
  • ([^>]*?) - Group 4 containing the attribute value (0+ characters other than >, as few as possible, up to the first occurrence of the delimiter)
  • \\3 - the attribute value delimiter captured in Group 3 (NOTE: if the delimiter may be empty, add (?=\\s|/?>) at the end of the pattern)
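
The NOTE about empty delimiters can be illustrated with a small sketch. It assumes the lookahead (?=\\s|/?>) is simply appended to the pattern above so that unquoted values end at whitespace or at the end of the tag; the class and method names (UnquotedImgAttributes, attributes) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the pattern from the answer plus the lookahead from the NOTE.
final class UnquotedImgAttributes {
    private static final Pattern ATTR = Pattern.compile(
            "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3(?=\\s|/?>)");

    // Returns "name=value" strings for every src/width/height attribute found in img tags.
    static List<String> attributes(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            found.add(m.group(2) + "=" + m.group(4));
        }
        return found;
    }

    public static void main(String[] args) {
        // Mixed input: two unquoted values and one quoted value.
        System.out.println(attributes("<img src=myurl.jpg width=12 height=\"32\">"));
    }
}
```

Without the lookahead, an empty delimiter lets the lazy Group 4 stop immediately, so unquoted values would come back as empty strings.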

The logic:

  • Match the start of an img tag
  • Then match everything inside it, but capture only the attributes we need
  • Since we get multiple matches rather than multiple groups, we need to detect where each new img tag begins. This is done by checking that the first group is not empty (if (!matcher.group(1).isEmpty()))
  • All that remains is to add a list to store the matches.
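
The last step, keeping the matches in a list, might look like the sketch below. The class and method names (ImgAttributeCollector, collect) are made up for illustration; only the pattern comes from the answer. Each img tag gets its own map, and a new map is started whenever Group 1 is non-empty.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: groups the per-attribute matches by img tag.
final class ImgAttributeCollector {
    private static final Pattern ATTR = Pattern.compile(
            "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");

    static List<Map<String, String>> collect(String html) {
        List<Map<String, String>> images = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            if (!m.group(1).isEmpty()) {
                // Group 1 matched "<img", so a new tag starts here.
                images.add(new LinkedHashMap<>());
            }
            // The first match always starts with "<img", so the list is never empty here.
            images.get(images.size() - 1).put(m.group(2), m.group(4));
        }
        return images;
    }

    public static void main(String[] args) {
        String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\">"
                + "<link src=\"/test/test.css\"/>"
                + "<img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
        for (Map<String, String> img : collect(s)) {
            System.out.println(img);
        }
    }
}
```

A LinkedHashMap is used here only so that attributes print in the order they appear in the tag.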

#2


1  

You may want this:

"(?i)(src|width|height)=\"(.*?)\""


Update:

I misunderstood your question; you need something like:

"(?i)<img\\s+src=\"(.*?)\"\\s+width=\"(.*?)\"\\s+height=\"(.*?)\">"

Regex101 Demo
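
A minimal sketch of how the fixed-order pattern above could be used from Java. The class and method names (FixedOrderImgParser, parse) are hypothetical. Note that this pattern only matches when the attributes appear exactly in the order src, width, height.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper around the fixed-order pattern from this answer.
final class FixedOrderImgParser {
    private static final Pattern IMG = Pattern.compile(
            "(?i)<img\\s+src=\"(.*?)\"\\s+width=\"(.*?)\"\\s+height=\"(.*?)\">");

    // Returns {src, width, height}, or null if the tag does not have this exact shape.
    static String[] parse(String tag) {
        Matcher m = IMG.matcher(tag);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("<img src=\"myurl.jpg\" width=\"12\" height=\"32\">");
        System.out.println(String.join(", ", parts));
    }
}
```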


Update 2

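The regex below is intended to capture the img tag attributes in any order. Note that, as written, alternation has the lowest precedence, so the src branch only matches directly after <img and the height branch requires its closing quote to be immediately followed by >: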

"(?i)(?><img\\s+)src=\"(.*?)\"|width=\"(.*?)\"|height=\"(.*?)\">"  

Regex101 Demo v2

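
As an alternative way to get all three values from a single match regardless of attribute order, one lookahead per attribute can be used. This is a different technique from the alternation above, not the answer author's pattern, and it is only a sketch: the class and method names (AnyOrderImgParser, parse) are hypothetical, values are assumed to be quoted, and all three attributes are assumed to be present.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: one lookahead per attribute, each with its own capture group.
final class AnyOrderImgParser {
    private static final Pattern IMG = Pattern.compile(
            "(?i)<img\\b"
            + "(?=[^>]*?\\ssrc=[\"']([^\"']*)[\"'])"     // group 1: src
            + "(?=[^>]*?\\swidth=[\"']([^\"']*)[\"'])"   // group 2: width
            + "(?=[^>]*?\\sheight=[\"']([^\"']*)[\"'])"  // group 3: height
            + "[^>]*>");

    // Returns {src, width, height}, or null if any of the three attributes is missing.
    static String[] parse(String tag) {
        Matcher m = IMG.matcher(tag);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("<img height=\"32\" src=\"myurl.jpg\" width=\"12\">");
        System.out.println(String.join(", ", parts));
    }
}
```

Each lookahead scans the tag from the same position, so the groups always come back in the order src, width, height no matter how the attributes are arranged in the markup.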

#3


0  

If you want to combine the two, here is the answer:

<img\s+src="([^"]+)"\s+width="([^"]+)"\s+height="([^"]+)"

Sample I tested:

<img src="rakesh.jpg" width="25" height="45">

Try this.
