How to extract data from an HTML page using regular expressions in Java

Time: 2022-10-29 13:28:55

I'm trying to extract data from an HTML page and store it in a String array.

In the HTML page, the values appear like this:

 <tbody>
                      <tr>
                        <td style="width: 14%;">88055</td>
                        <td style="width: 19%;" class="gris">Ville</td>
                        <td style="width: 33%;"><a href="repertoire-des-municipalites/fiche/municipalite/88055/" >Amos</a></td>
                        <td style="width: 34%;"><a href="repertoire-des-municipalites/fiche/mrc/880/" >Abitibi</a></td>
                      </tr>
                      <tr>
                        <td style="width: 14%;">85080</td>
                        <td style="width: 19%;" class="gris">Village</td>
                        <td style="width: 33%;"><a href="repertoire-des-municipalites/fiche/municipalite/85080/" >Angliers</a></td>
                        <td style="width: 34%;"><a href="repertoire-des-municipalites/fiche/mrc/850/" >Témiscamingue</a></td>
                      </tr>
                      <tr>
                        <td style="width: 14%;">87050</td>
                        <td style="width: 19%;" class="gris">Municipalité</td>
                        <td style="width: 33%;"><a href="repertoire-des-municipalites/fiche/municipalite/87050/" >Authier</a></td>
                        <td style="width: 34%;"><a href="repertoire-des-municipalites/fiche/mrc/870/" >Abitibi-Ouest</a></td>
                      </tr>

I need to extract only the link text where the href points to a municipality (municipalite), which means Amos, Angliers, etc., and store the values in an array of strings.

So far I have tried this, and I'm lost:

    public static final String EXPRESSION = ""; // How to write the regex expression?

    String[] data = new String[20];
    URL url = new URL("http://myur.com");
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));

    String ligne;
    while ((ligne = in.readLine()) != null) {
        // What to write here?
    }
    in.close();

P.S.: I'm aware that the best method is to use an HTML parser instead, but I'm required to do it this way.

Much appreciated,

Bass

3 answers

#1


1  

You can use something like this, which hardcodes a match on URLs containing municipalite/ and captures the text between the > and < characters.

This is my data file:

 <tbody>
                      <tr>
                        <td style="width: 14%;">88055</td>
                        <td style="width: 19%;" class="gris">Ville</td>
                        <td style="width: 33%;"><a href="repertoire-des-municipalites/fiche/municipalite/88055/" >Amos</a></td>
                        <td style="width: 34%;"><a href="repertoire-des-municipalites/fiche/mrc/880/" >Abitibi</a></td>
                      </tr>
                      <tr>
                        <td style="width: 14%;">85080</td>
                        <td style="width: 19%;" class="gris">Village</td>
                        <td style="width: 33%;"><a href="repertoire-des-municipalites/fiche/municipalite/85080/" >Angliers</a></td>
                        <td style="width: 34%;"><a href="repertoire-des-municipalites/fiche/mrc/850/" >Témiscamingue</a></td>
                      </tr>
                      <tr>
                        <td style="width: 14%;">87050</td>
                        <td style="width: 19%;" class="gris">Municipalité</td>
                        <td style="width: 33%;"><a href="repertoire-des-municipalites/fiche/municipalite/87050/" >Authier</a></td>
                        <td style="width: 34%;"><a href="repertoire-des-municipalites/fiche/mrc/870/" >Abitibi-Ouest</a></td>
                      </tr>

Here is the Java file:

import java.io.*;
import java.util.regex.*;

class test
{
    public static void main (String[] args) throws IOException
    {
        // Capture the link text of anchors whose href contains "municipalite/"
        Pattern p = Pattern.compile("href\\s*=\\s*(?:\"|').*municipalite/[^>]*>(?:<.*>)*([^<]*)<.*$");
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new FileReader(new File("data"))))
        {
            String line;
            while ((line = in.readLine()) != null)
            {
                Matcher m = p.matcher(line);
                while (m.find())
                    System.out.println(m.group(1));
            }
        }
    }
}

Output:

$ javac test.java 
$ java test 
Amos
Angliers
Authier
$

Regular expression breakdown:

href\\s*=\\s*(?:\"|').*municipalite/[^>]*>(?:<.*>)*([^<]*)<.*$
  1. href\\s*=\\s* matches href followed by zero or more whitespace characters, then =, then zero or more whitespace characters

  2. (?:\"|') is a non-capturing group, written (?:...): it matches a single or double quote but doesn't capture/remember it

  3. .*municipalite/ matches any characters up to and including municipalite/

  4. [^>]*>(?:<.*>)* matches any characters that are not > for the rest of the URL, then matches >, then tries to match zero or more (all optional) opening tags using the non-capturing group (?:<.*>)

  5. ([^<]*) is the group that actually captures your string into group 1

  6. <.*$ matches the rest of the line
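Since the question asks for the names in a String array rather than on the console, the matches can be collected in a list and converted at the end. Here is a minimal sketch, assuming the lines have already been read into a list. The class name CollectNames and the method extract are made up for illustration; the pattern is the one above, unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CollectNames {
    // The pattern from the answer above, unchanged.
    private static final Pattern P = Pattern.compile(
            "href\\s*=\\s*(?:\"|').*municipalite/[^>]*>(?:<.*>)*([^<]*)<.*$");

    // Hypothetical helper: returns group 1 of every matching line as an array.
    static String[] extract(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            Matcher m = P.matcher(line);
            if (m.find()) {
                names.add(m.group(1));
            }
        }
        // Convert once the number of matches is known.
        return names.toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "<td><a href=\"repertoire-des-municipalites/fiche/municipalite/88055/\" >Amos</a></td>",
                "<td><a href=\"repertoire-des-municipalites/fiche/mrc/880/\" >Abitibi</a></td>",
                "<td><a href=\"repertoire-des-municipalites/fiche/municipalite/85080/\" >Angliers</a></td>");
        String[] data = extract(lines);
        System.out.println(String.join(",", data));
    }
}
```

A list is used because the number of matching rows isn't known in advance, unlike the fixed new String[20] in the question.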

#2


1  

I have shown this in Python, but I believe the regex is the same in Java. Use the Java functions to find the matches.

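import re
# html is assumed to hold the page source as a string. The trailing slash in
# "municipalite/" skips the mrc links, whose path also contains "municipalites".
reg = r"<a href=.*?municipalite/.*?>(.+?)</a>"
result = re.findall(reg, html)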
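A Java counterpart might look like the sketch below. It assumes the whole page has been read into one string. The class name FindAllSketch and the method findAll are invented for illustration. It also departs from the Python pattern in two ways: it matches municipalite/ with the trailing slash, because every link on the page contains "repertoire-des-municipalites", and it uses [^>]* instead of .*? so a match cannot run past the end of the opening tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllSketch {
    // Link text of anchors whose href contains "municipalite/".
    private static final Pattern LINK =
            Pattern.compile("<a href=[^>]*?municipalite/[^>]*>(.+?)</a>");

    // Rough equivalent of Python's re.findall: group 1 of every match.
    static List<String> findAll(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String html =
                "<td><a href=\"repertoire-des-municipalites/fiche/municipalite/88055/\" >Amos</a></td>\n"
              + "<td><a href=\"repertoire-des-municipalites/fiche/mrc/880/\" >Abitibi</a></td>\n";
        System.out.println(findAll(html));
    }
}
```

Matcher.find() resumes after the previous match on each call, which is what makes the loop behave like findall.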

#3


1  

Try ".*\\bhref=\"repertoire-des-municipalites/fiche/municipalite/\\d+/\"[^>]*>([^<]*)<.*"

My demo code (below) gives this console output:

Console Output

Amos
Angliers
Authier

Demo Code

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRegex
{
    public static void main(final String[] args)
    {
        final String[] sampleLines = new String[] {
            "  </tr>",
            "    <td style=\"width: 14%;\">88055</td>",
            "    <td style=\"width: 19%;\" class=\"gris\">Ville</td>",
            "    <td style=\"width: 33%;\"><a href=\"repertoire-des-municipalites/fiche/municipalite/88055/\" >Amos</a></td>",
            "    <td style=\"width: 34%;\"><a href=\"repertoire-des-municipalites/fiche/mrc/880/\" >Abitibi</a></td>",
            "  </tr>",
            "  <tr>",
            "    <td style=\"width: 14%;\">85080</td>",
            "    <td style=\"width: 19%;\" class=\"gris\">Village</td>",
            "    <td style=\"width: 33%;\"><a href=\"repertoire-des-municipalites/fiche/municipalite/85080/\" >Angliers</a></td>",
            "    <td style=\"width: 34%;\"><a href=\"repertoire-des-municipalites/fiche/mrc/850/\" >Témiscamingue</a></td>",
            "  </tr>",
            "  <tr>",
            "    <td style=\"width: 14%;\">87050</td>",
            "    <td style=\"width: 19%;\" class=\"gris\">Municipalité</td>",
            "    <td style=\"width: 33%;\"><a href=\"repertoire-des-municipalites/fiche/municipalite/87050/\" >Authier</a></td>",
            "    <td style=\"width: 34%;\"><a href=\"repertoire-des-municipalites/fiche/mrc/870/\" >Abitibi-Ouest</a></td>",
            "  </tr>",
          };


        final Pattern pattern = Pattern.compile(".*\\bhref=\"repertoire-des-municipalites/fiche/municipalite/\\d+/\"[^>]*>([^<]*)<.*");

        for (final String s : sampleLines)
        {
            final Matcher matcher = pattern.matcher(s);

            if (matcher.matches())
            {
                System.out.println(matcher.group(1));
            }
        }
    }
}
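The leading and trailing .* in this pattern are there because Matcher.matches() only succeeds when the pattern consumes the entire input. With Matcher.find(), which searches anywhere in the input, the bare core of the pattern is enough. A small sketch of the difference, where the class name MatchesVsFind and the constant CORE are names chosen for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFind {
    // The pattern above without its leading and trailing ".*".
    static final String CORE =
            "\\bhref=\"repertoire-des-municipalites/fiche/municipalite/\\d+/\"[^>]*>([^<]*)<";

    // matches() must consume the whole line, hence the surrounding ".*".
    static String viaMatches(String line) {
        Matcher m = Pattern.compile(".*" + CORE + ".*").matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    // find() looks for the pattern anywhere in the line.
    static String viaFind(String line) {
        Matcher m = Pattern.compile(CORE).matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "    <td style=\"width: 33%;\"><a href=\"repertoire-des-municipalites/fiche/municipalite/88055/\" >Amos</a></td>";
        System.out.println(viaMatches(line));
        System.out.println(viaFind(line));
    }
}
```

Both variants return the same capture for a matching line and null for an mrc line; which one to use is mostly a matter of taste when the input is processed line by line.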
