在Java中,如何从特定的正则表达式中创建所有可能数字的列表?

时间:2023-02-09 20:24:39

I have a strange problem, at least one I've never come across. I have a precondition where customers have simple regular expressions associated with labels. The labels are all they care about. What I would like to do is create a list of all possible numbers that would match each of these regular expressions. I would have logic that would warn me when the list is beyond a certain threshold.

我有一个奇怪的问题,至少有一个我从没遇到过。我有一个前提,即客户有与标签相关的简单正则表达式。这些标签是他们所关心的。我要做的是创建一个列表,列出所有可能的数字,这些数字将与这些正则表达式匹配。我的逻辑会在列表超过某个阈值时提醒我。

Here is an example of the regular expression: 34.25.14.(227|228|229|230|243|244|245|246)

这里有一个正则表达式的例子:34.25.14。(227|228|229|230|243|244|245|246)

Lets say that these ip(s) are associated with ACME. Behind the scenes when the user selects ACME (in our UI), I'm filling out a filter object that contains all of those possible numbers and submitting them as an OR query to a highly specialized Vertica database.

假设这些ip与ACME相关联。在用户选择ACME(在我们的UI中)时,我将填写一个filter对象,该对象包含所有可能的数字,并将它们作为OR查询提交给高度专门化的Vertica数据库。

I just can't determine an elegant way of creating a list of numbers from said regular expressions.

我无法确定从上述正则表达式创建数字列表的优雅方式。

The others aspect of this, is that the java code within another portion of the product is using those regular expressions to show ACME by using a java Pattern.compile(), which means the customer 'could' create a complex regular expression. I've only seen them, so far, use something simple as shown above.

另一个方面是,产品的另一部分中的java代码使用这些正则表达式,通过使用java Pattern.compile()来显示ACME,这意味着客户“可以”创建一个复杂的正则表达式。到目前为止,我只看到过它们使用简单的东西,如上面所示。

Is there a method that will generate a list based on regular expression?

是否有一种方法可以根据正则表达式生成列表?

Thanks for your time.

谢谢你的时间。

5 个解决方案

#1


6  

Related:

相关:

A library that generates data matching a regular expression (with limitations): http://code.google.com/p/xeger/

生成匹配正则表达式(有限制)的数据的库:http://code.google.com/p/xeger/

Several solutions, such as conversing a regex into a grammar: Using Regex to generate Strings rather than match them

一些解决方案,例如将regex转换为语法:使用正则表达式生成字符串而不是匹配字符串。


EDIT: Actually, you can make it work!!! The only thing to address is to impose some domain-specific constraints to preclude combinatorial explosion like a+.

编辑:实际上,你可以让它工作!!!唯一要解决的问题是强加一些特定于领域的约束来阻止组合爆炸,比如a+。

If you add to the Xeger class something like this:

如果你给Xeger类添加如下内容:

public void enumerate() {
    System.out.println("enumerate: \"" + regex + "\"");
    int level = 0;
    String accumulated = "";
    enumerate(level, accumulated, automaton.getInitialState());
}

private void enumerate(int level, String accumulated, State state) {
    List<Transition> transitions = state.getSortedTransitions(true);
    if (state.isAccept()) {
        System.out.println(accumulated);
        return;
    }
    if (transitions.size() == 0) {
        assert state.isAccept();
        return;
    }
    int nroptions = state.isAccept() ? transitions.size() : transitions.size() - 1;
    for (int option = 0; option <= nroptions; option++) {
        // Moving on to next transition
        Transition transition = transitions.get(option - (state.isAccept() ? 1 : 0));
        for (char choice = transition.getMin(); choice <= transition.getMax(); choice++) {
            enumerate(level + 1, accumulated + choice, transition.getDest());
        }
    }
}

... and something like this to XegerTest:

…像这样对XegerTest说:

@Test
public void enumerateAllVariants() {
    //String regex = "[ab]{4,6}c";
    String regex = "34\\.25\\.14\\.(227|228|229|230|243|244|245|246)";
    Xeger generator = new Xeger(regex);
    generator.enumerate();
}

... you will get this:

…您将得到:

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running nl.flotsam.xeger.XegerTest
enumerate: "34\.25\.14\.(227|228|229|230|243|244|245|246)"
34.25.14.227
34.25.14.228
34.25.14.229
34.25.14.243
34.25.14.244
34.25.14.245
34.25.14.246
34.25.14.230
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.114 sec

... and, guess what. For "[ab]{4,6}c" it correctly produces 112 variants.

…猜猜看。对于“[ab]{4,6}c”,它正确地生成112个变体。

It's really a quick and dirty experiment, but it seems to work ;).

这确实是一项快速而又肮脏的实验,但似乎行得通;

#2


0  

I'd say the technical answer is no, since you can specify in a regex that a character (digit) occurs zero or more times. That "or more" can mean an arbitrarily large number of digits. In practice you might be able to limit the length of the strings and recursively build out a superset of Strings based on the chars you find in the regex and then text them to create your subset list.

我认为技术上的答案是否定的,因为您可以在regex中指定一个字符(数字)出现的次数为零或更多。这个“或更多”可以表示任意多的数字。在实践中,您可能能够限制字符串的长度,并基于在regex中找到的字符递归地构建一组字符串,然后将它们文本化以创建子集列表。

#3


0  

Comment: regular expression is a recursively enumerable language, therefore there exists an algorithm than can enumerate all valid strings.

注释:正则表达式是一种递归可枚举语言,因此存在一种可以枚举所有有效字符串的算法。

Even for a more complex language like Java, there can be an algorithm that enumerates all Java programs; no matter how your particular Java program is written, that algorithm will, in finite time, output a Java program that exactly matches yours.

即使对于像Java这样更复杂的语言,也可以使用一种算法来枚举所有Java程序;无论如何编写特定的Java程序,该算法都会在有限的时间内输出与您的程序完全匹配的Java程序。

But not all languages are enumerable like this; then they are the languages that we know exist but we cannot speak in. They forever elude any primitive intelligence like ours that's but a turing machine.

但并不是所有的语言都是这样列举的;那么它们就是我们知道存在但我们不能说的语言。他们永远逃避不了任何原始的智能,就像我们的,那只是图灵机器。

#4


0  

Brute force, Multithreaded, CPU load 100%:

蛮力,多线程,CPU负载100%:

import java.util.regex.Pattern;

public class AllIP
{
   public static void main( String[] args )
   {
      final Pattern p = Pattern.compile( "34.25.14.(227|228|229|230|243|244|245|246)" );

      int step = 256 / Runtime.getRuntime().availableProcessors();
      for( int range = 0; range < 256; range += step )
      {
         final int from = range;
         final int to   = range + step;
         new Thread(){
            public @Override void run(){
               long atStart = System.currentTimeMillis();
               for( int i = from; i < to ; ++i )
                  for( int j = 1; j < 255; ++j )
                     for( int k = 1; k < 255; ++k )
                        for( int l = 1; l < 255; ++l )
                        {
                           String ip = String.format( "%d.%d.%d.%d", i,j,k,l );
                           if( p.matcher( ip ).matches())
                           {
                              System.out.println( ip );
                           }
                        }
               System.out.println( System.currentTimeMillis() - atStart );
            }
         }.start();
      }
   }
}

Just for fun...

只是为了好玩……

  • 34.25.14.227
  • 34.25.14.227
  • 34.25.14.228
  • 34.25.14.228
  • 34.25.14.229
  • 34.25.14.229
  • 34.25.14.230
  • 34.25.14.230
  • 34.25.14.243
  • 34.25.14.243
  • 34.25.14.244
  • 34.25.14.244
  • 34.25.14.245
  • 34.25.14.245
  • 34.25.14.246
  • 34.25.14.246

elapsed time: 01:18:59

运行时间:01:18:59

Operating System

操作系统

  • Microsoft Windows XP Professionnel 32-bit SP3
  • Microsoft Windows XP professional 32位SP3

CPU

CPU

  • Intel Xeon W3565 @ 3.20GHz 39 °C

    Intel Xeon W3565 @ 3.20 ghz 39°C

  • Bloomfield 45nm Technology

    布卢姆菲尔德45 nm制程技术

RAM

内存

  • 4,00 Go Dual-Channel DDR3 @ 532MHz (7-7-7-20)
  • 400 Go双通道DDR3 @ 532MHz (7-7-7-7-7-7 -20)

Motherboard

主板

  • LENOVO Lenovo (1366-pin LGA)
  • 联想联想(1366 -销)

#5


0  

This problem is amazing!

这个问题是惊人的!

I solved it with a simple JavaCC grammar and 3 classes : Constant, Variable and a Main. The expression I used as input is (34|45).2\d.14.(227|228|229|230|243|244|245|246), note the \d to specify variable part.

我用一个简单的JavaCC语法和三个类解决了这个问题:常量、变量和Main。我用来输入的表达式是(34|45).2\d.14.(227|228|229|230|243| 245|246),请注意,\d指定可变部分。

Here is the JavaCC grammar:

下面是JavaCC语法:

options
{
    static         = false;
    FORCE_LA_CHECK = true;
    LOOKAHEAD      = 5;
}

PARSER_BEGIN( IPGeneratorFromRegExp )
package hpms.study.jj;

public class IPGeneratorFromRegExp
{
    public final java.util.SortedSet< hpms.study.IPPart > _a = new java.util.TreeSet< hpms.study.IPPart >();
    public final java.util.SortedSet< hpms.study.IPPart > _b = new java.util.TreeSet< hpms.study.IPPart >();
    public final java.util.SortedSet< hpms.study.IPPart > _c = new java.util.TreeSet< hpms.study.IPPart >();
    public final java.util.SortedSet< hpms.study.IPPart > _d = new java.util.TreeSet< hpms.study.IPPart >();

}// class IPGeneratorFromRegExp

PARSER_END( IPGeneratorFromRegExp )

SKIP :
{
  " "
| "\r"
| "\t"
| "\n"
}

TOKEN :
{
    < IPPARTX   :                          (["1"-"9"] | "\\d" ) > 
|   < IPPARTXX  :     (["1"-"9"] | "\\d" ) (["0"-"9"] | "\\d" ) > 
|   < IPPART1XX : "1" (["0"-"9"] | "\\d" ) (["0"-"9"] | "\\d" ) >
|   < IPPART2XX : "2" (["0"-"4"] | "\\d" ) (["0"-"9"] | "\\d" ) >
|   < IPPART25X : "2" "5"                  (["0"-"5"] | "\\d" ) >
}

void part( java.util.SortedSet< hpms.study.IPPart > parts ):{ Token token = null; }
{
    (
        token=<IPPART25X>
    |   token=<IPPART2XX>
    |   token=<IPPART1XX>
    |   token=<IPPARTXX>
    |   token=<IPPARTX>
    ){
        parts.add(
            token.image.contains( "\\d" )
                ? new hpms.study.Variable( token.image )
                : new hpms.study.Constant( token.image ));
    }
}

void expr( java.util.SortedSet< hpms.study.IPPart > parts ):{}
{
    "(" part( parts ) ( "|" part( parts ))+ ")"
}

void analyze() :{}
{
    ( part( _a ) | expr( _a )) "."
    ( part( _b ) | expr( _b )) "."
    ( part( _c ) | expr( _c )) "."
    ( part( _d ) | expr( _d ))
}

And a fragment of Main.java:

和Main.java的片段:

     Reader                source      = new StringReader( _regExp.getText());
     IPGeneratorFromRegExp ipGenerator = new IPGeneratorFromRegExp( source );
     ipGenerator.analyze();

     SortedSet< String > a = new TreeSet<>();
     SortedSet< String > b = new TreeSet<>();
     SortedSet< String > c = new TreeSet<>();
     SortedSet< String > d = new TreeSet<>();
     for( IPPart<?> part : ipGenerator._a ) part.expandTo( a );
     for( IPPart<?> part : ipGenerator._b ) part.expandTo( b );
     for( IPPart<?> part : ipGenerator._c ) part.expandTo( c );
     for( IPPart<?> part : ipGenerator._d ) part.expandTo( d );

     _result.clear();
     for( String ac : a )
     {
        for( String bc : b )
        {
           for( String cc : c )
           {
              for( String dc : d )
              {
                 _result.add( ac + '.' + bc + '.' + cc + '.'  + dc );
              }// for
           }// for
        }// for
     }// for

And a snapshot of the resulting Swing application (very responsive):

以及由此产生的Swing应用程序的快照(响应非常快):

在Java中,如何从特定的正则表达式中创建所有可能数字的列表?

The whole Eclipse project may download from lfinance.fr.

整个Eclipse项目可以从lfinance.fr下载。

#1


6  

Related:

相关:

A library that generates data matching a regular expression (with limitations): http://code.google.com/p/xeger/

生成匹配正则表达式(有限制)的数据的库:http://code.google.com/p/xeger/

Several solutions, such as conversing a regex into a grammar: Using Regex to generate Strings rather than match them

一些解决方案,例如将regex转换为语法:使用正则表达式生成字符串而不是匹配字符串。


EDIT: Actually, you can make it work!!! The only thing to address is to impose some domain-specific constraints to preclude combinatorial explosion like a+.

编辑:实际上,你可以让它工作!!!唯一要解决的问题是强加一些特定于领域的约束来阻止组合爆炸,比如a+。

If you add to the Xeger class something like this:

如果你给Xeger类添加如下内容:

public void enumerate() {
    System.out.println("enumerate: \"" + regex + "\"");
    int level = 0;
    String accumulated = "";
    enumerate(level, accumulated, automaton.getInitialState());
}

private void enumerate(int level, String accumulated, State state) {
    List<Transition> transitions = state.getSortedTransitions(true);
    if (state.isAccept()) {
        System.out.println(accumulated);
        return;
    }
    if (transitions.size() == 0) {
        assert state.isAccept();
        return;
    }
    int nroptions = state.isAccept() ? transitions.size() : transitions.size() - 1;
    for (int option = 0; option <= nroptions; option++) {
        // Moving on to next transition
        Transition transition = transitions.get(option - (state.isAccept() ? 1 : 0));
        for (char choice = transition.getMin(); choice <= transition.getMax(); choice++) {
            enumerate(level + 1, accumulated + choice, transition.getDest());
        }
    }
}

... and something like this to XegerTest:

…像这样对XegerTest说:

@Test
public void enumerateAllVariants() {
    //String regex = "[ab]{4,6}c";
    String regex = "34\\.25\\.14\\.(227|228|229|230|243|244|245|246)";
    Xeger generator = new Xeger(regex);
    generator.enumerate();
}

... you will get this:

…您将得到:

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running nl.flotsam.xeger.XegerTest
enumerate: "34\.25\.14\.(227|228|229|230|243|244|245|246)"
34.25.14.227
34.25.14.228
34.25.14.229
34.25.14.243
34.25.14.244
34.25.14.245
34.25.14.246
34.25.14.230
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.114 sec

... and, guess what. For "[ab]{4,6}c" it correctly produces 112 variants.

…猜猜看。对于“[ab]{4,6}c”,它正确地生成112个变体。

It's really a quick and dirty experiment, but it seems to work ;).

这确实是一项快速而又肮脏的实验,但似乎行得通;

#2


0  

I'd say the technical answer is no, since you can specify in a regex that a character (digit) occurs zero or more times. That "or more" can mean an arbitrarily large number of digits. In practice you might be able to limit the length of the strings and recursively build out a superset of Strings based on the chars you find in the regex and then text them to create your subset list.

我认为技术上的答案是否定的,因为您可以在regex中指定一个字符(数字)出现的次数为零或更多。这个“或更多”可以表示任意多的数字。在实践中,您可能能够限制字符串的长度,并基于在regex中找到的字符递归地构建一组字符串,然后将它们文本化以创建子集列表。

#3


0  

Comment: regular expression is a recursively enumerable language, therefore there exists an algorithm than can enumerate all valid strings.

注释:正则表达式是一种递归可枚举语言,因此存在一种可以枚举所有有效字符串的算法。

Even for a more complex language like Java, there can be an algorithm that enumerates all Java programs; no matter how your particular Java program is written, that algorithm will, in finite time, output a Java program that exactly matches yours.

即使对于像Java这样更复杂的语言,也可以使用一种算法来枚举所有Java程序;无论如何编写特定的Java程序,该算法都会在有限的时间内输出与您的程序完全匹配的Java程序。

But not all languages are enumerable like this; then they are the languages that we know exist but we cannot speak in. They forever elude any primitive intelligence like ours that's but a turing machine.

但并不是所有的语言都是这样列举的;那么它们就是我们知道存在但我们不能说的语言。他们永远逃避不了任何原始的智能,就像我们的,那只是图灵机器。

#4


0  

Brute force, Multithreaded, CPU load 100%:

蛮力,多线程,CPU负载100%:

import java.util.regex.Pattern;

public class AllIP
{
   public static void main( String[] args )
   {
      final Pattern p = Pattern.compile( "34.25.14.(227|228|229|230|243|244|245|246)" );

      int step = 256 / Runtime.getRuntime().availableProcessors();
      for( int range = 0; range < 256; range += step )
      {
         final int from = range;
         final int to   = range + step;
         new Thread(){
            public @Override void run(){
               long atStart = System.currentTimeMillis();
               for( int i = from; i < to ; ++i )
                  for( int j = 1; j < 255; ++j )
                     for( int k = 1; k < 255; ++k )
                        for( int l = 1; l < 255; ++l )
                        {
                           String ip = String.format( "%d.%d.%d.%d", i,j,k,l );
                           if( p.matcher( ip ).matches())
                           {
                              System.out.println( ip );
                           }
                        }
               System.out.println( System.currentTimeMillis() - atStart );
            }
         }.start();
      }
   }
}

Just for fun...

只是为了好玩……

  • 34.25.14.227
  • 34.25.14.227
  • 34.25.14.228
  • 34.25.14.228
  • 34.25.14.229
  • 34.25.14.229
  • 34.25.14.230
  • 34.25.14.230
  • 34.25.14.243
  • 34.25.14.243
  • 34.25.14.244
  • 34.25.14.244
  • 34.25.14.245
  • 34.25.14.245
  • 34.25.14.246
  • 34.25.14.246

elapsed time: 01:18:59

运行时间:01:18:59

Operating System

操作系统

  • Microsoft Windows XP Professionnel 32-bit SP3
  • Microsoft Windows XP professional 32位SP3

CPU

CPU

  • Intel Xeon W3565 @ 3.20GHz 39 °C

    Intel Xeon W3565 @ 3.20 ghz 39°C

  • Bloomfield 45nm Technology

    布卢姆菲尔德45 nm制程技术

RAM

内存

  • 4,00 Go Dual-Channel DDR3 @ 532MHz (7-7-7-20)
  • 400 Go双通道DDR3 @ 532MHz (7-7-7-7-7-7 -20)

Motherboard

主板

  • LENOVO Lenovo (1366-pin LGA)
  • 联想联想(1366 -销)

#5


0  

This problem is amazing!

这个问题是惊人的!

I solved it with a simple JavaCC grammar and 3 classes : Constant, Variable and a Main. The expression I used as input is (34|45).2\d.14.(227|228|229|230|243|244|245|246), note the \d to specify variable part.

我用一个简单的JavaCC语法和三个类解决了这个问题:常量、变量和Main。我用来输入的表达式是(34|45).2\d.14.(227|228|229|230|243| 245|246),请注意,\d指定可变部分。

Here is the JavaCC grammar:

下面是JavaCC语法:

options
{
    static         = false;
    FORCE_LA_CHECK = true;
    LOOKAHEAD      = 5;
}

PARSER_BEGIN( IPGeneratorFromRegExp )
package hpms.study.jj;

public class IPGeneratorFromRegExp
{
    public final java.util.SortedSet< hpms.study.IPPart > _a = new java.util.TreeSet< hpms.study.IPPart >();
    public final java.util.SortedSet< hpms.study.IPPart > _b = new java.util.TreeSet< hpms.study.IPPart >();
    public final java.util.SortedSet< hpms.study.IPPart > _c = new java.util.TreeSet< hpms.study.IPPart >();
    public final java.util.SortedSet< hpms.study.IPPart > _d = new java.util.TreeSet< hpms.study.IPPart >();

}// class IPGeneratorFromRegExp

PARSER_END( IPGeneratorFromRegExp )

SKIP :
{
  " "
| "\r"
| "\t"
| "\n"
}

TOKEN :
{
    < IPPARTX   :                          (["1"-"9"] | "\\d" ) > 
|   < IPPARTXX  :     (["1"-"9"] | "\\d" ) (["0"-"9"] | "\\d" ) > 
|   < IPPART1XX : "1" (["0"-"9"] | "\\d" ) (["0"-"9"] | "\\d" ) >
|   < IPPART2XX : "2" (["0"-"4"] | "\\d" ) (["0"-"9"] | "\\d" ) >
|   < IPPART25X : "2" "5"                  (["0"-"5"] | "\\d" ) >
}

void part( java.util.SortedSet< hpms.study.IPPart > parts ):{ Token token = null; }
{
    (
        token=<IPPART25X>
    |   token=<IPPART2XX>
    |   token=<IPPART1XX>
    |   token=<IPPARTXX>
    |   token=<IPPARTX>
    ){
        parts.add(
            token.image.contains( "\\d" )
                ? new hpms.study.Variable( token.image )
                : new hpms.study.Constant( token.image ));
    }
}

void expr( java.util.SortedSet< hpms.study.IPPart > parts ):{}
{
    "(" part( parts ) ( "|" part( parts ))+ ")"
}

void analyze() :{}
{
    ( part( _a ) | expr( _a )) "."
    ( part( _b ) | expr( _b )) "."
    ( part( _c ) | expr( _c )) "."
    ( part( _d ) | expr( _d ))
}

And a fragment of Main.java:

和Main.java的片段:

     Reader                source      = new StringReader( _regExp.getText());
     IPGeneratorFromRegExp ipGenerator = new IPGeneratorFromRegExp( source );
     ipGenerator.analyze();

     SortedSet< String > a = new TreeSet<>();
     SortedSet< String > b = new TreeSet<>();
     SortedSet< String > c = new TreeSet<>();
     SortedSet< String > d = new TreeSet<>();
     for( IPPart<?> part : ipGenerator._a ) part.expandTo( a );
     for( IPPart<?> part : ipGenerator._b ) part.expandTo( b );
     for( IPPart<?> part : ipGenerator._c ) part.expandTo( c );
     for( IPPart<?> part : ipGenerator._d ) part.expandTo( d );

     _result.clear();
     for( String ac : a )
     {
        for( String bc : b )
        {
           for( String cc : c )
           {
              for( String dc : d )
              {
                 _result.add( ac + '.' + bc + '.' + cc + '.'  + dc );
              }// for
           }// for
        }// for
     }// for

And a snapshot of the resulting Swing application (very responsive):

以及由此产生的Swing应用程序的快照(响应非常快):

在Java中,如何从特定的正则表达式中创建所有可能数字的列表?

The whole Eclipse project may download from lfinance.fr.

整个Eclipse项目可以从lfinance.fr下载。