Is it possible to test a regular expression to see whether it reduces to .*

Time: 2022-09-01 22:55:52

I'm developing an application where users enter a regular expression as a filter criterion. However, I do not want people to be (easily) able to enter .* (i.e. match anything). The problem is that if I just use if (expression.equals(".*")), this could easily be sidestepped by entering something such as .*.*.

Does anyone know of a test that could take a piece of regex and see if it is essentially .* but in a slightly more elaborate form?

My thoughts are:

  1. I could see if the expression is one or more repetitions of .* (i.e. if it matches (\.\*)+; the quoting/escaping may not be entirely accurate, but you get the idea). The problem with this is that there may be other ways of writing a match-everything pattern (e.g. with $ and ^) that are too numerous to even think of up front, let alone test.

  2. I could test a few randomly generated Strings with it and assume that if they all pass, the user has entered a globally matching pattern. The problem with this approach is that there could be situations where the expression is sufficiently tight and I just pick bad strings to match against.
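To make idea 1 concrete, here is a minimal Java sketch of the purely syntactic check and of the gap it leaves. The class and method names are hypothetical, made up for illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating idea 1: a purely textual check on the user's input.
final class NaiveDotStarCheck {
    // Matches the literal text ".*" repeated one or more times.
    private static final Pattern REPEATED_DOT_STAR = Pattern.compile("(\\.\\*)+");

    static boolean isRepeatedDotStar(String userRegex) {
        return REPEATED_DOT_STAR.matcher(userRegex).matches();
    }

    public static void main(String[] args) {
        // Catches the obvious spellings...
        System.out.println(isRepeatedDotStar(".*"));
        System.out.println(isRepeatedDotStar(".*.*"));
        // ...but not other ways of writing "match anything".
        System.out.println(isRepeatedDotStar("^.*$"));
        System.out.println(isRepeatedDotStar("[\\s\\S]*"));
    }
}
```

The check looks only at how the pattern is spelled, not at what it matches, which is why anchored or class-based variants slip through.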

Thoughts, anyone?

(FYI, the application is in Java but I guess this is more of an algorithmic question than one for a particular language.)

3 solutions

#1


8  

Yes, there is a way. It involves converting the regex to a canonical FSM representation. See http://en.wikipedia.org/wiki/Regular_expression#Deciding_equivalence_of_regular_expressions

You can likely find published code that does the work for you. If not, the detailed steps are described here: http://swtch.com/~rsc/regexp/regexp1.html
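As a rough illustration of the automaton idea, the sketch below decides whether an expression accepts every string over a small alphabet. It is not the construction from either link: instead of building and minimising a DFA up front, it explores the DFA states lazily using Brzozowski derivatives, which is a different route to the same kind of state machine. Everything here is an assumption made for illustration: the `Re` class is a hypothetical mini AST (not java.util.regex syntax), the alphabet is a toy one, and the alphabet must not contain `(`, `)` or `,` because those are used in the canonical keys.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical mini regex AST; this is NOT java.util.regex syntax.
final class Re {
    final char kind;      // '0' nothing, 'e' empty string, 'c' char set, '|' alt, '.' concat, '*' star
    final String chars;   // for kind 'c': the accepted characters
    final List<Re> kids;  // child expressions
    final String key;     // canonical text; nodes with equal keys are treated as the same state

    static final Re NONE = new Re('0', "", Collections.<Re>emptyList());
    static final Re EPS = new Re('e', "", Collections.<Re>emptyList());

    private Re(char kind, String chars, List<Re> kids) {
        this.kind = kind; this.chars = chars; this.kids = kids;
        StringBuilder sb = new StringBuilder().append(kind).append(chars).append('(');
        for (Re k : kids) sb.append(k.key).append(',');
        this.key = sb.append(')').toString();
    }

    static Re chars(String s) { return new Re('c', s, Collections.<Re>emptyList()); }
    static Re star(Re r) { return r.kind == '*' ? r : new Re('*', "", Collections.singletonList(r)); }
    static Re cat(Re a, Re b) {
        if (a.kind == '0' || b.kind == '0') return NONE;
        if (a.kind == 'e') return b;
        if (b.kind == 'e') return a;
        return new Re('.', "", Arrays.asList(a, b));
    }
    static Re alt(Re a, Re b) {
        // Flatten, sort and deduplicate so that alternation is order-insensitive.
        TreeMap<String, Re> set = new TreeMap<>();
        for (Re r : Arrays.asList(a, b)) {
            if (r.kind == '|') for (Re k : r.kids) set.put(k.key, k);
            else if (r.kind != '0') set.put(r.key, r);
        }
        if (set.isEmpty()) return NONE;
        if (set.size() == 1) return set.firstEntry().getValue();
        return new Re('|', "", new ArrayList<>(set.values()));
    }

    // True if this expression accepts the empty string.
    boolean nullable() {
        switch (kind) {
            case 'e': case '*': return true;
            case '.': return kids.get(0).nullable() && kids.get(1).nullable();
            case '|': for (Re k : kids) if (k.nullable()) return true; return false;
            default: return false;
        }
    }

    // Derivative: the expression matching what may follow after consuming c.
    Re deriv(char c) {
        switch (kind) {
            case 'c': return chars.indexOf(c) >= 0 ? EPS : NONE;
            case '*': return cat(kids.get(0).deriv(c), this);
            case '.': {
                Re first = cat(kids.get(0).deriv(c), kids.get(1));
                return kids.get(0).nullable() ? alt(first, kids.get(1).deriv(c)) : first;
            }
            case '|': {
                Re out = NONE;
                for (Re k : kids) out = alt(out, k.deriv(c));
                return out;
            }
            default: return NONE; // NONE and EPS accept nothing after a character
        }
    }

    // Accepts every string over the alphabet iff every reachable state is accepting.
    static boolean matchesEverything(Re start, String alphabet) {
        Set<String> seen = new HashSet<>();
        Deque<Re> todo = new ArrayDeque<>();
        todo.push(start);
        seen.add(start.key);
        while (!todo.isEmpty()) {
            Re state = todo.pop();
            if (!state.nullable()) return false; // some string ends in a rejecting state
            for (char c : alphabet.toCharArray()) {
                Re next = state.deriv(c);
                if (seen.add(next.key)) todo.push(next);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String sigma = "abc";     // toy alphabet (assumption)
        Re dot = chars(sigma);    // plays the role of '.'
        System.out.println(matchesEverything(star(dot), sigma));                           // .*
        System.out.println(matchesEverything(cat(star(dot), star(dot)), sigma));           // .*.*
        System.out.println(matchesEverything(star(alt(chars("a"), chars("bc"))), sigma));  // (a|[^a])*
        System.out.println(matchesEverything(cat(dot, star(dot)), sigma));                 // .+
    }
}
```

Applying this to real user input would still need a parser from Java regex syntax into such an AST, and features like backreferences and lookaround fall outside regular languages, so treat this as a sketch of the principle rather than a drop-in check.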

If that seems like too much work, then you can use a quick and dirty probabilistic test. Just generate some random strings and see if they match the user's regex. If they all match, you have a pretty good indication that the regex is overly broad.
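A minimal sketch of that probabilistic test in Java might look like the following. The class name, method names, sample count and string length are all assumptions, and it assumes the filter requires a full match (`matches()`); if the application uses `find()` instead, the probe should too.

```java
import java.util.Random;
import java.util.regex.Pattern;

// Hypothetical helper: probes a user-supplied regex with random strings.
final class BroadRegexProbe {
    // True if every one of `samples` random strings fully matches the pattern.
    // Throws PatternSyntaxException if the regex itself is invalid.
    static boolean looksTooBroad(String regex, int samples, long seed) {
        Pattern p = Pattern.compile(regex);
        Random rnd = new Random(seed);
        for (int i = 0; i < samples; i++) {
            if (!p.matcher(randomString(rnd)).matches()) {
                return false; // one rejection is enough to show the pattern is selective
            }
        }
        return true;
    }

    // Random printable-ASCII string of length 0..19. Line terminators are left out
    // on purpose, because '.' does not match them by default in Java.
    static String randomString(Random rnd) {
        int len = rnd.nextInt(20);
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append((char) (' ' + rnd.nextInt(95))); // 0x20..0x7E
        }
        return sb.toString();
    }
}
```

A call such as `BroadRegexProbe.looksTooBroad(userRegex, 200, 42L)` returning true does not prove the pattern matches everything; it only flags the pattern as a candidate for closer inspection.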

#2


1  

There are many, many ways to write something equivalent to .*. For example, put any character class and its complement into a class or an alternation, and it will match anything.
So I don't think it's possible to use a regular expression to test another regular expression for equivalence to .*.

These are some examples that match the same as .* (they additionally match newline characters):

/[\s\S]*/
/(\w|\W)*/
/(a|[^a])*/
/(a|b|[^ab])*/
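The newline difference is easy to see in Java, where `.` skips line terminators unless `Pattern.DOTALL` is set. The sketch below (class and method names are hypothetical) runs each pattern above against a two-line string:

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing '.*' with the "match anything" variants.
final class DotVersusEverything {
    // Full-match helper; `flags` is passed straight to Pattern.compile.
    static boolean fullMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoLines = "first\nsecond";
        String[] patterns = {".*", "[\\s\\S]*", "(\\w|\\W)*", "(a|[^a])*", "(a|b|[^ab])*"};
        for (String p : patterns) {
            System.out.println(p + " -> " + fullMatch(p, 0, twoLines));
        }
        System.out.println(".* with DOTALL -> " + fullMatch(".*", Pattern.DOTALL, twoLines));
    }
}
```

Plain `.*` fails on the two-line input while the other four patterns and the DOTALL variant succeed, so any detection scheme needs to decide up front whether line terminators count.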

So I assume your idea 2 would be a lot easier to achieve.

#3


0  

Thanks everyone,

I did miss the entry on testing for equivalence on Wikipedia, which was interesting.

My memories of DFAs (I seem to recall having to prove, or at least demonstrate, in an exam in 2nd year CompSci that a regex cannot test for palindromes) are probably best left to rest at the moment!

I am going to go down the approach of generating a set of strings to test. If they all pass, then I am fairly confident that the filter is too broad and needs to be inspected manually. Meanwhile, at least one failure indicates that the expression is more likely to be fit for purpose.

Now to decide what type of strings to generate in order to run the tests....
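One possible answer to the "what type of strings" question is a small fixed corpus with one probe per broad category, rather than purely random input. The sketch below is only a starting point: the class name, the probe list and the full-match assumption are all made up for illustration, and the probes deliberately contain no line terminators because plain `.*` does not match those by default in Java.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical corpus-based probe for overly broad filter expressions.
final class ProbeCorpus {
    // One probe per category: empty, space, lower/upper case, digit, phrase,
    // identifier-like, punctuation, tab, accented Latin, CJK.
    static final List<String> PROBES = Arrays.asList(
            "", " ", "a", "Z", "7", "hello world", "foo_bar-42",
            "!@#$%^&*()", "\t", "\u00e9", "\u65e5\u672c");

    // Number of probes the pattern fails to match in full.
    static long countRejected(String regex) {
        Pattern p = Pattern.compile(regex);
        return PROBES.stream().filter(s -> !p.matcher(s).matches()).count();
    }

    // Flag the filter for manual inspection when it rejects none of the probes.
    static boolean needsManualReview(String regex) {
        return countRejected(regex) == 0;
    }
}
```

Because the corpus is deterministic, the same filter always gets the same verdict, which makes the manual-inspection step easier to reason about than with random samples.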

Kind regards, Russ.
