算法,以查明两个Glob模式(或正则表达式)的匹配是否相交。

时间:2022-01-02 08:11:03

I'm looking at matching glob-style patterns similar the what the Redis KEYS command accepts. Quoting:

我正在寻找匹配的全局样式模式,类似于Redis KEYS命令接受的内容。引用:

  • h?llo matches hello, hallo and hxllo
  • h ?洛匹配你好,哈罗和赫西罗
  • h*llo matches hllo and heeeello
  • h*llo与hllo和heeeello相匹配。
  • h[ae]llo matches hello and hallo, but not hillo
  • 洛和哈罗是一对,但不是希罗

But I am not matching against a text string, but matching the pattern against another pattern with all operators being meaningful on both ends.

但是我并没有匹配文本字符串,而是将模式与另一个模式匹配,所有操作符在两端都是有意义的。

For example these patterns should match against each other in the same row:

例如,这些模式应该在同一行中相互匹配:

prefix*       prefix:extended*
*suffix       *:extended:suffix
left*right    left*middle*right
a*b*c         a*b*d*b*c
hello*        *ok
pre[ab]fix*   pre[bc]fix*

And these should not match:

这些不应该匹配:

prefix*       wrong:prefix:*
*suffix       *suffix:wrong
left*right    right*middle*left
pre[ab]fix*   pre[xy]fix*
?*b*?         bcb

So I'm wondering ...

所以我想知道…

  • if this is possible to do (implement a verification algorithm), if at all?
  • 如果这是可能的(实现一个验证算法),如果有的话?
  • if not possible, what subset of regex would be possible? (i.e. disallow * wildcard?)
  • 如果不可能,regex的哪些子集是可能的?(即不允许*通配符?)
  • if it is indeed possible, what is an efficient algorithm?
  • 如果确实有可能,什么是有效的算法?
  • what are the time complexity required?
  • 所需的时间复杂度是多少?

EDIT: Found this other question on RegEx subset but this is not exactly the same as the words that hello* and *ok matches is not a subset/superset of each other but they do intersect.

编辑:在RegEx子集上找到另一个问题,但这与hello*和*ok匹配的词不是彼此的子集/超集,但它们确实是相交的。

So I guess mathematically, this might be phrased as; is it possible to deterministically check that a set of words that one pattern match, intersecting with a set of words that another pattern matches, result in a non-empty set?

我猜从数学上来说,这可能是这样的;是否有可能决定性地检查一种模式匹配的一组单词,与另一种模式匹配的一组单词相交,结果是一个非空的集合?


EDIT: A friend @neizod drew up this elimination table which neatly visualize what might be a potential/partial solution: Elimination rule

编辑:我的一个朋友@neizod设计了这个消去表,这个消去表清晰地显示了什么可能是潜在的/部分的解决方案:消去规则


EDIT: Will adds extra bounty for those who can also provide working code (in any language) and test cases that proves it.

编辑:将为那些还能提供工作代码(在任何语言中)和测试用例的人增加额外的赏金。


EDIT: Added the ?*b*? test case discovered by @DanielGimenez in the comments.

编辑:添加了? * * ?由@DanielGimenez在评论中发现的测试用例。

5 个解决方案

#1


18  

Now witness the firepower of this fully ARMED and OPERATIONAL battle station!

现在,看看这个全副武装的作战站的火力吧!

(I have worked too much on this answer and my brain has broken; There should be a badge for that.)

(我在这个问题上做了太多的研究,我的大脑已经崩溃了;应该有个徽章。)

In order to determine if two patterns intersect, I have created a recursive backtracking parser -- when Kleene stars are encountered a new stack is created so that if it fails in the future everything is rolled back and and the star consumes the next character.

为了确定两个模式是否相交,我创建了一个递归回溯解析器——当遇到Kleene星时,会创建一个新的堆栈,以便如果它在未来失败,那么所有内容都将回滚,并且该星会使用下一个字符。

You can view the history of this answer to determine how arrived at all this and why it was necessary, but basically it wasn't sufficient to determine an intersection by looking ahead only one token, which was what I was doing before.

你可以查看这个答案的历史来确定它是如何到达的以及为什么它是必要的,但基本上,仅仅往前看一个标记来确定一个交集是不够的,这就是我之前所做的。

This was the case that broke the old answer [abcd]d => *d. The set matches the d after the star, so the left side would still have tokens remaining, while the right side would be complete. However, these patterns two intersect on ad, bd, cd and dd, so that needed to be fixed. My almost O(N) answer was thrown out.

这个例子打破了原来的答案[abcd]d => *d。集合匹配星号后面的d,所以左边仍然有令牌,而右边将是完整的。然而,这两种模式在ad、bd、cd和dd上相交,需要加以修正。我的几乎是O(N)的回答被否定了。

Lexer

The lexing process is trivial, except that is processes escape characters and removes redundant stars. Tokens are broken out into sets, stars, wild character (?), and character. This is different than my previous versions where one token was a string of characters instead of a single character. As more cases come up, having strings as tokens was more of a hindrance than advantage.

lexing过程是平凡的,除了它是处理转义字符和删除冗余恒星的过程。代币被分割成场景、星星、野性性格(?)和性格。这与我以前的版本不同,以前的版本中一个令牌是字符串,而不是单个字符。随着更多的情况出现,将字符串作为令牌是比优势更大的障碍。

Parser

Most of the functions of the parser are pretty trivial. A switch given the left side's type, calls a function that is a switch that determines the appropriate function to compare it with the right side's type. The result from the comparison bubbles up the two switches to the original callee, typically the main loop of the parser.

解析器的大多数功能都非常简单。给定左侧类型的开关,调用一个函数,该函数是一个开关,用于确定要将其与右侧类型进行比较的适当函数。比较的结果将这两个开关气泡到原来的callee,通常是解析器的主循环。

Parsing Stars

The simplicity ends with the star. When that is encountered it takes over everything. First it compares its side's next token with the other side's, advancing the other side until if finds a match.

简单以星星结束。当遇到这种情况时,它会接管一切。首先,它比较对方的下一个令牌,向前移动,直到找到匹配。

Once the match is found, it then checks if everything matches all the way up to the end of both patterns. If it does then the patterns intersect. Otherwise, it advances the other side's next token from the original one it was compared against and repeats the process.

一旦找到匹配项,它就会检查是否所有的东西都匹配到两个模式的末尾。如果它这样做了,那么图案就会相交。否则,它会将另一方的下一个令牌从原始令牌中向前推进,并重复这个过程。

When two anys are encountered then the take off into their own alternative branches starting from each others' next token.

当遇到两个anys时,就从彼此的下一个令牌开始,将其转换为各自的可选分支。

function intersects(left, right) {
    var lt, rt,
        result = new CompareResult(null, null, true);

    lt = (!left || left instanceof Token) ? left : tokenize(left);
    rt = (!right || right instanceof Token) ? right : tokenize(right);

    while (result.isGood && (lt || rt)) {
        result = tokensCompare(lt, rt);

        lt = result.leftNext;
        rt = result.rightNext;
    }

    return result;
}

function tokensCompare(lt, rt) {
    if (!lt && rt) return tokensCompare(rt, lt).swapTokens();

    switch (lt.type) {
        case TokenType.Char: return charCompare(lt, rt);
        case TokenType.Single: return singleCompare(lt, rt);
        case TokenType.Set: return setCompare(lt, rt);
        case TokenType.AnyString: return anyCompare(lt, rt);
    }
}

function anyCompare(tAny, tOther) {
    if (!tOther) return new CompareResult(tAny.next, null);

    var result = CompareResult.BadResult;

    while (tOther && !result.isGood) {
        while (tOther && !result.isGood) {
            switch (tOther.type) {
                case TokenType.Char: result = charCompare(tOther, tAny.next).swapTokens(); break;
                case TokenType.Single: result = singleCompare(tOther, tAny.next).swapTokens(); break;
                case TokenType.Set: result = setCompare(tOther, tAny.next).swapTokens(); break;
                case TokenType.AnyString:
                    // the anyCompare from the intersects will take over the processing.
                    result = intersects(tAny, tOther.next);
                    if (result.isGood) return result;
                    return intersects(tOther, tAny.next).swapTokens();
            }

            if (!result.isGood) tOther = tOther.next;
        }

        if (result.isGood) {
            // we've found a starting point, but now we want to make sure this will always work.
            result = intersects(result.leftNext, result.rightNext);
            if (!result.isGood) tOther = tOther.next;
        }
    }

    // If we never got a good result that means we've eaten everything.
    if (!result.isGood) result = new CompareResult(tAny.next, null, true);

    return result;
}

function charCompare(tChar, tOther) {
    if (!tOther) return CompareResult.BadResult;

    switch (tOther.type) {
        case TokenType.Char: return charCharCompare(tChar, tOther); 
        case TokenType.Single: return new CompareResult(tChar.next, tOther.next);
        case TokenType.Set: return setCharCompare(tOther, tChar).swapTokens(); 
        case TokenType.AnyString: return anyCompare(tOther, tChar).swapTokens();
    }
}

function singleCompare(tSingle, tOther) {
    if (!tOther) return CompareResult.BadResult;

    switch (tOther.type) {
        case TokenType.Char: return new CompareResult(tSingle.next, tOther.next);
        case TokenType.Single: return new CompareResult(tSingle.next, tOther.next);
        case TokenType.Set: return new CompareResult(tSingle.next, tOther.next);
        case TokenType.AnyString: return anyCompare(tOther, tSingle).swapTokens();
    }
}
function setCompare(tSet, tOther) {
    if (!tOther) return CompareResult.BadResult;

    switch (tOther.type) {
        case TokenType.Char: return setCharCompare(tSet, tOther);
        case TokenType.Single: return new CompareResult(tSet.next, tOther.next);
        case TokenType.Set: return setSetCompare(tSet, tOther);
        case TokenType.AnyString: return anyCompare(tOther, tSet).swapTokens();
    }
}

function anySingleCompare(tAny, tSingle) {
    var nextResult = (tAny.next) ? singleCompare(tSingle, tAny.next).swapTokens() :
        new CompareResult(tAny, tSingle.next);
    return (nextResult.isGood) ? nextResult: new CompareResult(tAny, tSingle.next);
}

function anyCharCompare(tAny, tChar) {
    var nextResult = (tAny.next) ? charCompare(tChar, tAny.next).swapTokens() :
        new CompareResult(tAny, tChar.next);

    return (nextResult.isGood) ? nextResult : new CompareResult(tAny, tChar.next);
}

function charCharCompare(litA, litB) {
    return (litA.val === litB.val) ?
        new CompareResult(litA.next, litB.next) : CompareResult.BadResult;
}

function setCharCompare(tSet, tChar) {
    return (tSet.val.indexOf(tChar.val) > -1) ?
        new CompareResult(tSet.next, tChar.next) : CompareResult.BadResult;
}

function setSetCompare(tSetA, tSetB) {
    var setA = tSetA.val,
        setB = tSetB.val;

    for (var i = 0, il = setA.length; i < il; i++) {
        if (setB.indexOf(setA.charAt(i)) > -1) return new CompareResult(tSetA.next, tSetB.next);
    }
    return CompareResult.BadResult;
}

jsFiddle

Time Complexity

Anything with the words "recursive backtracking" in it is at least O(N2).

任何带有“递归回溯”字样的东西都至少是O(N2)。

Maintainability and Readability

I purposely broke out any branches into there own functions with a singular switch. Assitionally I used named constants when a one character string would suffice. Doing this made the code longer and more verbose, but I think it makes it easier to follow.

我故意用一个奇异的开关把任何分支分解成自己的函数。通常,当一个字符串足够时,我使用命名常量。这样做会使代码更长、更详细,但我认为这样更容易遵循。

Tests

You can view all the tests in the Fiddle. You can view the comments in the Fiddle output to glean their purposes. Each token type was tested against each token type, but I haven't made one that tried all possible comparisons in a single test. I also came up with a few random tough ones like the one below.

你可以在小提琴里看到所有的测试。您可以查看小提琴输出中的注释,以收集它们的目的。每个令牌类型都针对每个令牌类型进行了测试,但是我还没有在一个测试中尝试所有可能的比较。我还随便找了几个棘手的问题,比如下面的那个。

abc[def]?fghi?*nop*[tuv]uv[wxy]?yz => a?[cde]defg*?ilmn[opq]*tu*[xyz]*

abc(def)fghi ? * nop *(德国莱茵)紫外线(wxy)?yz = > ?(cde)defg * ? ilmn[opq]*你*(某某)*

I added an interface on the jsFiddle if anybody wants to test this out themselves. The logging is broken once I added the recursion.

如果有人想自己测试的话,我在jsFiddle添加了一个接口。一旦添加了递归,日志记录就会被破坏。

I don't think I tried enough negative tests, especially with the last version I created.

我认为我没有做过足够的负面测试,尤其是我创建的上一个版本。

Optimization

Currently the solution is a brute force one, but is sufficient to handle any case. I would like to come back to this at some point to improve the time complexity with some simple optimizations.

目前的解决方案是使用蛮力,但对于任何情况都足够了。我想在某个时候回到这个问题上来,通过一些简单的优化来提高时间复杂度。

Checks at the start to reduce comparisons could increase processing time for certain common scenarios. For example, if one pattern starts with a star and one ends with one then we already know they will intersect. I can also check all the characters from the start and end of the patterns and remove them if the match on both patterns. This way they are excluded from any future recursion.

减少比较的开始检查可以增加某些常见场景的处理时间。例如,如果一个模式从一个恒星开始,一个以一个结束,那么我们已经知道它们会相交。我还可以从模式的开始和结束处检查所有字符,并在两个模式上匹配时删除它们。这样,它们就被排除在将来的任何递归之外。

Acknowledgements

I used @m.buettner's tests initially to test my code before I came up with my own. Also I walked through his code to help me understand the problem better.

我用@m。比特纳的测试最初是为了测试我的代码,然后我才想到了我自己的代码。我也浏览了他的代码,以帮助我更好地理解这个问题。

#2


7  

With your very reduced pattern language, the pastebin link in your question and jpmc26's comments are pretty much all the way there: the main question is, whether the literal left and right end of your input strings match. If they do, and both contain at least one *, the strings match (because you can always match the other strings intermediate literal text with that star). There is one special case: if only one of them is empty (after removing pre- and suffix), they can still match if the other consists entirely of *s.

使用非常简化的模式语言,您的问题中的pastebin链接和jpmc26的注释基本上都是这样:主要问题是,您的输入字符串的文字左端和右端是否匹配。如果它们同时包含至少一个*,则字符串匹配(因为您总是可以将其他字符串中间文字文本与该星号匹配)。有一种特殊情况:如果其中一个是空的(在删除了前缀和后缀之后),如果另一个完全是*s,它们仍然可以匹配。

Of course, when checking whether the ends of the string match, you need to take into account the single-character wildcard ? and character classes, too. The single-character wildcard is easy: it cannot fail, because it will always match whatever the other character is. If it's a character class, and the other is just a character, you need to check whether the character is in the class. If they are both classes, you need to check for an intersection of the classes (which is a simple set intersection).

当然,在检查字符串的两端是否匹配时,需要考虑单字符通配符?和字符类。单字符通配符很容易:它不会失败,因为它总是与其他字符匹配。如果它是一个字符类,而另一个只是一个字符,则需要检查该字符是否在类中。如果它们都是类,则需要检查类的交集(这是一个简单的集合交集)。

Here is all of that in JavaScript (check out the code comments to see how the algorithm I outlined above maps to the code):

以下是JavaScript中的所有内容(请查看代码注释,以查看我上面概述的算法如何映射到代码):

var trueInput = [
    { left: 'prefix*', right: 'prefix:extended*' },
    { left: '*suffix', right: '*:extended:suffix' },
    { left: 'left*right', right: 'left*middle*right' },
    { left: 'a*b*c', right: 'a*b*d*b*c' },
    { left: 'hello*', right: '*ok' },
    { left: '*', right: '*'},
    { left: '*', right: '**'},
    { left: '*', right: ''},
    { left: '', right: ''},
    { left: 'abc', right: 'a*c'},
    { left: 'a*c', right: 'a*c'},
    { left: 'a[bc]d', right: 'acd'},
    { left: 'a[bc]d', right: 'a[ce]d'},
    { left: 'a?d', right: 'acd'},
    { left: 'a[bc]d*wyz', right: 'abd*w[xy]z'},
];

var falseInput = [
    { left: 'prefix*', right: 'wrong:prefix:*' },
    { left: '*suffix', right: '*suffix:wrong' },
    { left: 'left*right', right: 'right*middle*left' },
    { left: 'abc', right: 'abcde'},
    { left: 'abcde', right: 'abc'},
    { left: 'a[bc]d', right: 'aed'},
    { left: 'a[bc]d', right: 'a[fe]d'},
    { left: 'a?e', right: 'acd'},
    { left: 'a[bc]d*wyz', right: 'abc*w[ab]z'},
];

// Expects either a single-character string (for literal strings
// and single-character wildcards) or an array (for character
// classes).
var characterIntersect = function(a,b) {
    // If one is a wildcard, there is an intersection.
    if (a === '?' || b === '?')
        return true;

    // If both are characters, they must be the same.
    if (typeof a === 'string' && typeof b === 'string')
        return a === b;

    // If one is a character class, we check that the other
    // is contained in the class.
    if (a instanceof Array && typeof b === 'string')
        return (a.indexOf(b) > -1);
    if (b instanceof Array && typeof a === 'string')
        return (b.indexOf(a) > -1);

    // Now both have to be arrays, so we need to check whether
    // they intersect.
    return a.filter(function(character) {
        return (b.indexOf(character) > -1);
    }).length > 0;
};

var patternIntersect = function(a,b) {
    // Turn the strings into character arrays because they are
    // easier to deal with.
    a = a.split("");
    b = b.split("");

    // Check the beginnings of the string (up until the first *
    // in either of them).
    while (a.length && b.length && a[0] !== '*' && b[0] !== '*')
    {
        // Remove the first character from each. If it's a [,
        // extract an array of all characters in the class.
        aChar = a.shift();
        if (aChar == '[')
        {
            aChar = a.splice(0, a.indexOf(']'));
            a.shift(); // remove the ]
        }
        bChar = b.shift();
        if (bChar == '[')
        {
            bChar = b.splice(0, b.indexOf(']'));
            b.shift(); // remove the ]
        }

        // Check if the two characters or classes overlap.
        if (!characterIntersect(aChar, bChar))
            return false;
    }

    // Same thing, but for the end of the string.
    while (a.length && b.length && a[a.length-1] !== '*' && b[b.length-1] !== '*')
    {
        aChar = a.pop();
        if (aChar == ']')
        {
            aChar = a.splice(a.indexOf('[')+1, Number.MAX_VALUE);
            a.pop(); // remove the [
        }
        bChar = b.pop();
        if (bChar == ']')
        {
            bChar = b.splice(b.indexOf('[')+1, Number.MAX_VALUE);
            b.pop(); // remove the [
        }

        if (!characterIntersect(aChar, bChar))
            return false;
    }

    // If one string is empty, the other has to be empty, too, or
    // consist only of stars.
    if (!a.length && /[^*]/.test(b.join('')) ||
        !b.length && /[^*]/.test(b.join('')))
        return false;

    // The only case not covered above is that both strings contain
    // a * in which case they certainly overlap.
    return true;
};

console.log('Should be all true:');
console.log(trueInput.map(function(pair) {
    return patternIntersect(pair.left, pair.right);
}));

console.log('Should be all false:');
console.log(falseInput.map(function(pair) {
    return patternIntersect(pair.left, pair.right);
}));

It's not the neatest implementation, but it works and is (hopefully) still quite readable. There is a fair bit of code duplication with checking the beginning and the end (which could be alleviated with a simple reverse after checking the beginning - but I figured that would just obscure things). And there are probably tons of other bits that could be greatly improved, but I think the logic is all in place.

它不是最整洁的实现,但它可以工作,而且(希望)仍然是可读的。检查开始和结束有相当多的代码重复(检查开始后可以用简单的反向来缓解),但我认为这只是模糊的东西)。可能还有大量的其他比特可以被大大改进,但是我认为逻辑是完全正确的。

A few more remarks: the implementation assumes that the patterns are well-formatted (no unmatched opening or closing brackets). Also, I took the array intersection code from this answer because it's compact - you could certainly improve on the efficiency of that if necessary.

还有几点:实现假定模式是格式良好的(没有不匹配的开括号或闭括号)。而且,我从这个答案中选取了数组的交集代码,因为它很紧凑——如果需要的话,你当然可以提高它的效率。

Regardless of those implementation details, I think I can answer your complexity question, too: the outer loop goes over both strings at the same time, a character at a time. So that's linear complexity. Everything inside the loop can be done in constant time, except the character class tests. If one character is a character class and the other isn't, you need linear time (with the size of the class being the parameter) to check whether the character is in the class. But this doesn't make it quadratic, because each character in the class means one less iteration of the outer loop. So that's still linear. The most costly thing is hence the intersection of two character classes. This might be more complex that linear time, but the worst it could get is O(N log N): after all, you could just sort both character classes, and then find an intersection in linear time. I think you might even be able to get overall linear time complexity, by hashing the characters in the character class to their Unicode code point (.charCodeAt(0) in JS) or some other number - and finding an intersection in a hashed set is possible in linear time. So, if you really want to, I think you should be able to get down to O(N).

不管这些实现细节如何,我想我也可以回答您的复杂性问题:外部循环同时处理两个字符串,一次一个字符。这就是线性复杂度。除了字符类测试之外,循环中的所有操作都可以在固定的时间内完成。如果一个字符是一个字符类,而另一个不是,则需要线性时间(以类的大小作为参数)来检查字符是否在类中。但这并不是二次函数,因为类中的每个字符都意味着外部循环的迭代次数减少了一次。这仍然是线性的。最昂贵的东西是两个字符类的交集。这可能比线性时间更复杂,但最坏的情况是O(N log N):毕竟,你可以对两个字符类进行排序,然后在线性时间内找到一个交集。我认为您甚至可以通过将字符类中的字符哈希到它们的Unicode代码点(. charcodeat(0)在JS中)或其他一些数字,来获得总体的线性时间复杂度——并且在哈希集中找到交集在线性时间是可能的。所以,如果你真的想要,我认为你应该可以把它降到O(N)

And what is N? The upper limit is sum of the length of both patterns, but in most cases it will actually be less (depending on the length of prefixes and suffixes of both patterns).

N是什么?上限是两个模式长度的总和,但在大多数情况下它实际上会更小(取决于两个模式的前缀和后缀的长度)。

Please point me to any edge-cases my algorithm is missing. I'm also happy about suggested improvements, if they improve or at least don't reduce the clarity of the code.

请给我指出任何我的算法缺失的边缘情况。我还对建议的改进感到高兴,如果改进或至少没有降低代码的清晰度。

Here is a live demo on JSBin (thanks to chakrit for pasting it there).

这是JSBin上的一个实时演示(感谢chakrit将它粘贴到那里)。


EDIT: As Daniel pointed out, there is a conceptual edge-case that my algorithm misses out on. If (before or after elimination of the beginning and end) one string contains no * and the other does, there are cases, where the two still *. Unfortunately, I don't have the time right now to adjust my code snippet to accommodate that problem, but I can outline how to resolve it.

编辑:正如Daniel指出的,有一个概念性的边缘案例,我的算法错过了。如果(在开始和结束之前或之后)一个字符串不包含*和另一个字符串,那么就会出现两个仍然冲突的情况。不幸的是,我现在没有时间调整代码片段以适应这个问题,但是我可以概述如何解决它。

After eliminating both ends of the strings, if both strings are either empty or both contain at least *, they will always match (go through the possible *-distributions after complete elimination to see this). The only case that's not trivial is if one string still contains *, but the other doesn't (be it empty or not). What we now need to do is walk both strings again from left to right. Let me call the string that contains * A and the one that doesn't contain * B.

在消除两个字符串的两端之后,如果两个字符串都是空的,或者两个都包含至少*,那么它们将始终匹配(在完全消除之后,遍历可能的*-distribution以查看此)。唯一不平凡的情况是,如果一个字符串仍然包含*,而另一个字符串不包含*(不管它是否为空)。我们现在需要做的是再次从左到右遍历这两个字符串。让我调用包含* A的字符串和不包含* B的字符串。

We walk A from left to right, skipping all * (paying attention only to ?, character classes and literal characters). For each of the relevant tokens, we check from left to right, if it can be matched in B (stopping at the first occurrence) and advance our B-cursor to that position. If we ever find a token in A that cannot be found in B any more, they do not match. If we manage to find a match for each token in A, they do match. This way, we still use linear time, because there is no backtracking involved. Here are two examples. These two should match:

我们从左到右走A,跳过所有*(只注意?,字符类和文字字符)。对于每个相关的令牌,我们从左到右进行检查,如果可以在B中匹配它(在第一次出现时停止)并将我们的B游标推进到那个位置。如果我们在a中找到了一个在B中再也找不到的令牌,它们就不匹配了。如果我们设法为a中的每个令牌找到匹配,它们就会匹配。这样,我们仍然使用线性时间,因为不涉及回溯。这里有两个例子。这两个应该匹配:

A: *a*[bc]*?*d* --- B: db?bdfdc
    ^                    ^
A: *a*[bc]*?*d* --- B: db?bdfdc
      ^                   ^
A: *a*[bc]*?*d* --- B: db?bdfdc
           ^               ^
A: *a*[bc]*?*d* --- B: db?bdfdc
             ^               ^

These two should not match:

这两者不应该匹配:

A: *a*[bc]*?*d* --- B: dbabdfc
    ^                    ^
A: *a*[bc]*?*d* --- B: dbabdfc
      ^                   ^
A: *a*[bc]*?*d* --- B: dbabdfc
           ^               ^
A: *a*[bc]*?*d* --- B: dbabdfc
             !               

It fails, because the ? cannot possibly match before the second d, after which there is no further d in B to accommodate for the last d in A.

它失败了,因为?不可能在第二个d之前匹配,在第二个d之后,B中不再有d来容纳A中的最后一个d。

This would probably be easy to add to my current implementation, if I had taken the time to properly parse the string into token objects. But now, I'd have to go through the trouble of parsing those character classes again. I hope this written outline of the addition is sufficient help.

如果我花时间将字符串正确地解析为令牌对象,那么很可能很容易添加到当前实现中。但是现在,我不得不再次解析这些字符类。我希望这个补充的书面提纲能有足够的帮助。

PS: Of course, my implementation does also not account for escaping metacharacters, and might choke on * inside character classes.

当然,我的实现也不能解释转义元字符,而且可能会在字符类中阻塞。

#3


6  

These special patterns are considerably less powerful that full regular expressions, but I'll point out that it is possible to do what you want even with general regular expressions. These must be "true" regexes, i.e. those that use only Kleene star, alternation ( the | operation ), and concatenation with any fixed alphabet plus the empty string and empty set. Of course you can also use any syntactic sugar on these ops: one-or-more (+), optional (?). Character sets are just a special kind of alternation [a-c] == a|b|c.

这些特殊的模式远不如完整的正则表达式强大,但我将指出,即使使用一般的正则表达式,也可以做您想做的事情。这些必须是“真正的”regex,即那些只使用Kleene star、alternation(|操作),并连接任何固定的字母表,加上空字符串和空集。当然,您还可以在这些操作上使用任何语法糖:1 -or-more(+),可选(?)字符集只是一种特殊的变换[a-c] = a|b|c。

The algorithm is simple in principle: Convert each regex to a DFA using the standard constructions: Thompson followed by powerset. Then use the cross product construction to compute the intersection DFA of the two originals. Finally check this intersection DFA to determine if it accepts at least one string. This is just a dfs from the start state to see if an accepting state can be reached.

该算法原理简单:使用标准结构将每个regex转换为DFA: Thompson和powerset。然后利用叉乘结构计算出两份原件的交点DFA。最后检查这个交集DFA以确定它是否接受至少一个字符串。这只是从开始状态的dfs,以查看是否可以到达接受状态。

If you are not familiar with these algorithms, it's easy to find Internet references.

如果您不熟悉这些算法,那么很容易找到Internet引用。

If at least one string is accepted by the intersection DFA, there is a match between the original regexes, and the path discovered by the dfs gives a string that satisfies both. Else there is no match.

如果交集DFA接受了至少一个字符串,则原始正则表达式之间有一个匹配,dfs发现的路径提供了一个满足这两个条件的字符串。否则就没有对手了。

#4


5  

Good question!

好问题!

The main complexity here is handling character classes ([...]). My approach is to replace each one with a regular expression that looks for either exactly one of the specified characters (or ?) or another character class that includes at least one of the specified characters. So for [xyz], this would be: ([xyz?]|\[[^\]]*[xyz].*?\]) - see below:

这里的主要复杂性是处理字符类([…])。我的方法是用一个正则表达式替换每个字符,该表达式查找指定的字符(或?)或包含至少一个指定字符的另一个字符类。(某某),这将是:((xyz ?)| \[[^ \]]*[xyz]。* ? \]),见下图:

算法,以查明两个Glob模式(或正则表达式)的匹配是否相交。

Then for "prefixes" (everything before the first *), put ^ at the beginning or for "suffixes" (everything after the last *), put $ at the end.

然后“前缀”(第*)之前的一切,把^初或“后缀”(一切过去后*),最后把美元。

Further details:-

进一步的细节:

  1. Also need to replace all instances of ? with (\[.*?\]|[^\\]]) to make it match either a character class or single character (excluding an opening square bracket).
  2. 还需要替换所有的实例?(\[。* ? \]|[^ \ \]]),使其匹配一个字符类或单个字符(不包括开放方括号)。
  3. Also need to replace each individual character that is not in a character class and is not ? to make it match either the same character, ? or a character class that includes the character. E.g. a would become ([a?]|\[[^\]]*a.*?\]). (A bit long-winded but turned out to be necessary - see comments below).
  4. 还需要替换不属于字符类且不属于字符类的每个字符?为了匹配相同的字符,?或者一个包含字符的字符类。如将成为([?]| \[[^ \]]*。* ? \])。(有点啰嗦,但结果证明是必要的——见下面的评论)。
  5. The testing should be done both ways round as follows: Test prefix #1 converted into regex against prefix #2 then test prefix #2 converted into regex against prefix #1. If either match, the prefixes can be said to "intersect".
  6. 测试应该按照以下两种方式进行:测试前缀#1在前缀#2上转换为regex,然后测试前缀#2在前缀#1上转换为regex。如果两者匹配,前缀可以被称为“intersect”。
  7. Repeat step (3.) for suffixes: For a positive result, both prefixes and suffixes must intersect.
  8. 对于后缀重复步骤(3.):对于一个积极的结果,前缀和后缀都必须相交。

EDIT: In addition to the above, there is a special case when one of the patterns contains at least one * but the other doesn't. In this case, the whole of the pattern with * should be converted into a regular expression: * should match against anything but with the proviso that it only includes whole character classes. This can be done by replacing all instances of * with (\[.*?\]|[^\\]]).

编辑:除了上面提到的,当其中一个模式包含至少一个*,而另一个模式不包含时,会出现特殊情况。在这种情况下,将*的整个模式转换为一个正则表达式:*应该与任何内容相匹配,但前提是它只包含整个字符类。可以通过取代*的所有实例(\[。* ? \]|[^ \ \]])。

To avoid this answer becoming bulky I won't post the full code but there is a working demo with unit tests here: http://jsfiddle.net/mb3Hn/4/

为了避免这个答案变得庞大,我不会发布完整的代码,但是这里有一个带有单元测试的演示:http://jsfiddle.net/mb3Hn/4/

EDIT #2 - Known incompleteness: In its current form, the demo doesn't cater for escaped characters (e.g. \[). Not a great excuse but I only noticed these late in the day - they aren't mentioned in the question, only the link. To handle them, a bit of additional regex complexity would be needed, e.g. to check for non-existence of a backslash immediately before the [. This should be fairly painless with negative lookbehind but unfortunately Javascript doesn't support it. There are workarounds such as reversing both the string and regular expression with negative lookahead but I'm not keen on making the code less readable with this extra complexity and not sure how important it is to the OP so will leave it as an "exercise for ther reader". In retrospect, should maybe have chosen a language with more comprehensive regex support!

编辑#2 -已知的不完全性:在当前的形式中,演示程序不包含转义字符(例如\[)。这不是一个很好的借口,但我只是在那天晚些时候才注意到这些——问题中没有提到它们,只有联系。要处理它们,需要一些额外的regex复杂性,例如,在[之前检查是否存在反斜杠。这应该是相当容易的负面的外观,但不幸的是Javascript不支持它。有一些变通的办法,比如将字符串和正则表达式都颠倒过来,但我不希望代码变得更难以读懂,因为代码会变得更复杂,也不确定它对OP有多重要,因此将它作为“对其他读者的练习”。回想起来,也许应该选择一种更全面的regex支持的语言!

#5


0  

Determining whether a regex matches a subset of another regex using greenery:

确定一个regex是否与使用绿叶的另一个regex的子集匹配:

First, pip3 install https://github.com/ferno/greenery/archive/master.zip.

首先,pip3安装https://github.com/ferno/greenery/archive/master.zip。

Then:

然后:

from greenery.lego import parse as p
a_z = p("[a-z]")
b_x = p("[b-x]")
assert a_z | b_x == a_z
m_n = p("m|n")
zero_nine = p("[0-9]")
assert not m_n | zero_nine == m_n

#1


18  

Now witness the firepower of this fully ARMED and OPERATIONAL battle station!

现在,看看这个全副武装的作战站的火力吧!

(I have worked too much on this answer and my brain has broken; There should be a badge for that.)

(我在这个问题上做了太多的研究,我的大脑已经崩溃了;应该有个徽章。)

In order to determine if two patterns intersect, I have created a recursive backtracking parser -- when Kleene stars are encountered a new stack is created so that if it fails in the future everything is rolled back and and the star consumes the next character.

为了确定两个模式是否相交,我创建了一个递归回溯解析器——当遇到Kleene星时,会创建一个新的堆栈,以便如果它在未来失败,那么所有内容都将回滚,并且该星会使用下一个字符。

You can view the history of this answer to determine how arrived at all this and why it was necessary, but basically it wasn't sufficient to determine an intersection by looking ahead only one token, which was what I was doing before.

你可以查看这个答案的历史来确定它是如何到达的以及为什么它是必要的,但基本上,仅仅往前看一个标记来确定一个交集是不够的,这就是我之前所做的。

This was the case that broke the old answer [abcd]d => *d. The set matches the d after the star, so the left side would still have tokens remaining, while the right side would be complete. However, these patterns two intersect on ad, bd, cd and dd, so that needed to be fixed. My almost O(N) answer was thrown out.

这个例子打破了原来的答案[abcd]d => *d。集合匹配星号后面的d,所以左边仍然有令牌,而右边将是完整的。然而,这两种模式在ad、bd、cd和dd上相交,需要加以修正。我的几乎是O(N)的回答被否定了。

Lexer

The lexing process is trivial, except that is processes escape characters and removes redundant stars. Tokens are broken out into sets, stars, wild character (?), and character. This is different than my previous versions where one token was a string of characters instead of a single character. As more cases come up, having strings as tokens was more of a hindrance than advantage.

lexing过程是平凡的,除了它是处理转义字符和删除冗余恒星的过程。代币被分割成场景、星星、野性性格(?)和性格。这与我以前的版本不同,以前的版本中一个令牌是字符串,而不是单个字符。随着更多的情况出现,将字符串作为令牌是比优势更大的障碍。

Parser

Most of the functions of the parser are pretty trivial. A switch given the left side's type, calls a function that is a switch that determines the appropriate function to compare it with the right side's type. The result from the comparison bubbles up the two switches to the original callee, typically the main loop of the parser.

解析器的大多数功能都非常简单。给定左侧类型的开关,调用一个函数,该函数是一个开关,用于确定要将其与右侧类型进行比较的适当函数。比较的结果将这两个开关气泡到原来的callee,通常是解析器的主循环。

Parsing Stars

The simplicity ends with the star. When that is encountered it takes over everything. First it compares its side's next token with the other side's, advancing the other side until if finds a match.

简单以星星结束。当遇到这种情况时,它会接管一切。首先,它比较对方的下一个令牌,向前移动,直到找到匹配。

Once the match is found, it then checks if everything matches all the way up to the end of both patterns. If it does then the patterns intersect. Otherwise, it advances the other side's next token from the original one it was compared against and repeats the process.

一旦找到匹配项,它就会检查是否所有的东西都匹配到两个模式的末尾。如果它这样做了,那么图案就会相交。否则,它会将另一方的下一个令牌从原始令牌中向前推进,并重复这个过程。

When two anys are encountered then the take off into their own alternative branches starting from each others' next token.

当遇到两个anys时,就从彼此的下一个令牌开始,将其转换为各自的可选分支。

function intersects(left, right) {
    var lt, rt,
        result = new CompareResult(null, null, true);

    lt = (!left || left instanceof Token) ? left : tokenize(left);
    rt = (!right || right instanceof Token) ? right : tokenize(right);

    while (result.isGood && (lt || rt)) {
        result = tokensCompare(lt, rt);

        lt = result.leftNext;
        rt = result.rightNext;
    }

    return result;
}

function tokensCompare(lt, rt) {
    if (!lt && rt) return tokensCompare(rt, lt).swapTokens();

    switch (lt.type) {
        case TokenType.Char: return charCompare(lt, rt);
        case TokenType.Single: return singleCompare(lt, rt);
        case TokenType.Set: return setCompare(lt, rt);
        case TokenType.AnyString: return anyCompare(lt, rt);
    }
}

function anyCompare(tAny, tOther) {
    if (!tOther) return new CompareResult(tAny.next, null);

    var result = CompareResult.BadResult;

    while (tOther && !result.isGood) {
        while (tOther && !result.isGood) {
            switch (tOther.type) {
                case TokenType.Char: result = charCompare(tOther, tAny.next).swapTokens(); break;
                case TokenType.Single: result = singleCompare(tOther, tAny.next).swapTokens(); break;
                case TokenType.Set: result = setCompare(tOther, tAny.next).swapTokens(); break;
                case TokenType.AnyString:
                    // the anyCompare from the intersects will take over the processing.
                    result = intersects(tAny, tOther.next);
                    if (result.isGood) return result;
                    return intersects(tOther, tAny.next).swapTokens();
            }

            if (!result.isGood) tOther = tOther.next;
        }

        if (result.isGood) {
            // we've found a starting point, but now we want to make sure this will always work.
            result = intersects(result.leftNext, result.rightNext);
            if (!result.isGood) tOther = tOther.next;
        }
    }

    // If we never got a good result that means we've eaten everything.
    if (!result.isGood) result = new CompareResult(tAny.next, null, true);

    return result;
}

function charCompare(tChar, tOther) {
    if (!tOther) return CompareResult.BadResult;

    switch (tOther.type) {
        case TokenType.Char: return charCharCompare(tChar, tOther); 
        case TokenType.Single: return new CompareResult(tChar.next, tOther.next);
        case TokenType.Set: return setCharCompare(tOther, tChar).swapTokens(); 
        case TokenType.AnyString: return anyCompare(tOther, tChar).swapTokens();
    }
}

function singleCompare(tSingle, tOther) {
    if (!tOther) return CompareResult.BadResult;

    switch (tOther.type) {
        case TokenType.Char: return new CompareResult(tSingle.next, tOther.next);
        case TokenType.Single: return new CompareResult(tSingle.next, tOther.next);
        case TokenType.Set: return new CompareResult(tSingle.next, tOther.next);
        case TokenType.AnyString: return anyCompare(tOther, tSingle).swapTokens();
    }
}
function setCompare(tSet, tOther) {
    if (!tOther) return CompareResult.BadResult;

    switch (tOther.type) {
        case TokenType.Char: return setCharCompare(tSet, tOther);
        case TokenType.Single: return new CompareResult(tSet.next, tOther.next);
        case TokenType.Set: return setSetCompare(tSet, tOther);
        case TokenType.AnyString: return anyCompare(tOther, tSet).swapTokens();
    }
}

function anySingleCompare(tAny, tSingle) {
    var nextResult = (tAny.next) ? singleCompare(tSingle, tAny.next).swapTokens() :
        new CompareResult(tAny, tSingle.next);
    return (nextResult.isGood) ? nextResult: new CompareResult(tAny, tSingle.next);
}

function anyCharCompare(tAny, tChar) {
    var nextResult = (tAny.next) ? charCompare(tChar, tAny.next).swapTokens() :
        new CompareResult(tAny, tChar.next);

    return (nextResult.isGood) ? nextResult : new CompareResult(tAny, tChar.next);
}

function charCharCompare(litA, litB) {
    return (litA.val === litB.val) ?
        new CompareResult(litA.next, litB.next) : CompareResult.BadResult;
}

function setCharCompare(tSet, tChar) {
    return (tSet.val.indexOf(tChar.val) > -1) ?
        new CompareResult(tSet.next, tChar.next) : CompareResult.BadResult;
}

function setSetCompare(tSetA, tSetB) {
    var setA = tSetA.val,
        setB = tSetB.val;

    for (var i = 0, il = setA.length; i < il; i++) {
        if (setB.indexOf(setA.charAt(i)) > -1) return new CompareResult(tSetA.next, tSetB.next);
    }
    return CompareResult.BadResult;
}

jsFiddle

Time Complexity

Anything with the words "recursive backtracking" in it is at least O(N2).

任何带有“递归回溯”字样的东西都至少是O(N2)。

Maintainability and Readability

I purposely broke out any branches into there own functions with a singular switch. Assitionally I used named constants when a one character string would suffice. Doing this made the code longer and more verbose, but I think it makes it easier to follow.

我故意用一个奇异的开关把任何分支分解成自己的函数。通常,当一个字符串足够时,我使用命名常量。这样做会使代码更长、更详细,但我认为这样更容易遵循。

Tests

You can view all the tests in the Fiddle. You can view the comments in the Fiddle output to glean their purposes. Each token type was tested against each token type, but I haven't made one that tried all possible comparisons in a single test. I also came up with a few random tough ones like the one below.

你可以在小提琴里看到所有的测试。您可以查看小提琴输出中的注释,以收集它们的目的。每个令牌类型都针对每个令牌类型进行了测试,但是我还没有在一个测试中尝试所有可能的比较。我还随便找了几个棘手的问题,比如下面的那个。

abc[def]?fghi?*nop*[tuv]uv[wxy]?yz => a?[cde]defg*?ilmn[opq]*tu*[xyz]*

abc(def)fghi ? * nop *(德国莱茵)紫外线(wxy)?yz = > ?(cde)defg * ? ilmn[opq]*你*(某某)*

I added an interface on the jsFiddle if anybody wants to test this out themselves. The logging is broken once I added the recursion.

如果有人想自己测试的话,我在jsFiddle添加了一个接口。一旦添加了递归,日志记录就会被破坏。

I don't think I tried enough negative tests, especially with the last version I created.

我认为我没有做过足够的负面测试,尤其是我创建的上一个版本。

Optimization

Currently the solution is a brute force one, but is sufficient to handle any case. I would like to come back to this at some point to improve the time complexity with some simple optimizations.

目前的解决方案是使用蛮力,但对于任何情况都足够了。我想在某个时候回到这个问题上来,通过一些简单的优化来提高时间复杂度。

Checks at the start to reduce comparisons could increase processing time for certain common scenarios. For example, if one pattern starts with a star and one ends with one then we already know they will intersect. I can also check all the characters from the start and end of the patterns and remove them if the match on both patterns. This way they are excluded from any future recursion.

减少比较的开始检查可以增加某些常见场景的处理时间。例如,如果一个模式从一个恒星开始,一个以一个结束,那么我们已经知道它们会相交。我还可以从模式的开始和结束处检查所有字符,并在两个模式上匹配时删除它们。这样,它们就被排除在将来的任何递归之外。

Acknowledgements

I used @m.buettner's tests initially to test my code before I came up with my own. Also I walked through his code to help me understand the problem better.

我用@m。比特纳的测试最初是为了测试我的代码,然后我才想到了我自己的代码。我也浏览了他的代码,以帮助我更好地理解这个问题。

#2


7  

With your very reduced pattern language, the pastebin link in your question and jpmc26's comments are pretty much all the way there: the main question is, whether the literal left and right end of your input strings match. If they do, and both contain at least one *, the strings match (because you can always match the other strings intermediate literal text with that star). There is one special case: if only one of them is empty (after removing pre- and suffix), they can still match if the other consists entirely of *s.

使用非常简化的模式语言,您的问题中的pastebin链接和jpmc26的注释基本上都是这样:主要问题是,您的输入字符串的文字左端和右端是否匹配。如果它们同时包含至少一个*,则字符串匹配(因为您总是可以将其他字符串中间文字文本与该星号匹配)。有一种特殊情况:如果其中一个是空的(在删除了前缀和后缀之后),如果另一个完全是*s,它们仍然可以匹配。

Of course, when checking whether the ends of the string match, you need to take into account the single-character wildcard ? and character classes, too. The single-character wildcard is easy: it cannot fail, because it will always match whatever the other character is. If it's a character class, and the other is just a character, you need to check whether the character is in the class. If they are both classes, you need to check for an intersection of the classes (which is a simple set intersection).

当然,在检查字符串的两端是否匹配时,需要考虑单字符通配符?和字符类。单字符通配符很容易:它不会失败,因为它总是与其他字符匹配。如果它是一个字符类,而另一个只是一个字符,则需要检查该字符是否在类中。如果它们都是类,则需要检查类的交集(这是一个简单的集合交集)。

Here is all of that in JavaScript (check out the code comments to see how the algorithm I outlined above maps to the code):

以下是JavaScript中的所有内容(请查看代码注释,以查看我上面概述的算法如何映射到代码):

var trueInput = [
    { left: 'prefix*', right: 'prefix:extended*' },
    { left: '*suffix', right: '*:extended:suffix' },
    { left: 'left*right', right: 'left*middle*right' },
    { left: 'a*b*c', right: 'a*b*d*b*c' },
    { left: 'hello*', right: '*ok' },
    { left: '*', right: '*'},
    { left: '*', right: '**'},
    { left: '*', right: ''},
    { left: '', right: ''},
    { left: 'abc', right: 'a*c'},
    { left: 'a*c', right: 'a*c'},
    { left: 'a[bc]d', right: 'acd'},
    { left: 'a[bc]d', right: 'a[ce]d'},
    { left: 'a?d', right: 'acd'},
    { left: 'a[bc]d*wyz', right: 'abd*w[xy]z'},
];

var falseInput = [
    { left: 'prefix*', right: 'wrong:prefix:*' },
    { left: '*suffix', right: '*suffix:wrong' },
    { left: 'left*right', right: 'right*middle*left' },
    { left: 'abc', right: 'abcde'},
    { left: 'abcde', right: 'abc'},
    { left: 'a[bc]d', right: 'aed'},
    { left: 'a[bc]d', right: 'a[fe]d'},
    { left: 'a?e', right: 'acd'},
    { left: 'a[bc]d*wyz', right: 'abc*w[ab]z'},
];

// Expects either a single-character string (for literal strings
// and single-character wildcards) or an array (for character
// classes).
var characterIntersect = function(a,b) {
    // If one is a wildcard, there is an intersection.
    if (a === '?' || b === '?')
        return true;

    // If both are characters, they must be the same.
    if (typeof a === 'string' && typeof b === 'string')
        return a === b;

    // If one is a character class, we check that the other
    // is contained in the class.
    if (a instanceof Array && typeof b === 'string')
        return (a.indexOf(b) > -1);
    if (b instanceof Array && typeof a === 'string')
        return (b.indexOf(a) > -1);

    // Now both have to be arrays, so we need to check whether
    // they intersect.
    return a.filter(function(character) {
        return (b.indexOf(character) > -1);
    }).length > 0;
};

var patternIntersect = function(a,b) {
    // Turn the strings into character arrays because they are
    // easier to deal with.
    a = a.split("");
    b = b.split("");

    // Check the beginnings of the string (up until the first *
    // in either of them).
    while (a.length && b.length && a[0] !== '*' && b[0] !== '*')
    {
        // Remove the first character from each. If it's a [,
        // extract an array of all characters in the class.
        aChar = a.shift();
        if (aChar == '[')
        {
            aChar = a.splice(0, a.indexOf(']'));
            a.shift(); // remove the ]
        }
        bChar = b.shift();
        if (bChar == '[')
        {
            bChar = b.splice(0, b.indexOf(']'));
            b.shift(); // remove the ]
        }

        // Check if the two characters or classes overlap.
        if (!characterIntersect(aChar, bChar))
            return false;
    }

    // Same thing, but for the end of the string.
    while (a.length && b.length && a[a.length-1] !== '*' && b[b.length-1] !== '*')
    {
        aChar = a.pop();
        if (aChar == ']')
        {
            aChar = a.splice(a.indexOf('[')+1, Number.MAX_VALUE);
            a.pop(); // remove the [
        }
        bChar = b.pop();
        if (bChar == ']')
        {
            bChar = b.splice(b.indexOf('[')+1, Number.MAX_VALUE);
            b.pop(); // remove the [
        }

        if (!characterIntersect(aChar, bChar))
            return false;
    }

    // If one string is empty, the other has to be empty, too, or
    // consist only of stars.
    if (!a.length && /[^*]/.test(b.join('')) ||
        !b.length && /[^*]/.test(b.join('')))
        return false;

    // The only case not covered above is that both strings contain
    // a * in which case they certainly overlap.
    return true;
};

console.log('Should be all true:');
console.log(trueInput.map(function(pair) {
    return patternIntersect(pair.left, pair.right);
}));

console.log('Should be all false:');
console.log(falseInput.map(function(pair) {
    return patternIntersect(pair.left, pair.right);
}));

It's not the neatest implementation, but it works and is (hopefully) still quite readable. There is a fair bit of code duplication with checking the beginning and the end (which could be alleviated with a simple reverse after checking the beginning - but I figured that would just obscure things). And there are probably tons of other bits that could be greatly improved, but I think the logic is all in place.

它不是最整洁的实现,但它可以工作,而且(希望)仍然是可读的。检查开始和结束有相当多的代码重复(检查开始后可以用简单的反向来缓解),但我认为这只是模糊的东西)。可能还有大量的其他比特可以被大大改进,但是我认为逻辑是完全正确的。

A few more remarks: the implementation assumes that the patterns are well-formatted (no unmatched opening or closing brackets). Also, I took the array intersection code from this answer because it's compact - you could certainly improve on the efficiency of that if necessary.

还有几点:实现假定模式是格式良好的(没有不匹配的开括号或闭括号)。而且,我从这个答案中选取了数组的交集代码,因为它很紧凑——如果需要的话,你当然可以提高它的效率。

Regardless of those implementation details, I think I can answer your complexity question, too: the outer loop goes over both strings at the same time, a character at a time. So that's linear complexity. Everything inside the loop can be done in constant time, except the character class tests. If one character is a character class and the other isn't, you need linear time (with the size of the class being the parameter) to check whether the character is in the class. But this doesn't make it quadratic, because each character in the class means one less iteration of the outer loop. So that's still linear. The most costly thing is hence the intersection of two character classes. This might be more complex that linear time, but the worst it could get is O(N log N): after all, you could just sort both character classes, and then find an intersection in linear time. I think you might even be able to get overall linear time complexity, by hashing the characters in the character class to their Unicode code point (.charCodeAt(0) in JS) or some other number - and finding an intersection in a hashed set is possible in linear time. So, if you really want to, I think you should be able to get down to O(N).

不管这些实现细节如何,我想我也可以回答您的复杂性问题:外部循环同时处理两个字符串,一次一个字符。这就是线性复杂度。除了字符类测试之外,循环中的所有操作都可以在固定的时间内完成。如果一个字符是一个字符类,而另一个不是,则需要线性时间(以类的大小作为参数)来检查字符是否在类中。但这并不是二次函数,因为类中的每个字符都意味着外部循环的迭代次数减少了一次。这仍然是线性的。最昂贵的东西是两个字符类的交集。这可能比线性时间更复杂,但最坏的情况是O(N log N):毕竟,你可以对两个字符类进行排序,然后在线性时间内找到一个交集。我认为您甚至可以通过将字符类中的字符哈希到它们的Unicode代码点(. charcodeat(0)在JS中)或其他一些数字,来获得总体的线性时间复杂度——并且在哈希集中找到交集在线性时间是可能的。所以,如果你真的想要,我认为你应该可以把它降到O(N)

And what is N? The upper limit is sum of the length of both patterns, but in most cases it will actually be less (depending on the length of prefixes and suffixes of both patterns).

N是什么?上限是两个模式长度的总和,但在大多数情况下它实际上会更小(取决于两个模式的前缀和后缀的长度)。

Please point me to any edge-cases my algorithm is missing. I'm also happy about suggested improvements, if they improve or at least don't reduce the clarity of the code.

请给我指出任何我的算法缺失的边缘情况。我还对建议的改进感到高兴,如果改进或至少没有降低代码的清晰度。

Here is a live demo on JSBin (thanks to chakrit for pasting it there).

这是JSBin上的一个实时演示(感谢chakrit将它粘贴到那里)。


EDIT: As Daniel pointed out, there is a conceptual edge-case that my algorithm misses out on. If (before or after elimination of the beginning and end) one string contains no * and the other does, there are cases, where the two still *. Unfortunately, I don't have the time right now to adjust my code snippet to accommodate that problem, but I can outline how to resolve it.

编辑:正如Daniel指出的,有一个概念性的边缘案例,我的算法错过了。如果(在开始和结束之前或之后)一个字符串不包含*和另一个字符串,那么就会出现两个仍然冲突的情况。不幸的是,我现在没有时间调整代码片段以适应这个问题,但是我可以概述如何解决它。

After eliminating both ends of the strings, if both strings are either empty or both contain at least *, they will always match (go through the possible *-distributions after complete elimination to see this). The only case that's not trivial is if one string still contains *, but the other doesn't (be it empty or not). What we now need to do is walk both strings again from left to right. Let me call the string that contains * A and the one that doesn't contain * B.

在消除两个字符串的两端之后,如果两个字符串都是空的,或者两个都包含至少*,那么它们将始终匹配(在完全消除之后,遍历可能的*-distribution以查看此)。唯一不平凡的情况是,如果一个字符串仍然包含*,而另一个字符串不包含*(不管它是否为空)。我们现在需要做的是再次从左到右遍历这两个字符串。让我调用包含* A的字符串和不包含* B的字符串。

We walk A from left to right, skipping all * (paying attention only to ?, character classes and literal characters). For each of the relevant tokens, we check from left to right, if it can be matched in B (stopping at the first occurrence) and advance our B-cursor to that position. If we ever find a token in A that cannot be found in B any more, they do not match. If we manage to find a match for each token in A, they do match. This way, we still use linear time, because there is no backtracking involved. Here are two examples. These two should match:

我们从左到右走A,跳过所有*(只注意?,字符类和文字字符)。对于每个相关的令牌,我们从左到右进行检查,如果可以在B中匹配它(在第一次出现时停止)并将我们的B游标推进到那个位置。如果我们在a中找到了一个在B中再也找不到的令牌,它们就不匹配了。如果我们设法为a中的每个令牌找到匹配,它们就会匹配。这样,我们仍然使用线性时间,因为不涉及回溯。这里有两个例子。这两个应该匹配:

A: *a*[bc]*?*d* --- B: db?bdfdc
    ^                    ^
A: *a*[bc]*?*d* --- B: db?bdfdc
      ^                   ^
A: *a*[bc]*?*d* --- B: db?bdfdc
           ^               ^
A: *a*[bc]*?*d* --- B: db?bdfdc
             ^               ^

These two should not match:

这两者不应该匹配:

A: *a*[bc]*?*d* --- B: dbabdfc
    ^                    ^
A: *a*[bc]*?*d* --- B: dbabdfc
      ^                   ^
A: *a*[bc]*?*d* --- B: dbabdfc
           ^               ^
A: *a*[bc]*?*d* --- B: dbabdfc
             !               

It fails, because the ? cannot possibly match before the second d, after which there is no further d in B to accommodate for the last d in A.

它失败了,因为?不可能在第二个d之前匹配,在第二个d之后,B中不再有d来容纳A中的最后一个d。

This would probably be easy to add to my current implementation, if I had taken the time to properly parse the string into token objects. But now, I'd have to go through the trouble of parsing those character classes again. I hope this written outline of the addition is sufficient help.

如果我花时间将字符串正确地解析为令牌对象,那么很可能很容易添加到当前实现中。但是现在,我不得不再次解析这些字符类。我希望这个补充的书面提纲能有足够的帮助。

PS: Of course, my implementation does also not account for escaping metacharacters, and might choke on * inside character classes.

当然,我的实现也不能解释转义元字符,而且可能会在字符类中阻塞。

#3


6  

These special patterns are considerably less powerful that full regular expressions, but I'll point out that it is possible to do what you want even with general regular expressions. These must be "true" regexes, i.e. those that use only Kleene star, alternation ( the | operation ), and concatenation with any fixed alphabet plus the empty string and empty set. Of course you can also use any syntactic sugar on these ops: one-or-more (+), optional (?). Character sets are just a special kind of alternation [a-c] == a|b|c.

这些特殊的模式远不如完整的正则表达式强大,但我将指出,即使使用一般的正则表达式,也可以做您想做的事情。这些必须是“真正的”regex,即那些只使用Kleene star、alternation(|操作),并连接任何固定的字母表,加上空字符串和空集。当然,您还可以在这些操作上使用任何语法糖:1 -or-more(+),可选(?)字符集只是一种特殊的变换[a-c] = a|b|c。

The algorithm is simple in principle: Convert each regex to a DFA using the standard constructions: Thompson followed by powerset. Then use the cross product construction to compute the intersection DFA of the two originals. Finally check this intersection DFA to determine if it accepts at least one string. This is just a dfs from the start state to see if an accepting state can be reached.

该算法原理简单:使用标准结构将每个regex转换为DFA: Thompson和powerset。然后利用叉乘结构计算出两份原件的交点DFA。最后检查这个交集DFA以确定它是否接受至少一个字符串。这只是从开始状态的dfs,以查看是否可以到达接受状态。

If you are not familiar with these algorithms, it's easy to find Internet references.

如果您不熟悉这些算法,那么很容易找到Internet引用。

If at least one string is accepted by the intersection DFA, there is a match between the original regexes, and the path discovered by the dfs gives a string that satisfies both. Else there is no match.

如果交集DFA接受了至少一个字符串,则原始正则表达式之间有一个匹配,dfs发现的路径提供了一个满足这两个条件的字符串。否则就没有对手了。

#4


5  

Good question!

好问题!

The main complexity here is handling character classes ([...]). My approach is to replace each one with a regular expression that looks for either exactly one of the specified characters (or ?) or another character class that includes at least one of the specified characters. So for [xyz], this would be: ([xyz?]|\[[^\]]*[xyz].*?\]) - see below:

这里的主要复杂性是处理字符类([…])。我的方法是用一个正则表达式替换每个字符,该表达式查找指定的字符(或?)或包含至少一个指定字符的另一个字符类。(某某),这将是:((xyz ?)| \[[^ \]]*[xyz]。* ? \]),见下图:

算法,以查明两个Glob模式(或正则表达式)的匹配是否相交。

Then for "prefixes" (everything before the first *), put ^ at the beginning or for "suffixes" (everything after the last *), put $ at the end.

然后“前缀”(第*)之前的一切,把^初或“后缀”(一切过去后*),最后把美元。

Further details:-

进一步的细节:

  1. Also need to replace all instances of ? with (\[.*?\]|[^\\]]) to make it match either a character class or single character (excluding an opening square bracket).
  2. 还需要替换所有的实例?(\[。* ? \]|[^ \ \]]),使其匹配一个字符类或单个字符(不包括开放方括号)。
  3. Also need to replace each individual character that is not in a character class and is not ? to make it match either the same character, ? or a character class that includes the character. E.g. a would become ([a?]|\[[^\]]*a.*?\]). (A bit long-winded but turned out to be necessary - see comments below).
  4. 还需要替换不属于字符类且不属于字符类的每个字符?为了匹配相同的字符,?或者一个包含字符的字符类。如将成为([?]| \[[^ \]]*。* ? \])。(有点啰嗦,但结果证明是必要的——见下面的评论)。
  5. The testing should be done both ways round as follows: Test prefix #1 converted into regex against prefix #2 then test prefix #2 converted into regex against prefix #1. If either match, the prefixes can be said to "intersect".
  6. 测试应该按照以下两种方式进行:测试前缀#1在前缀#2上转换为regex,然后测试前缀#2在前缀#1上转换为regex。如果两者匹配,前缀可以被称为“intersect”。
  7. Repeat step (3.) for suffixes: For a positive result, both prefixes and suffixes must intersect.
  8. 对于后缀重复步骤(3.):对于一个积极的结果,前缀和后缀都必须相交。

EDIT: In addition to the above, there is a special case when one of the patterns contains at least one * but the other doesn't. In this case, the whole of the pattern with * should be converted into a regular expression: * should match against anything but with the proviso that it only includes whole character classes. This can be done by replacing all instances of * with (\[.*?\]|[^\\]]).

编辑:除了上面提到的,当其中一个模式包含至少一个*,而另一个模式不包含时,会出现特殊情况。在这种情况下,将*的整个模式转换为一个正则表达式:*应该与任何内容相匹配,但前提是它只包含整个字符类。可以通过取代*的所有实例(\[。* ? \]|[^ \ \]])。

To avoid this answer becoming bulky I won't post the full code but there is a working demo with unit tests here: http://jsfiddle.net/mb3Hn/4/

为了避免这个答案变得庞大,我不会发布完整的代码,但是这里有一个带有单元测试的演示:http://jsfiddle.net/mb3Hn/4/

EDIT #2 - Known incompleteness: In its current form, the demo doesn't cater for escaped characters (e.g. \[). Not a great excuse but I only noticed these late in the day - they aren't mentioned in the question, only the link. To handle them, a bit of additional regex complexity would be needed, e.g. to check for non-existence of a backslash immediately before the [. This should be fairly painless with negative lookbehind but unfortunately Javascript doesn't support it. There are workarounds such as reversing both the string and regular expression with negative lookahead but I'm not keen on making the code less readable with this extra complexity and not sure how important it is to the OP so will leave it as an "exercise for ther reader". In retrospect, should maybe have chosen a language with more comprehensive regex support!

编辑#2 -已知的不完全性:在当前的形式中,演示程序不包含转义字符(例如\[)。这不是一个很好的借口,但我只是在那天晚些时候才注意到这些——问题中没有提到它们,只有联系。要处理它们,需要一些额外的regex复杂性,例如,在[之前检查是否存在反斜杠。这应该是相当容易的负面的外观,但不幸的是Javascript不支持它。有一些变通的办法,比如将字符串和正则表达式都颠倒过来,但我不希望代码变得更难以读懂,因为代码会变得更复杂,也不确定它对OP有多重要,因此将它作为“对其他读者的练习”。回想起来,也许应该选择一种更全面的regex支持的语言!

#5


0  

Determining whether a regex matches a subset of another regex using greenery:

确定一个regex是否与使用绿叶的另一个regex的子集匹配:

First, pip3 install https://github.com/ferno/greenery/archive/master.zip.

首先,pip3安装https://github.com/ferno/greenery/archive/master.zip。

Then:

然后:

from greenery.lego import parse as p
a_z = p("[a-z]")
b_x = p("[b-x]")
assert a_z | b_x == a_z
m_n = p("m|n")
zero_nine = p("[0-9]")
assert not m_n | zero_nine == m_n