Extracting substrings from a string using regex in Perl?

Time: 2022-09-13 13:39:30

I am trying to extract the substrings that match a pattern in a string. For example, I have text like the one below:

[ Pierre/NNP Vinken/NNP ]
,/, 
[ 61/CD years/NNS ]
old/JJ ,/, will/MD join/VB 
[ the/DT board/NN ]
as/IN 
[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ]
./. 
[ Mr./NNP Vinken/NNP ]
is/VBZ 
[ chairman/NN ]
of/IN 

I want to extract whatever comes before the slash (/) and whatever comes after it, but somehow my regex extracts only the first substring and ignores the rest of the substrings on the line.

My output looks like this:

tag:Pierre/NNP Vinken - word:Pierre/NNP Vinken/NNP ->1
tag:, - word:,/, ->1
tag:61/CD years - word:61/CD years/NNS ->1
tag:old/JJ ,/, will/MD join - word:old/JJ ,/, will/MD join/VB ->1
tag:the/DT board - word:the/DT board/NN ->1
tag:as - word:as/IN ->1
tag:a/DT nonexecutive/JJ director/NN Nov./NNP 29 - word:a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ->1
tag:. - word:./. ->1
tag:Mr./NNP Vinken - word:Mr./NNP Vinken/NNP ->1
tag:is - word:is/VBZ ->1
tag:chairman - word:chairman/NN ->1
tag:of - word:of/IN ->1

But what I actually want is something like this:

tag:NNP  - word:Pierre ->1
tag:NNP  - word:Vinken ->1
tag:,    - word:,      ->1
tag:CD   - word:61     ->1
.
.
etc.

The code I used:

    while (my $line = <$fh>) {
        chomp $line;
        #remove square brackets
        $line=~s/[\[\]]//;

        while($line =~m/((\s*(.*))\/((.*)\s+))/gi)
        {
            $word=$1;
            $tag=$2;
            #remove whitespace from left and right of string
            $word=~ s/^\s+|\s+$//g;
            $tag=~ s/^\s+|\s+$//g;
            $tags{$tag}++;
            $tagHash{$tag}{$word}++;
        }

    }
foreach my $str (sort keys %tagHash)
{
    foreach my $s (keys %{$tagHash{$str}} )
    {
        print "tags:$str - word: $s-> $tagHash{$str}{$s}\n";
    }
}

Any idea why my regex does not behave as it should?

EDIT:

The text files I am parsing also contain special characters and punctuation, which means the files will have tokens like these: ''/'' "/" ,/, ./. ?/? !/! and so on.

So I want to capture all of these, not only alphabetic and numeric characters.

2 Answers

#1


1  

The outermost set of parentheses, around your whole pattern, gets captured into $1, which is clearly not intended. Also, the greediness of .*\/ means that it takes everything up to the last / on the line. Likewise, .*\s+ takes everything up to the last whitespace, leaving only that final space for \s+.
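To make the greediness concrete, here is a small sketch that runs the question's pattern against one of the sample lines. The variable names are made up for illustration; only the pattern and the input come from the question.

```perl
use strict;
use warnings;

my $line = 'old/JJ ,/, will/MD join/VB ';

# The question's pattern: the greedy .* before the slash runs to the
# LAST slash on the line, so there is only one match per line.
my @groups = $line =~ m/((\s*(.*))\/((.*)\s+))/;
my ($before, $after) = @groups[2, 4];   # the two inner (.*) groups, $3 and $5

print "before: '$before'\n";   # before: 'old/JJ ,/, will/MD join'
print "after:  '$after'\n";    # after:  'VB'
```

Everything up to the final slash lands in a single capture, which matches the output shown in the question.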

One way to do this is by using a negated character class:

my ($word, $tag) = $line =~ m{ ([^/\s]+) / ([^/\s]+) }x;

The pattern [^/\s]+ matches a run of one or more consecutive characters, none of which is a / or whitespace. So you get one "word" before the / and one after it. If you instead take "whatever after slash" literally, as the question says, it is unclear where that should stop before the next slash.

Your loop can then become

while (my $line = <$fh>) 
{
    while ( $line =~ m{ ([^/\s]+) / ([^/\s]+) }gx )
    {
        $tagHash{$2}{$1}++;
    }
}

The other count (%tags) seems unrelated, so I left it out to focus on the question.
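For reference, a self-contained sketch of that loop on a few of the sample lines, with the lines hard-coded in an array instead of read from a file (an assumption made only so the example runs on its own). The printf layout is a guess at the desired output format.

```perl
use strict;
use warnings;

# Sample lines from the question (hard-coded here instead of a file).
my @lines = (
    '[ Pierre/NNP Vinken/NNP ]',
    ',/, ',
    '[ 61/CD years/NNS ]',
    'old/JJ ,/, will/MD join/VB ',
);

my %tagHash;
for my $line (@lines) {
    # Each match is one word/tag pair; /g resumes after the previous match.
    while ( $line =~ m{ ([^/\s]+) / ([^/\s]+) }gx ) {
        $tagHash{$2}{$1}++;
    }
}

for my $tag (sort keys %tagHash) {
    for my $word (sort keys %{ $tagHash{$tag} }) {
        printf "tag:%-4s - word:%-8s ->%d\n", $tag, $word, $tagHash{$tag}{$word};
    }
}
```

The square brackets need no separate removal here: they are always surrounded by whitespace in the sample, so they never sit next to a slash and the pattern skips them.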


However, there is something important missing here.

This approach cannot detect when a line differs from the expected format. For example

word1/tag1 word2/tag2/ tag3/word4/tag4

quietly produces wrong results. Some violations are simply skipped, but there are many bad cases.
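A quick sketch of what the pattern actually pulls out of that malformed line (the variable names are illustrative):

```perl
use strict;
use warnings;

my $bad = 'word1/tag1 word2/tag2/ tag3/word4/tag4';

# In list context with /g, the match returns all captures as one flat list.
my @pairs = $bad =~ m{ ([^/\s]+) / ([^/\s]+) }gx;

for (my $i = 0; $i < @pairs; $i += 2) {
    print "word:$pairs[$i] tag:$pairs[$i + 1]\n";
}
# word:word1 tag:tag1
# word:word2 tag:tag2
# word:tag3 tag:word4
```

The third pair is wrong (word4 is reported as a tag), and tag4 disappears without any warning.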

One way to catch this is to pre-process the line, checking that there are at least two words between every pair of adjacent slashes, and at least one word before the first slash and after the last. This means that each line is processed twice, and the code also gets messier. For example

while (my $line = <$fh>) 
{
    my @parts = split '/', $line;
    if (not shift @parts or not pop @parts or grep { 2 > split } @parts) {
        warn "Unexpected format: $line";
        next;
    }

    $tagHash{$2}{$1}++  while $line =~ m{ ([^/\s]+) / ([^/\s]+) }gx;
}

This check modifies the @parts array (shift and pop remove its first and last elements), so if that array is needed later it is better to use

if (!$parts[0] or !$parts[-1] or grep { 2 > split } @parts[1..@parts-2])  { ...

where, instead of grep, one can also use the short-circuiting any from List::Util.
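A rough sketch of the any variant follows. The helper name line_ok is made up for illustration, and it differs from the snippet above in two labeled ways: it passes a limit of -1 to split so that a trailing slash is not silently dropped, and it tests the first and last pieces with /\S/ rather than for truthiness, so that a word such as 0 is not rejected.

```perl
use strict;
use warnings;
use List::Util qw(any);

# Hypothetical helper: true when the line passes the format checks.
sub line_ok {
    my ($line) = @_;
    my @parts = split m{/}, $line, -1;    # -1 keeps trailing empty fields
    return 0 if @parts < 2;               # no slash at all
    return 0 if $parts[0] !~ /\S/ or $parts[-1] !~ /\S/;
    # Every piece between two slashes must hold a tag and the next word.
    return 0 if any { my @w = split ' ', $_; @w < 2 } @parts[1 .. $#parts - 1];
    return 1;
}

print line_ok('word1/tag1 word2/tag2')                  ? "ok\n" : "bad\n";   # ok
print line_ok('word1/tag1 word2/tag2/ tag3/word4/tag4') ? "ok\n" : "bad\n";   # bad
```

Unlike grep, any stops at the first offending piece, which matters little for short lines but keeps the intent clear.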

Another way would be to change the approach and parse the line carefully, instead of blindly hopping over regex matches. Since the first and last pieces may have only one word, this may be hard to do with a single regex. It is probably clearer and more practical to just split and work with the array.
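One possible reading of "split and work with the array" is sketched below. Note that it splits on whitespace rather than on '/', which is a swapped-in variant and not necessarily what the answer had in mind. The helper name parse_line is hypothetical, and the sketch assumes that neither a word nor a tag contains a slash.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [word, tag] pairs, or dies on a
# token that is not exactly word/tag.
sub parse_line {
    my ($line) = @_;
    my @pairs;
    for my $token (split ' ', $line) {
        next if $token eq '[' or $token eq ']';    # chunk brackets
        # Exactly one slash, with non-empty text on both sides.
        my ($word, $tag) = $token =~ m{^([^/]+)/([^/]+)\z}
            or die "Unexpected token '$token' in line: $line\n";
        push @pairs, [ $word, $tag ];
    }
    return @pairs;
}

my @pairs = parse_line('[ Pierre/NNP Vinken/NNP ]');
print "tag:$_->[1] - word:$_->[0]\n" for @pairs;

# A malformed line is reported instead of being parsed wrongly.
my $ok = eval { parse_line('word2/tag2/ tag3/word4/tag4'); 1 };
print "rejected: $@" unless $ok;
```

Because every whitespace-separated token is checked, nothing on the line can be skipped silently, at the cost of rejecting the whole line on the first bad token.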

It is hard to imagine data that always matches the expected format, so I'd suggest considering some of this.

#2


2  

I think you have word/tag pairs in which the word and the tag may contain anything except a few characters such as ], [ and \s:

\s*([^\[\]\s]+?)\/([^\[\]\s]+)\s*
    ^^^^^^^^^1

This regex is similar to your original pattern. (See DEMO)


Description:

1- This capturing group matches any character that is not [, ] or \s
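A small sketch of this regex used from Perl, on a line that mixes one of the sample chunks with two of the punctuation tokens from the question's edit. The test line and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $line = q{[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ] ''/'' ?/?};

# The regex from this answer; with /g in list context it returns
# word, tag, word, tag, ... as one flat list.
my @flat = $line =~ m{\s*([^\[\]\s]+?)/([^\[\]\s]+)\s*}g;

for (my $i = 0; $i < @flat; $i += 2) {
    print "tag:$flat[$i + 1] - word:$flat[$i]\n";
}
```

Since the second group allows slashes, a token such as tag3/word4/tag4 would be read as word tag3 with tag word4/tag4, so this pattern does not validate the format either.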
