How to add dashes in a Perl regex after matching a pattern

Time: 2022-12-17 16:50:22

I have this type of data. Please help me out; I am new to regular expressions, so please explain each step in your answer. Thanks.

7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN

7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN

7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN

7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD

I want to extract only this data from the lines above:

7210315_AX1A_MOTORTRAEGER_VORN_AUSSEN

7210316_W1A_MOTORTRAEGER_VORN_AUSSEN

7210243_U1A_MOTORTRAEGER_VORN_INNEN

7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD

Then, if a code such as AX1A starts with two consecutive letters after the underscore, they should be written as AX_. A single digit and a single letter become -1_ and -A_. After applying this pattern, AX1A becomes AX_-1_-A_, and all other data should remain the same.

Similarly, the next line contains "W1A". It starts with a single letter "W", which should be converted to -W_. The next character is a single digit, so it is converted by the same pattern to -1_, and the last letter is treated the same way, so the result is -W_-1_-A_.

We are only interested in applying the regex to the part that follows the leading digits and underscore:

_AX1A_

_W1A_

_U1A_

_AV21NA_ 

The output should be:

7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN

7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN

7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN

7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD

5 solutions

#1


1  

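use strict;
use warnings;
use feature 'say';    # needed for say() below

# Assumption: the records arrive on STDIN; the original left $input undeclared.
my $input = \*STDIN;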

my $match 
    = qr/
    ( \d+          # group of digits
      _            # followed by an underscore
    )              # end group
    ( \p{Alpha}+ ) # group of alphas             
    ( \d+ )        # group of digits
    ( \p{Alpha}* ) # group of alphas
    ( \w+ )        # group of word characters
    /x
    ;

while ( my $record = <$input> ) { # record of input
    # drop the unwanted 1X50_LI_ / 1X50_RE_ segment first
    $record =~ s/1X50_(?:LI|RE)_//;
    # match and capture
    if ( my ( $pre, $pre_alpha, $num, $post_alpha, $post ) = $record =~ m/$match/ ) {
        say $pre 
             # if the alpha has length 1, add a dash before it
          . ( length $pre_alpha == 1 ? '-' : '' )
            # then the alpha
          . $pre_alpha
            # then the underscore
          . '_'
            # test if the length of the number is 1 and the length of the 
            # trailing alpha string is 1 
          . ( length( $num ) == 1 && length( $post_alpha ) == 1
              # if true, apply a dash before each 
            ? "-$num\_-$post_alpha" 
              # otherwise treat as AV21NA in example.
            : "$num\_$post_alpha"
            )
          . $post
          ;

    }
}

#2


1  

I don't know all the ins and outs of what you need stripped, but I'll extrapolate and let you clarify if this doesn't do quite what you need.

For the first step, removing the 1X50_RE_ and 1X50_LI_ segments, you could search for those strings and replace them with nothing.

Next, to split your second letter/number code into your small chunks, you can use a pair of matches, using a look-ahead on each. However, since you only want to mess with that second code chunk, I'd split the overall line up first, work on the second chunk, and then join the pieces back together again.

while (<$input>) {

    # Replace the 1X50_RE/LI_ bits with nothing (i.e., delete them)
    s/1X50_(RE|LI)_//;

    my @pieces = split /_/; # split the line into pieces at each underscore

    # Just working with the second chunk. /g, means do it for all matches found
    $pieces[1] =~ s/([A-Z])(?=[0-9])/$1_-/g; # Convert AX1 -> AX_-1
    $pieces[1] =~ s/([0-9])(?=[A-Z])/$1_-/g; # Convert 1A -> 1_-A

    # Join the pieces back together again
    $_ = join '_', @pieces;

    print;
}

$_ is the variable many Perl operations work on if you don't specify one. Inside a while condition, <$input> reads the next line from the file handle $input into $_. The s///, split, and print functions work on $_ when no other target is given. The =~ operator is how you tell Perl to use $pieces[1] (or whichever variable you are working on) instead of $_ for regular expression operations. (For split or print, you'd pass the variable as an argument instead, so split /_/ is the same as split /_/, $_ and print is the same as print $_.)
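
To make the $_ defaults concrete, here is a minimal sketch that does the same work twice, once through $_ and once with an explicit variable. The sub names count_pieces and count_pieces_explicit are made up for this illustration and are not part of the answer.

```perl
use strict;
use warnings;

# Implicit form: everything operates on $_.
sub count_pieces {
    local $_ = shift;          # put the argument into $_
    s/1X50_(RE|LI)_//;         # same as: $_ =~ s/1X50_(RE|LI)_//
    my @pieces = split /_/;    # same as: split /_/, $_
    return scalar @pieces;
}

# Explicit form: =~ binds the substitution to $line instead of $_.
sub count_pieces_explicit {
    my $line = shift;
    $line =~ s/1X50_(RE|LI)_//;
    my @pieces = split /_/, $line;
    return scalar @pieces;
}

my $sample = '7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN';
print count_pieces($sample), "\n";
print count_pieces_explicit($sample), "\n";
```

Both calls should report the same number of pieces, since the two subs differ only in whether the target variable is named.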

Oh, and to explain the regular expressions a bit:

s/1X50_(RE|LI)_//;

This matches 1X50_RE_ or 1X50_LI_ anywhere in the line (the (RE|LI) is a list of alternatives) and replaces it with nothing (the empty // at the end).
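
A quick way to check this substitution is to run it over a few of the sample lines from the question. This is only a sketch: drop_1x50 is a hypothetical helper name, and (?:RE|LI) is used as the non-capturing form of (RE|LI) because the captured text is never needed.

```perl
use strict;
use warnings;

sub drop_1x50 {
    my ($line) = @_;
    # Either alternative may match; the whole segment is replaced with nothing.
    $line =~ s/1X50_(?:RE|LI)_//;
    return $line;
}

my @lines = (
    '7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN',
    '7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN',
    '7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD',   # has no 1X50 segment
);

print drop_1x50($_), "\n" for @lines;
```

A line without the segment simply comes back unchanged, because s/// does nothing when the pattern does not match.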

Looking at one of the other lines:

s/([A-Z])(?=[0-9])/$1_-/g;

The plain parentheses (...) around [A-Z] cause $1 to be set to whatever is matched inside (in this case a single letter, A-Z). The (?=...) parentheses form a zero-width positive look-ahead assertion. That means the regular expression only matches if the very next thing in the string matches the inner expression (a digit, 0-9), but that part is not included in the text that is replaced.

The replacement, $1_-, replaces the matched part of the string (the letter matched by [A-Z]) with the value captured by the parentheses, followed by the _- you require. The digit seen by the look-ahead, [0-9], is left untouched.
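
The two look-ahead substitutions put _- at every letter/digit boundary, so W1A comes out as W_-1_-A and AV21NA as AV_-21_-NA, which is close to but not exactly the output listed in the question. The sketch below compares them with one possible reading of the asker's rule (a dash only in front of one-character runs). The sub names answer_style and dash_single_runs are invented for this comparison, and the rule itself is an assumption drawn from the sample output.

```perl
use strict;
use warnings;

# The two substitutions from the answer, applied to one code chunk.
sub answer_style {
    my ($code) = @_;
    $code =~ s/([A-Z])(?=[0-9])/${1}_-/g;   # letter followed by a digit
    $code =~ s/([0-9])(?=[A-Z])/${1}_-/g;   # digit followed by a letter
    return $code;
}

# Assumed rule: split into runs of letters or digits and put a dash
# only in front of runs that are exactly one character long.
sub dash_single_runs {
    my ($code) = @_;
    my @runs = $code =~ /([A-Z]+|[0-9]+)/g;
    return join '_', map { length($_) == 1 ? "-$_" : $_ } @runs;
}

for my $code (qw(AX1A W1A AV21NA)) {
    printf "%-7s %-12s %s\n", $code, answer_style($code), dash_single_runs($code);
}
```

The second sub does not need look-aheads at all: the /g match in list context returns every run, and the dash decision is made per run.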

#3


1  

#!/usr/bin/perl -w
use strict;
while (<>) {
    next if /^\s*$/;
    chomp;
    ## Remove those parts of the line we do not want
    ## You do not specify what, if anything, is constant about
    ## the parts you do not want. One of the following cases should
    ## serve.

    ## i) Remove the string _1X50_ and the next characters between
    ## two underscores:
    s/_1X50_.+?_/_/;

    ## ii) keep the first 2 and last 3 sections of each line.
    ## Uncomment this line and comment the previous one to use this:
    #s/^(.+?_.+?)_.+_(.+_.+_.+)$/$1_$2/;

    ## The line now contains only those regions we are 
    ## interested in. Split on '_' to collect an array of the
    ## different parts (@a):
    my @a=split(/_/);

    ## $a[1] is the second string, eg AX1A,W1A etc.
    ## We search for zero or more letters, followed by zero or more digits
    ## followed by zero or more letters. The 'i' modifier makes the match
    ## case insensitive and the 'g' modifier makes the search global. In
    ## list context the match returns the captured groups, which we
    ## store in the @matches array.
    my @matches=($a[1]=~/^([a-z]*)(\d*)([a-z]*)/ig);

    ## So, for each of the matched strings, if the length of the match
    ## is less than 2, add a '-' to the beginning of the string:
    foreach my $match (@matches) {
        if (length($match)<2) {
            $match="-" . $match;
        }
    }
    ## Now replace the original $a[1] with each string in
    ## @matches, connected by '_':
    $a[1]=join("_", @matches);

    ## Finally, build the string $kk by joining each element
    ## of the line (@a) by a '_', and print:
    my $kk=join("_", @a);
    print "$kk\n";
}
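
The list-context match at the heart of this script can be tried on its own. The following is a sketch only: split_code is a hypothetical name, the /g modifier is left out because the anchored pattern can match just once, and the grep that skips empty groups is an addition that the original loop does not have.

```perl
use strict;
use warnings;

sub split_code {
    my ($code) = @_;
    # In list context a successful match returns the three captured groups.
    my @parts = $code =~ /^([a-z]*)(\d*)([a-z]*)/i;
    # Prefix parts shorter than two characters with a dash; skip empty parts.
    return join '_', map { length($_) < 2 ? "-$_" : $_ } grep { length } @parts;
}

print split_code($_), "\n" for qw(AX1A W1A U1A AV21NA);
```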

#4


1  

Do you mean something like this?

while (<DATA>) {
    s/1X50_(LI|RE)_//;
    s/(\d+)_([A-Z])(\d)([A-Z])/$1_-$2_-$3_-$4/;
    s/(\d+)_([A-Z]{2})(\d)([A-Z])/$1_$2_-$3_-$4/;
    s/(\d+)_([A-Z]{1,2})(\d+)([A-Z]+)/$1_$2_$3_$4/;
    print;
}

__DATA__
7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN
7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN
7210243_U1A_1X50_LI_MOTORTRAEGER_VORN_INNEN
7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD

output:

7210315_AX_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210316_-W_-1_-A_MOTORTRAEGER_VORN_AUSSEN
7210243_-U_-1_-A_MOTORTRAEGER_VORN_INNEN
7210330_AV_21_NA_ABSTUETZUNG_STUETZTRAEGER_RAD

#5


-1  

zostay's suggestion of splitting the line may make things easier if you are a regex beginner. However, skipping the split saves a little work per line. Here is how to do it without splitting:

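open IN_FILE, "filename" or die "Whoops!  Can't open file.";
while (<IN_FILE>)
{
     # Note: this puts a dash before every chunk (AX1A -> -AX-1-A); it does
     # not reproduce the AX_-1_-A form shown in the question.
     s/^\d{7}_\K([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})/-${1}-${2}-${3}/ 
          or warn "line didn't match: $_";
     s/1X50_(LI|RE)_//;
     print;
}
close IN_FILE;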

Breaking down the first pattern:

s/// is the search-and-replace operator.

^ matches the beginning of the line.

\d{7}_ matches seven digits followed by an underscore.

\K is the "keep" assertion (Perl 5.10 and later). It works much like a look-behind: whatever was matched before it is left out of the text that gets replaced.

() each set of parentheses captures a chunk of the match. The chunks are put into the match variables $1, $2, etc. in order.

[A-Z]{1,2} matches one or two capital letters. The other two groups follow the same idea: \d{1,2} is one or two digits, and the last group is again one or two capital letters.

-${1}-${2}-${3} replaces what matched with the first three match variables, each preceded by a dash. The curly braces are only there to make clear where each variable name ends.
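
If the goal is the exact output listed in the question, one option is to keep the \K pattern and compute the replacement with the /e modifier. This is a sketch under the assumption that a dash belongs only in front of one-character chunks and that chunks are joined with underscores. The sub names dash_if_single and convert are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: dash a chunk only if it is one character long.
sub dash_if_single {
    my ($chunk) = @_;
    return length($chunk) == 1 ? "-$chunk" : $chunk;
}

sub convert {
    my ($line) = @_;
    $line =~ s/1X50_(?:LI|RE)_//;
    # \K keeps the seven digits and the underscore out of the replaced text;
    # /e evaluates the replacement as Perl code instead of a string.
    $line =~ s/^\d{7}_\K([A-Z]{1,2})(\d{1,2})([A-Z]{1,2})/
        join '_', map { dash_if_single($_) } $1, $2, $3/e;
    return $line;
}

print convert($_), "\n" for (
    '7210315_AX1A_1X50_LI_MOTORTRAEGER_VORN_AUSSEN',
    '7210316_W1A_1X50_RE_MOTORTRAEGER_VORN_AUSSEN',
    '7210330_AV21NA_ABSTUETZUNG_STUETZTRAEGER_RAD',
);
```

Because the decision about the dash is made in code, a single substitution covers the AX1A, W1A and AV21NA shapes without a cascade of separate patterns.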
