Improving a regex to match JavaScript comments

Date: 2022-11-22 21:09:32

I used the regex given in perlfaq6 to match and remove JavaScript comments, but it causes a segmentation fault when the string is too long. The regex is:

s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;

Can it be improved to avoid the segmentation fault?

[EDIT]

Long input:

<ent r=\"6\" t=\"259\" w=\"252\" /><ent r=\"6\" t=\"257\" w=\"219\" />

repeated about 1000 times.

1 Answer

#1


I suspect the trouble is partly that your 'C code' isn't very much like C code. In C, you can't have the sequence \" outside a pair of quotes, single or double, for example.
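To make that concrete, here is a minimal sketch (not part of the original test; `quote_counts` is a name invented for illustration) that counts the double quotes in the sample line and how many of them directly follow a backslash. It assumes the sample is taken literally, backslashes included. As I read the regex, the double-quoted-string alternative can start at any of these quotes but never reaches an unescaped closing quote, so it has to run to the end of the input before giving up.

```perl
use strict;
use warnings;

# The sample line from the question, kept literally (q{} does not interpret \").
my $sample = q{<ent r=\"6\" t=\"259\" w=\"252\" /><ent r=\"6\" t=\"257\" w=\"219\" />};

# Return (total double quotes, double quotes directly preceded by a backslash).
# Simplification: an escaped backslash followed by a quote (\\") is not handled.
sub quote_counts {
    my ($text) = @_;
    my $total   = () = $text =~ /"/g;
    my $escaped = () = $text =~ /\\"/g;
    return ($total, $escaped);
}

my ($total, $escaped) = quote_counts($sample);
printf "%d double quotes, %d of them escaped\n", $total, $escaped;
```

For the sample line both counts come out equal, so no quote is ever available to close a string.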

I reformatted the regex to make it readable and wrapped it in a trivial script that slurps its input and applies the regex to it:

#!/usr/bin/env perl

### Original regex from PerlFAQ6.
### s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;

undef $/;  # Slurp input

while (<>)
{
    print "raw: $_";

    s%
        /\*[^*]*\*+([^/*][^*]*\*+)*/    # Simple C comments
     |  //([^\\]|[^\n][\n]?)*?\n        # C++ comments, allowing for backslash-newline continuation
     |  (
            "(\\.|[^"\\])*"             # Double-quoted strings
        |   '(\\.|[^'\\])*'             # Single-quoted characters
        |   .[^/"'\\]*                  # Anything else
        )
     %    defined $3 ? $3 : ""
     %egsx;

    print "out: $_";
}

I took your line of non-C code, and created files data.1, data.2, data.4, data.8, ..., data.1024 with the appropriate number of lines in each. I then ran a timing loop.
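The data files can be produced in several ways; the following is one possible sketch rather than the script actually used. It assumes that data.N holds exactly N copies of the sample line, one per line, and the helper names `make_data` and `write_data_files` are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The sample line from the question, kept literally (q{} does not interpret \").
my $line = q{<ent r=\"6\" t=\"259\" w=\"252\" /><ent r=\"6\" t=\"257\" w=\"219\" />};

# Return the sample line repeated $count times, one copy per line.
sub make_data {
    my ($count) = @_;
    return join '', map { "$line\n" } 1 .. $count;
}

# Write data.1, data.2, data.4, ..., data.1024 into the directory $dir.
sub write_data_files {
    my ($dir) = @_;
    for (my $n = 1; $n <= 1024; $n *= 2) {
        my $path = "$dir/data.$n";
        open my $fh, '>', $path or die "Cannot write $path: $!";
        print {$fh} make_data($n);
        close $fh or die "Cannot close $path: $!";
    }
    return;
}

# Only write files when a target directory is given on the command line.
write_data_files($ARGV[0]) if @ARGV;
```

Saved under a name of your choice, it would be run with the target directory as its only argument, for example `.` for the current directory.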

$ for x in 1 2 4 8 16 32 64 128 256 512 1024
> do
>     echo
>     echo $x
>     time perl xx.pl data.$x > /dev/null
> done
$

I've munged the output to give just the real time for the different file sizes:

   1    0m0.022s
   2    0m0.005s
   4    0m0.007s
   8    0m0.013s
  16    0m0.035s
  32    0m0.130s
  64    0m0.523s
 128    0m2.035s
 256    0m6.756s
 512    0m28.062s
1024    1m36.134s
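As a quick check on how these times scale, the sketch below divides each time by the time for half as many lines. It uses only the figures from the table, converted to seconds; the smallest files are left out on the assumption that process start-up dominates there (the 1-line run is slower than the 2-line run). The name `doubling_ratio` is made up for illustration.

```perl
use strict;
use warnings;

# Real times from the table above, in seconds, keyed by number of lines.
my %seconds = (
    32  => 0.130,
    64  => 0.523,
    128 => 2.035,
    256 => 6.756,
    512 => 28.062,
    1024 => 96.134,    # 1m36.134s
);

# Ratio of the time for $n lines to the time for $n / 2 lines.
sub doubling_ratio {
    my ($n) = @_;
    return $seconds{$n} / $seconds{ $n / 2 };
}

for my $n (64, 128, 256, 512, 1024) {
    printf "%4d lines: %.2f times the %d-line run\n", $n, doubling_ratio($n), $n / 2;
}
```

Each doubling of the input multiplies the time by somewhere between roughly 3.3 and 4.2, which is about what quadratic growth (a factor of 4 per doubling) would look like.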

I did not get a core dump (Perl 5.16.0 on Mac OS X 10.7.4; 8 GiB main memory), but the run does begin to take a significant amount of time. The process was not growing while it ran; during the 1024-line run, it was using about 13 MiB of 'real' memory and 23 MiB of 'virtual' memory.

I tried Perl 5.10.0 (the oldest version I have compiled on my machine), and it used slightly less 'real' memory, essentially the same 'virtual' memory, and was noticeably slower (33.3s for 512 lines; 1m 53.9s for 1024 lines).

Just for comparison purposes, I collected some C code that I had lying around in the test directory to create a file of about 88 KiB, with 3100 lines of which about 200 were comment lines. That is comparable to the data.1024 file, which was about 77 KiB. Processing the C file took between 10 and 20 milliseconds.

Summary

The non-C source you have makes a very nasty test case. Perl shouldn't crash on it.

Which version of Perl are you using, and on which platform? How much memory does your machine have? That said, the total quantity of memory is unlikely to be the issue (24 MiB is not a problem on most machines that run Perl). If you have a very old version of Perl, the results might be different.


I also note that the regex does not handle some pathological C comments that a C compiler must handle, such as:

/\
\
* Yes, this is a comment *\
\
/
/\
\
/ And so is this

Yes, you'd be right to reject any code submitted for review that contained such comments.
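If such input really had to be handled, one option would be to delete every backslash-newline pair before applying the regex, which is what a C compiler does early in translation. The sketch below is only an illustration under that assumption: `splice_lines` and `strip_comments` are names made up here, the regex is the perlfaq6 one with different delimiters, and splicing changes line numbers in the output. As far as I know, JavaScript has no such splicing outside string literals, so this applies to C-like input only.

```perl
use strict;
use warnings;

# Delete every backslash-newline pair (C-style line splicing).
sub splice_lines {
    my ($text) = @_;
    $text =~ s/\\\n//g;
    return $text;
}

# The perlfaq6 regex from the question, unchanged apart from the delimiters.
sub strip_comments {
    my ($text) = @_;
    $text =~ s{/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)}{defined $3 ? $3 : ""}gse;
    return $text;
}

# The two pathological comments shown above.
my $source = <<'END';
/\
\
* Yes, this is a comment *\
\
/
/\
\
/ And so is this
END

my $joined = splice_lines($source);
print "joined:\n$joined";
print "stripped: [", strip_comments($joined), "]\n";
```

After splicing, the input reads as an ordinary `/* ... */` comment followed by a `//` comment, and the regex removes both; without the splicing step it leaves the text untouched.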
