Regex to parse C# source code to find all strings

I asked this question a long time ago; I wish I had read the answers to "When not to use Regex in C# (or Java, C++ etc)" first!

I wish to use Regex (regular expressions) to get a list of all strings in my C# source code, including strings that have double quotes embedded in them.

This should not be hard; however, before I spend time trying to build the regex up, does anyone already have a "pre-canned" one?

This is not as easy as it seems at first, due to cases like:

"av\"d"
@"ab""cd"
@"ab"""
@"""ab"
etc.

4 answers

#1

I am posting this as my answer so it stands out to others reading the question.

As has been pointed out in the helpful comments to my question, it is clear that regex is not a good tool for finding strings in C# code. I could have written a simple "parser" in the time I spent reminding myself of the regex syntax. ("Parser" is an overstatement: there are no " characters in comments etc., because it is my own source code I am dealing with.)

This seems to sum it up well:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

However, until it breaks on my code I will use the regular expression Blixt has posted, but if it gives me problems I will not spend much time trying to fix it before writing my own parser. E.g. as a C# string it is:

@"@Q(?:[^Q]+|QQ)*Q|Q(?:[^Q\\]+|\\.)*Q".Replace('Q', '\"')

Update: the above regex had problems, so I just wrote my own parser. Including the unit tests, it took about 2 hours to write. That's a lot less time than I spent just trying to find (and test) a pre-canned regex on the web.
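
The parser mentioned above was not posted, so the following is only a minimal sketch of what such a hand-written scanner could look like, not the author's code. It is written in Python for brevity; the function name `find_csharp_strings` is hypothetical, and it makes the same simplifying assumption as the answer: no quote characters appear in comments or character literals.

```python
def find_csharp_strings(source):
    """Return every C# string literal in `source`, delimiters included.

    Assumption: no quote characters occur in comments or char literals.
    """
    found = []
    i, n = 0, len(source)
    while i < n:
        if source.startswith('@"', i):
            # Verbatim string: the only special sequence is "" (one quote).
            start, i = i, i + 2
            while i < n:
                if source[i] == '"':
                    if source.startswith('""', i):
                        i += 2  # doubled quote, still inside the literal
                        continue
                    break  # a lone quote closes the literal
                i += 1
            i += 1  # step past the closing quote
            found.append(source[start:i])
        elif source[i] == '"':
            # Regular string: a backslash escapes the next character.
            start, i = i, i + 1
            while i < n and source[i] != '"':
                i += 2 if source[i] == '\\' else 1
            i += 1  # step past the closing quote
            found.append(source[start:i])
        else:
            i += 1
    return found


sample = r'''var a = "av\"d"; var b = @"ab""cd"; var c = @"ab"""; var d = @"""ab";'''
for literal in find_csharp_strings(sample):
    print(literal)
```

The sample line contains the four awkward cases from the question, which makes it a convenient starting point for unit tests.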

The problem I seem to have is that I tend to avoid regex and just write the string-handling code myself, and then a lot of people claim I am wasting the client's money by not using regex. However, whenever I try to use regex, what seems like a simple match pattern quickly becomes much harder. (None of the online articles on using regex in .NET that I have read has a good introduction that makes it clear when NOT to use regex. Likewise with its MSDN documentation.)

Let's see if we can help solve this problem: I have just created a Stack Overflow question, "When not to use Regex".

#2

The regular expression for finding C-style strings is:

"(?:[^"\\]+|\\.)*"

This will not take comments into consideration, so your best bet would be to remove all comments first, using the following regular expression:

/\*(?s:(?!\*/).)*\*/|//.*

Note that if you put the above regular expressions in a string you'll need to double all backslashes and escape any quotation marks.
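
As a rough illustration of the two-step approach described above (strip comments, then match strings), here is a sketch in Python, whose `re` module accepts both patterns unchanged. The function name `strings_outside_comments` is made up for this example. Python raw strings (`r'...'`) are used so the backslashes do not have to be doubled.

```python
import re

# The two patterns from this answer, unchanged.
STRING_RE = re.compile(r'"(?:[^"\\]+|\\.)*"')
COMMENT_RE = re.compile(r'/\*(?s:(?!\*/).)*\*/|//.*')


def strings_outside_comments(source):
    """Remove comments first, then return all C-style string literals."""
    without_comments = COMMENT_RE.sub('', source)
    return STRING_RE.findall(without_comments)


code = 'a = "x\\"y"; /* "no"\n "nope" */ b = "z"; // "gone"\n'
print(strings_outside_comments(code))
```

One limitation of stripping comments first: a string that itself contains `//` or `/*` (a URL, for instance) will be damaged by the comment pass, so this only suits code where that does not occur.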

Update: Changed the regular expression for comments to use the DOTALL flag for multi-line comments.

Also, you may want to support literal (verbatim, @-prefixed) strings, so use this instead of the other string regex:

@"(?:[^"]+|"")*"|"(?:[^"\\]+|\\.)*"
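
A quick way to check this combined pattern is to run it against the four cases from the question. The sketch below does that in Python (the pattern itself is unchanged; the helper name `find_strings` is only for illustration).

```python
import re

# Verbatim strings (@"..." with "" as an embedded quote) or regular strings.
CSHARP_STRING_RE = re.compile(r'@"(?:[^"]+|"")*"|"(?:[^"\\]+|\\.)*"')


def find_strings(source):
    """Return all regular and verbatim string literals in `source`."""
    return CSHARP_STRING_RE.findall(source)


line = r'''var a = "av\"d"; var b = @"ab""cd"; var c = @"ab"""; var d = @"""ab";'''
for literal in find_strings(line):
    print(literal)
```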

And a reminder: don't use DOTALL as a global flag for any of these regular expressions, as it would break the single-line comments and single-line strings (normal strings are single-line, while literal strings can span multiple lines).
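
The effect of a global DOTALL flag is easy to see on the comment pattern. In this small Python sketch (the sample source text is invented), the scoped `(?s:...)` version removes only the comments, while the global flag lets `//.*` run to the end of the input.

```python
import re

COMMENT_PATTERN = r'/\*(?s:(?!\*/).)*\*/|//.*'
source = 'int a = 1; // first\nint b = 2; /* multi\nline */ int c = 3;\n'

# DOTALL applies only inside the (?s:...) group: both comments are removed.
scoped = re.sub(COMMENT_PATTERN, '', source)

# DOTALL applied globally: "//.*" now also matches newlines, so everything
# after the first "//" is deleted.
global_dotall = re.sub(COMMENT_PATTERN, '', source, flags=re.DOTALL)

print(repr(scoped))
print(repr(global_dotall))
```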

#3

Via www.regular-expressions.info:

"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" matches a single-line string in which the quote character can appear if it is escaped by a backslash. Though this regular expression may seem more complicated than it needs to be, it is much faster than simpler solutions which can cause a whole lot of backtracking in case a double quote appears somewhere all by itself rather than part of a string. "[^"\\]*(?:\\.[^"\\]*)*" allows the string to span multiple lines.

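
The shape of these two patterns, a run of ordinary characters followed by repeated "escape, then another run", is often called unrolling the loop. As a small check of the single-line versus multi-line behaviour described in the quote, here is a Python sketch (the sample text is invented for this example):

```python
import re

# Single-line variant: \r and \n are excluded from the string body.
SINGLE_LINE = re.compile(r'"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"')
# Multi-line variant: line breaks are allowed inside the string.
MULTI_LINE = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

text = 'x = "one\\"two"; y = "split\nline";'

print(SINGLE_LINE.findall(text))
print(MULTI_LINE.findall(text))
```

The first pattern finds only the string with the escaped quote; the second also picks up the string that contains a line break.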

#4

My 5 cents: the expressions I use in my own C# parser:

normal string:

"((\")|[^"\]|\)"

verbatim string:

@("[^"]*")+
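
The verbatim pattern above treats a verbatim string as `@` followed by one or more adjacent quoted chunks, so `@"ab""cd"` is read as `"ab"` plus `"cd"`. A short Python sketch of that idea (the helper name `find_verbatim` is hypothetical; `finditer` is used because the pattern has a capturing group):

```python
import re

VERBATIM_RE = re.compile(r'@("[^"]*")+')


def find_verbatim(source):
    """Return all verbatim (@"...") string literals in `source`."""
    # group(0) is the whole match; the capturing group holds only the last chunk.
    return [m.group(0) for m in VERBATIM_RE.finditer(source)]


code = 'a = @"ab""cd"; b = @"ab"""; c = @"""ab"; d = "plain";'
print(find_verbatim(code))
```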
