Why do people use regexps for email and other complex validation?

Time: 2022-10-29 16:58:52

There are a number of email regexp questions popping up here, and I'm honestly baffled why people are using these insanely obtuse matching expressions rather than a very simple parser that splits the email up into the name and domain tokens, and then validates those against the valid characters allowed for name (there's no further check that can be done on this portion) and the valid characters for the domain. I suppose you could also add checking for all the world's TLDs, and then another level for second-level domains in countries that have them (e.g., co.uk).
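
For concreteness, here is a minimal Python sketch of the split-and-validate parser the question describes. The function name `split_and_validate` and both character sets are assumptions made for illustration; they are not taken from an RFC or from any library, and the check looks at characters only, exactly as proposed.

```python
import string

# Assumed character sets, for illustration only; the real rules are broader.
NAME_CHARS = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~.")
DOMAIN_CHARS = set(string.ascii_letters + string.digits + "-.")


def split_and_validate(address: str) -> bool:
    """Split on the last '@' and check each token against its character set."""
    name, sep, domain = address.rpartition("@")
    if not sep or not name or not domain:
        return False  # no '@' at all, or nothing on one side of it
    return set(name) <= NAME_CHARS and set(domain) <= DOMAIN_CHARS
```

A call such as `split_and_validate("adam@example.com")` passes, while an address with a space in the domain or with no '@' is rejected. Because only characters are checked, structurally odd input can still get through.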

The real problem is that the TLDs and SLDs keep changing (contrary to popular belief), so if you plan on doing all this high-level checking, you have to keep updating the regexp whenever the root name servers send down a change.

Why not have a module that simply validates domains, which pulls from a database, or flat file, and optionally checks DNS for matching records?
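
A rough Python sketch of such a module, under stated assumptions: the flat file holds one TLD per line with `#` comments, and the names `load_tlds`, `default_resolver` and `domain_is_valid` are hypothetical. The optional DNS step uses a plain address lookup as a stand-in, since the standard library has no MX query; a real mail check would probably want MX records.

```python
import socket
from typing import Callable, Iterable, Optional, Set


def load_tlds(lines: Iterable[str]) -> Set[str]:
    """Read one TLD per line, skipping blank lines and '#' comments."""
    tlds = set()
    for line in lines:
        entry = line.strip().lower()
        if entry and not entry.startswith("#"):
            tlds.add(entry)
    return tlds


def default_resolver(domain: str) -> bool:
    """True if the name resolves to any address (requires network access)."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False


def domain_is_valid(domain: str, tlds: Set[str],
                    resolver: Optional[Callable[[str], bool]] = None) -> bool:
    """Check label shape and TLD membership, then optionally ask DNS."""
    labels = domain.lower().split(".")
    if len(labels) < 2 or not all(labels):
        return False  # needs at least two labels, none of them empty
    if labels[-1] not in tlds:
        return False
    return resolver(domain) if resolver is not None else True
```

Updating the TLD list then means replacing a data file rather than editing a pattern, and passing `resolver=default_resolver` turns on the DNS check only where the latency is acceptable.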

I'm being serious here: why is everyone so keen on inventing the perfect regexp for this? It doesn't seem to be a suitable solution to the problem...

Convince me that it's not only possible to do in regexp (and satisfy everyone) but that it's a better solution than a custom parser/validator.

-Adam

12 answers

#1


24  

They do it because they see "I want to test whether this text matches the spec" and immediately think "I know, I'll use a regex!" without fully understanding the complexity of the spec or the limitations of regexes. Regexes are a wonderful, powerful tool for handling a wide variety of text-matching tasks, but they are not the perfect tool for every such task and it seems that many people who use them lose sight of that fact.

#2


8  

Regexes that catch most (but not all) common errors are relatively easy to set up and deploy. It takes longer to write a custom parser.

#3


8  

The temptation to use RegExp, once you've mastered the basics, is very strong. In fact, RegExp seems so powerful that people naturally want to start using it everywhere. I really suspect that there's a lot of psychology involved here, as demonstrated by Randall's XKCD comic (and yes, it is useful).

I once gave an introductory presentation on RegExp, and the most important slide warned against its overuse. It was the only slide that used bold font. I believe this should be done more often.

#4


4  

Using regular expressions for this is not a good idea, as has been demonstrated at length in those other posts.

I suppose people keep doing it because they don't know any better or don't care.

Will a parser be any better? Maybe, maybe not.

I maintain that sending a verification e-mail is the best way to validate it. If you want to check anything from JavaScript, then check that it has an '@' sign in there and something before and after it. If you go any stricter than that, you risk running up against some syntax you didn't know about, and your validator will become overly restrictive.
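
The deliberately loose check described above fits in a few lines. This is a Python sketch rather than JavaScript, and the name `looks_like_email` is made up for illustration; it asserts nothing beyond "an '@' with something on each side".

```python
def looks_like_email(text: str) -> bool:
    """True if the first '@' has at least one character on each side of it."""
    at = text.find("@")
    return 0 < at < len(text) - 1
```

Anything that passes still has to be confirmed by the verification e-mail; the check only catches obvious slips before the form is submitted.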

Also, be careful with that TLD validation scheme of yours; you might find that you are assuming too much about what is allowed in a TLD.

#5


3  

People do it because in most languages it is way easier to write a regexp than to write and use a parser in your code (or so it seems, at least).

If you decide to eschew regexes, you will have to either write parsers by hand or resort to external tools (like yacc) for lexer/parser generation. This is way more complex than a single-line regex match.

One needs a library that makes it easy to write parsers directly in language X (where 'X' is C, C++, C#, Java) to be able to build custom parsers with the same ease as regular expression matchers.

Such libraries originated in the functional world (Haskell and ML), but nowadays "parser combinator libraries" exist for Java, C++, C#, Scala and other mainstream languages.
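
To give a feel for the style, here is a hand-rolled toy in Python. It is not the API of any real combinator library: `char_in`, `many1`, `sep_by1` and `seq` are hypothetical helpers written for this sketch, and the grammar at the bottom is a simplification, not the RFC grammar.

```python
import string
from typing import Callable, Optional

# A parser takes (text, position) and returns the new position, or None on failure.
Parser = Callable[[str, int], Optional[int]]


def char_in(allowed: str) -> Parser:
    """Match exactly one character drawn from `allowed`."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + 1 if pos < len(text) and text[pos] in allowed else None
    return parse


def many1(p: Parser) -> Parser:
    """Match `p` one or more times, as far as it will go."""
    def parse(text: str, pos: int) -> Optional[int]:
        end = p(text, pos)
        while end is not None:
            nxt = p(text, end)
            if nxt is None:
                return end
            end = nxt
        return None
    return parse


def sep_by1(item: Parser, sep: Parser) -> Parser:
    """Match `item (sep item)*`, stopping before a separator with no item after it."""
    def parse(text: str, pos: int) -> Optional[int]:
        end = item(text, pos)
        while end is not None:
            after_sep = sep(text, end)
            nxt = item(text, after_sep) if after_sep is not None else None
            if nxt is None:
                return end
            end = nxt
        return None
    return parse


def seq(*parsers: Parser) -> Parser:
    """Match each parser in order, failing as soon as one of them fails."""
    def parse(text: str, pos: int) -> Optional[int]:
        cur: Optional[int] = pos
        for p in parsers:
            cur = p(text, cur)
            if cur is None:
                return None
        return cur
    return parse


# Toy grammar: dotted words, an '@', then dotted words again.
word = many1(char_in(string.ascii_letters + string.digits + "_+-"))
dotted = sep_by1(word, char_in("."))
address = seq(dotted, char_in("@"), dotted)


def matches(text: str) -> bool:
    """True only if the whole string is consumed by the toy grammar."""
    return address(text, 0) == len(text)
```

The grammar reads almost like its description, and a rule such as "no empty word between dots" falls out of `sep_by1` instead of being encoded in a pattern.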

#6


3  

People use regexes for email addresses, HTML, XML, etc. because:

  1. It looks like they should work and they often do work for the obvious cases.
  2. They "know" regular expressions. When all you have is a hammer all your problems look like nails.
  3. Writing a parser is harder (or seems harder) than writing a regular expression. In particular, writing a parser is harder than writing a regex that handles the obvious cases in #1.
  4. They don't understand the full complexity of the task.
  5. They don't understand the limitations of regular expressions.
  6. They start with a regex that handles the obvious cases and then try to extend it to handle others. They get locked into one approach.
  7. They aren't aware that there's (probably) a library available to do the work for them.

#7


3  

and then validates those against the valid characters allowed for name (there's no further check that can be done on this portion)

This is not true. For example, "ben..doom@gmail.com" contains only valid characters in the name section, but is not valid.
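
The structural rule behind that example is easy to state in code. This Python sketch covers only the unquoted (dot-atom) form of the local part, and the name `local_part_dots_ok` is an assumption for illustration.

```python
def local_part_dots_ok(local: str) -> bool:
    """Reject empty atoms: a leading dot, a trailing dot, or two dots in a row."""
    # Splitting on '.' yields an empty string wherever an atom is missing.
    return all(local.split("."))
```

`local_part_dots_ok("ben..doom")` is false even though every character in it is allowed, which is exactly the kind of check a character whitelist alone misses.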

In languages that do not have libraries for email validation, I generally use regex because:

  1. I know regex, and find it easy to use.
  2. I have many friends who know regex, and whom I can collaborate with.
  3. It's fast for me to code, and me-time is more expensive than processor-time for most applications.
  4. For the majority of email addresses, it works.

I'm sure many built-in libraries do use your approach, and if you want to cover all the possibilities, it does get ridiculous. However, so does your parser. The formal spec for email addresses is absurdly complex. So, we use a regex that gets close enough.

#8


3  

I don't believe correct email validation can be done with a single regular expression (now there's a challenge!). One of the issues is that comments can be nested to an arbitrary depth in both the local part and the domain.
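
Arbitrary nesting is the part a classic regular expression cannot count, while a procedural check needs only a depth counter. The Python sketch below is a simplification under stated assumptions: it tracks parentheses and backslash escapes only, ignores quoted strings, and the name `comments_balanced` is hypothetical.

```python
def comments_balanced(text: str) -> bool:
    """True if every '(' has a matching ')', at any nesting depth."""
    depth = 0
    escaped = False
    for ch in text:
        if escaped:
            escaped = False  # the character after a backslash is literal
        elif ch == "\\":
            escaped = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a close with no open before it
    return depth == 0 and not escaped
```

A full validator would do this as one step of a larger parse, but even on its own it shows why the job is easier with a loop than with a single pattern.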

If you want to validate an address against RFCs 5322 and 5321 (the current standards) then you'll need a procedural function to do so.

Fortunately, this is a commodity problem. Everybody wants the same result: RFC compliance. There's no need for anybody to write this code ever again once it's been solved by an open source function.

Check out some of the alternatives here: http://www.dominicsayers.com/isemail/

If you know of another function that I can add to the head-to-head, let me know.

#9


2  

We're just looking for a fast way to see if the email address is valid so that we can warn the user they have made a mistake or prevent people from entering junk easily. Going off to the mail server and fingering it is slow and unreliable. The only real way to be sure is to get a confirmation email, but the problem is only to give a fast response to the user before the confirmation process takes place. That's why it's not so important to be strictly compliant. Anyway, it's a challenge and it's fun.

#10


1  

People write regular expressions because most developers like to solve a simple problem in the most "cool" and "efficient" way (which means that it should be as unreadable as possible).

In Java, there are libraries to check if a String represents an email address without you having to know anything about regular expressions. Similar libraries should be available for other languages as well.

As Jamie Zawinski said in 1997: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

#11


1  

One factor: the set of people who understand how to write a regular expression is very much larger than the set of people who understand the formal constraints on regular languages. The same goes for non-regular "regular expressions".

#12


-3  

Regexps are much faster to use, of course, and they only validate what's specified in the RFC. Write a custom parser? What? It takes 10 seconds to use a regexp.
