How can I find an email address in HTML code with Nokogiri?

How can I find an email address inside HTML code with Nokogiri? I suppose I will need to use a regex, but I don't know how.

Example code

    <html>
    <title>Example</title>
    <body>
    This is an example text.
    example@example.com
    </body>
    </html>

There is an existing answer that covers the case where the address is in a mailto: href, but that is not my situation. The email addresses are sometimes inside a link, but not always.

Thanks

2 Solutions

#1

If you're just trying to parse the email address from a string that just so happens to be HTML, Nokogiri isn't needed for this.

    html_string   = "Your HTML here..."
    # String#[] returns the first match, or nil when nothing matches (match(...)[0] would raise on nil)
    email_address = html_string[/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i]

This isn't a perfect solution though, as the RFC defining what constitutes a 'valid' email address is very lenient. This means most regular expressions you come across (the one above included) do not account for valid edge-case addresses. For example, according to the RFC

    $A12345@example.com

is a valid email address, but the regular expression above will not match it in full as it stands: its local-part character class does not include `$`, so the leading `$` is dropped.
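
To make that limitation concrete, here is a minimal sketch that wraps the pattern from this answer in a helper. The names `SIMPLE_EMAIL_RE` and `first_email` are hypothetical and used only for illustration. It suggests that the pattern does not fail outright on the `$` address; it returns the address with the leading `$` missing.

```ruby
# The regex from this answer, stored under a hypothetical constant name.
SIMPLE_EMAIL_RE = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i

# Hypothetical helper: returns the first address-like substring, or nil.
def first_email(text)
  text[SIMPLE_EMAIL_RE]
end

p first_email('Contact: example@example.com')  # => "example@example.com"
p first_email('Contact: $A12345@example.com')  # => "A12345@example.com" ('$' is lost)
p first_email('no address here')               # => nil
```

Because `$` is not in the local-part character class, the match simply starts one character later, which is easy to miss when only checking whether something matched.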

Suggested Reading: http://haacked.com/archive/2007/08/21/i-knew-how-to-validate-an-email-address-until-i.aspx
Regex source: http://www.dzone.com/snippets/ruby-method-extract-emails

#2

Just use a regex on the HTML string; there is no need for Nokogiri (as @deefour suggested). For the regex itself, I'd suggest the one (called AUTO_EMAIL_RE) used by the Rails autolink gem:

    /[\w.!#\$%+-]+@[\w-]+(?:\.[\w-]+)+/

This should catch those edge cases that stricter regex filters miss:

    RE = /[\w.!#\$%+-]+@[\w-]+(?:\.[\w-]+)+/

    RE.match('abc@example.com')
    #=> #<MatchData "abc@example.com">

    RE.match('$A12345@example.com')
    #=> #<MatchData "$A12345@example.com">

Note that if you really want to match all valid email addresses, you're going to need a mighty big regex.
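
Since the question mentions addresses that are sometimes inside a link and sometimes plain text, a rough sketch of collecting every address from the raw HTML string might look like the following. It uses `String#scan` with the pattern quoted in this answer; the helper name `extract_emails` and the extra `mailto:` link added to the sample HTML are assumptions made for illustration, not part of the original question.

```ruby
# Pattern quoted in this answer, stored under a hypothetical constant name.
EMAIL_RE = /[\w.!#\$%+-]+@[\w-]+(?:\.[\w-]+)+/

# Hypothetical helper: returns every distinct address found in the text.
def extract_emails(text)
  # scan returns all non-overlapping matches; uniq drops repeats
  # (for example, an address that appears in both href and link text).
  text.scan(EMAIL_RE).uniq
end

# Sample HTML from the question, plus one assumed mailto: link.
html = <<~HTML
  <html>
  <title>Example</title>
  <body>
  This is an example text.
  example@example.com
  <a href="mailto:sales@example.com">sales@example.com</a>
  </body>
  </html>
HTML

p extract_emails(html)
# => ["example@example.com", "sales@example.com"]
```

Working on the raw string means addresses in attributes and in text nodes are both picked up without parsing the document, at the cost of also matching anything address-shaped inside scripts or comments.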
