Best way to parse a string of email addresses

Date: 2022-10-22 09:34:18

I am working with some email header data, and for the to:, from:, cc:, and bcc: fields the email address(es) can be expressed in a number of different ways:

First Last <name@domain.com>
Last, First <name@domain.com>
name@domain.com

And these variations can appear in the same message, in any order, all in one comma-separated string:

First, Last <name@domain.com>, name@domain.com, First Last <name@domain.com>

I've been trying to come up with a way to parse this string into separate First Name, Last Name, E-Mail for each person (omitting the name if only an email address is provided).

Can someone suggest the best way to do this?

I've tried to Split on the commas, which would work except in the second example, where the last name is placed first. I suppose this method could work if, after I split, I examine each element to see whether it contains a '@' or '<'/'>'; if it doesn't, it could be assumed that the next element is the first name. Is this a good way to approach this? Have I overlooked another format the address could be in?


UPDATE: Perhaps I should clarify a little. Basically, all I am looking to do is break up the string containing the multiple addresses into individual strings, each containing one address in whatever format it was sent in. I have my own methods for validating and extracting the information from an address; it was just tricky for me to figure out the best way to separate each address.

Here is the solution I came up with to accomplish this:

String str = "Last, First <name@domain.com>, name@domain.com, First Last <name@domain.com>, \"First Last\" <name@domain.com>";

List<string> addresses = new List<string>();
int atIdx = -1;      // position of the '@' in the address being read (-1: none seen yet)
int commaIdx = -1;   // position of the most recent comma
int lastComma = 0;   // start of the address being read (just past the last separating comma)
for (int c = 0; c < str.Length; c++)
{
    if (str[c] == '@')
        atIdx = c;

    if (str[c] == ',')
        commaIdx = c;

    // A comma that comes after an '@' ends the current address;
    // a comma seen before any '@' belongs to a "Last, First" name.
    if (atIdx >= 0 && commaIdx > atIdx)
    {
        string temp = str.Substring(lastComma, commaIdx - lastComma).Trim();
        addresses.Add(temp);
        lastComma = commaIdx + 1;   // skip the separating comma itself
        atIdx = -1;                 // wait for the next address's '@'
    }
}

// Whatever follows the last separating comma is the final address
// (this also covers input that contains no separating comma at all).
string rest = str.Substring(lastComma).Trim();
if (rest.Length > 0)
    addresses.Add(rest);

The above code generates the individual addresses that I can process further down the line.

12 answers

#1


There isn't really an easy solution to this. I would recommend making a little state machine that reads char-by-char and does the work that way. Like you said, splitting by comma won't always work.

A state machine will allow you to cover all possibilities. I'm sure there are many others you haven't seen yet. For example: "First Last"

Look for the RFC about this to discover what all the possibilities are. Sorry, I don't know the number. There are probably several, as this is the kind of thing that evolves.
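
For illustration, here is a rough sketch of what such a char-by-char state machine might look like. The class name AddressListSplitter and its three states are my own invention, not something from the RFC, and it only knows about quoted strings and angle brackets, so treat it as a starting point rather than a complete parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: splits a header value on commas that sit
// outside quoted strings and outside <angle brackets>.
public static class AddressListSplitter
{
    private enum State { Normal, InQuotes, InAngle }

    public static List<string> Split(string header)
    {
        var result = new List<string>();
        var current = new StringBuilder();
        var state = State.Normal;

        for (int i = 0; i < header.Length; i++)
        {
            char ch = header[i];
            switch (state)
            {
                case State.Normal:
                    // Only a comma seen in the Normal state separates addresses.
                    if (ch == ',') { Flush(result, current); continue; }
                    if (ch == '"') state = State.InQuotes;
                    else if (ch == '<') state = State.InAngle;
                    break;

                case State.InQuotes:
                    if (ch == '\\' && i + 1 < header.Length)
                    {
                        // Keep an escaped character (e.g. \") without ending the quote.
                        current.Append(ch);
                        ch = header[++i];
                    }
                    else if (ch == '"') state = State.Normal;
                    break;

                case State.InAngle:
                    if (ch == '>') state = State.Normal;
                    break;
            }
            current.Append(ch);
        }

        Flush(result, current);
        return result;
    }

    // Adds the trimmed buffer to the result (if non-empty) and resets it.
    private static void Flush(List<string> result, StringBuilder current)
    {
        string item = current.ToString().Trim();
        if (item.Length > 0) result.Add(item);
        current.Clear();
    }
}
```

An unquoted "Last, First <name@domain.com>" would still be split at the comma by this sketch, since nothing marks that comma as part of the name; handling that case needs an extra heuristic such as the '@' check from the question.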

#2


At the risk of creating two problems, you could create a regular expression that matches any of your email formats. Use "|" to separate the formats within this one regex. Then you can run it over your input string and pull out all of the matches.

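using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using NUnit.Framework;

public class Address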
{
    private string _first;
    private string _last;
    private string _name;
    private string _domain;

    public Address(string first, string last, string name, string domain)
    {
        _first = first;
        _last = last;
        _name = name;
        _domain = domain;
    }

    public string First
    {
        get { return _first; }
    }

    public string Last
    {
        get { return _last; }
    }

    public string Name
    {
        get { return _name; }
    }

    public string Domain
    {
        get { return _domain; }
    }
}

[TestFixture]
public class RegexEmailTest
{
    [Test]
    public void TestThreeEmailAddresses()
    {
        Regex emailAddress = new Regex(
            @"((?<last>\w*), (?<first>\w*) <(?<name>\w*)@(?<domain>\w*\.\w*)>)|" +
            @"((?<first>\w*) (?<last>\w*) <(?<name>\w*)@(?<domain>\w*\.\w*)>)|" +
            @"((?<name>\w*)@(?<domain>\w*\.\w*))");
        string input = "First, Last <name@domain.com>, name@domain.com, First Last <name@domain.com>";

        MatchCollection matches = emailAddress.Matches(input);
        List<Address> addresses =
            (from Match match in matches
             select new Address(
                 match.Groups["first"].Value,
                 match.Groups["last"].Value,
                 match.Groups["name"].Value,
                 match.Groups["domain"].Value)).ToList();
        Assert.AreEqual(3, addresses.Count);

        Assert.AreEqual("Last", addresses[0].First);
        Assert.AreEqual("First", addresses[0].Last);
        Assert.AreEqual("name", addresses[0].Name);
        Assert.AreEqual("domain.com", addresses[0].Domain);

        Assert.AreEqual("", addresses[1].First);
        Assert.AreEqual("", addresses[1].Last);
        Assert.AreEqual("name", addresses[1].Name);
        Assert.AreEqual("domain.com", addresses[1].Domain);

        Assert.AreEqual("First", addresses[2].First);
        Assert.AreEqual("Last", addresses[2].Last);
        Assert.AreEqual("name", addresses[2].Name);
        Assert.AreEqual("domain.com", addresses[2].Domain);
    }
}

There are several downsides to this approach. One is that it doesn't validate the string. If you have any characters in the string that don't fit one of your chosen formats, then those characters are just ignored. Another is that the accepted formats are all expressed in one place. You cannot add new formats without changing the monolithic regex.
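
One way to at least notice the ignored characters is to remove everything the regex matched and look at what is left over. The sketch below is only an illustration of that idea; the LeftoverCheck class and its Unmatched method are hypothetical names, and it treats commas and whitespace at the ends of the residue as harmless separators.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: reports input text that the pattern did not consume.
public static class LeftoverCheck
{
    public static string Unmatched(Regex pattern, string input)
    {
        // Delete every match, then strip separators from both ends of the residue.
        string rest = pattern.Replace(input, string.Empty);
        return rest.Trim(' ', ',', '\t');
    }
}
```

If Unmatched returns a non-empty string, something in the header did not fit any of the accepted formats, and the caller can decide whether to reject the input or log it.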

#3


There is an internal System.Net.Mail.MailAddressParser class with a ParseMultipleAddresses method that does exactly what you want. You can call it directly through reflection, or indirectly by calling the MailMessage.To.Add method, which accepts an email list string.

// Option 1: call the internal parser through reflection.
// This relies on a non-public API that may change between framework versions.
private static IEnumerable<MailAddress> ParseAddress(string addresses)
{
    // The type is internal, so look it up in the assembly that defines MailAddress.
    var mailAddressParserClass = typeof(MailAddress).Assembly.GetType("System.Net.Mail.MailAddressParser");
    var parseMultipleAddressesMethod = mailAddressParserClass.GetMethod("ParseMultipleAddresses", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
    return (IEnumerable<MailAddress>)parseMultipleAddressesMethod.Invoke(null, new object[] { addresses });
}

// Option 2: let MailMessage.To.Add do the parsing.
private static IEnumerable<MailAddress> ParseAddress(string addresses)
{
    using (MailMessage message = new MailMessage())
    {
        message.To.Add(addresses);
        return new List<MailAddress>(message.To); // new List, because we don't want to hold a reference to a disposable object
    }
}

#4


Your 2nd email example is not a valid address as it contains a comma which is not within a quoted string. To be valid it should be like: "Last, First"<name@domain.com>.

As for parsing, if you want something that is quite strict, you could use System.Net.Mail.MailAddressCollection.
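
As a minimal sketch of that strict route: MailAddressCollection.Add accepts a comma-separated list and throws a FormatException when an entry is not valid. The StrictParser wrapper below is just a hypothetical name for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Mail;

// Hypothetical wrapper around MailAddressCollection for strict parsing.
public static class StrictParser
{
    public static List<MailAddress> Parse(string header)
    {
        var collection = new MailAddressCollection();
        // Add(string) parses a comma-separated address list;
        // it throws FormatException if any entry is malformed.
        collection.Add(header);
        return new List<MailAddress>(collection);
    }
}
```

Each resulting MailAddress exposes DisplayName and Address, so the name and the email part do not need to be separated by hand.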

If you just want your input split into separate email strings, then the following code should work. It is not very strict, but it will handle commas within quoted strings and throw an exception if the input contains an unclosed quote.

private const char QUOTE = '"';
private const char COMMA = ',';

public List<string> SplitAddresses(string addresses)
{
    var result = new List<string>();

    var startIndex = 0;
    var currentIndex = 0;
    var inQuotedString = false;

    while (currentIndex < addresses.Length)
    {
        if (addresses[currentIndex] == QUOTE)
        {
            inQuotedString = !inQuotedString;
        }
        // Split if a comma is found, unless inside a quoted string
        else if (addresses[currentIndex] == COMMA && !inQuotedString)
        {
            var address = GetAndCleanSubstring(addresses, startIndex, currentIndex);
            if (address.Length > 0)
            {
                result.Add(address);
            }
            startIndex = currentIndex + 1;
        }
        currentIndex++;
    }

    if (currentIndex > startIndex)
    {
        var address = GetAndCleanSubstring(addresses, startIndex, currentIndex);
        if (address.Length > 0)
        {
            result.Add(address);
        }
    }

    if (inQuotedString)
        throw new FormatException("Unclosed quote in email addresses");

    return result;
}

private string GetAndCleanSubstring(string addresses, int startIndex, int currentIndex)
{
    var address = addresses.Substring(startIndex, currentIndex - startIndex);
    address = address.Trim();
    return address;
}

#5


There is no generic simple solution to this. The RFC you want is RFC 2822 (since superseded by RFC 5322), which describes all of the possible configurations of an email address. The best way to get a correct result is to implement a state-based tokenizer that follows the rules specified in the RFC.

#6


Here is the solution I came up with to accomplish this:

String str = "Last, First <name@domain.com>, name@domain.com, First Last <name@domain.com>, \"First Last\" <name@domain.com>";

List<string> addresses = new List<string>();
int atIdx = -1;      // position of the '@' in the address being read (-1: none seen yet)
int commaIdx = -1;   // position of the most recent comma
int lastComma = 0;   // start of the address being read (just past the last separating comma)
for (int c = 0; c < str.Length; c++)
{
    if (str[c] == '@')
        atIdx = c;

    if (str[c] == ',')
        commaIdx = c;

    // A comma that comes after an '@' ends the current address;
    // a comma seen before any '@' belongs to a "Last, First" name.
    if (atIdx >= 0 && commaIdx > atIdx)
    {
        string temp = str.Substring(lastComma, commaIdx - lastComma).Trim();
        addresses.Add(temp);
        lastComma = commaIdx + 1;   // skip the separating comma itself
        atIdx = -1;                 // wait for the next address's '@'
    }
}

// Whatever follows the last separating comma is the final address
// (this also covers input that contains no separating comma at all).
string rest = str.Substring(lastComma).Trim();
if (rest.Length > 0)
    addresses.Add(rest);

#7


Here is how I would do it:

  • You can try to standardize the data as much as possible, i.e. get rid of such things as the < and > symbols and all of the commas after the '.com'. You will need the commas that separate the first and last names.

  • After getting rid of the extra symbols, put every grouped email record in a list as a string. You can use the .com to determine where to split the string if need be.

  • After you have the list of email addresses in the list of strings, you can then further split the email addresses using only whitespace as the delimiter.

  • The final step is to determine what is the first name, what is the last name, etc. This would be done by checking the 3 components for: a comma, which would indicate that it is the last name; a '.', which would indicate the actual address; and whatever is left is the first name. If there is no comma, then the first name is first, last name is second, etc.

    I don't know if this is the most concise solution, but it would work and does not require any advanced programming techniques.

#8


You could use regular expressions to try to separate this out, try this guy:

^(?<name1>[a-zA-Z0-9]+?),? (?<name2>[a-zA-Z0-9]+?),? (?<address1>[a-zA-Z0-9.-_<>]+?)$

This will match: Last, First test@test.com; Last, First <test@test.com>; First last test@test.com; First Last <test@test.com>. You can add another optional match at the end of the regex to pick up the last segment of First, Last <name@domain.com>, name@domain.com after the email address enclosed in angle brackets.

Hope this helps somewhat!

EDIT:

And of course you can add more characters to each of the sections to accept quotation marks etc. for whatever format is being read in. As sjbotha mentioned, this could be difficult, as the string that is submitted is not necessarily in a set format.

This link can give you more information about matching AND validating email addresses using regular expressions.

#9


// Based on Michael Perry's answer
// Needs to handle first.last@domain.com, first_last@domain.com and related syntaxes
// Also looks for first and last name within those email syntaxes

public class ParsedEmail
{
    private string _first;
    private string _last;
    private string _name;
    private string _domain;

    public ParsedEmail(string first, string last, string name, string domain)
    {
        _first = first;
        _last = last;
        _name = name;
        _domain = domain;

        // first.last@domain.com, first_last@domain.com etc. syntax
        char[] chars = { '.', '_', '+', '-' };
        var pos = _name.IndexOfAny(chars);

        if (string.IsNullOrWhiteSpace(_first) && string.IsNullOrWhiteSpace(_last) && pos > -1)
        {
            _first = _name.Substring(0, pos);
            _last = _name.Substring(pos+1);
        }
    }

    public string First
    {
        get { return _first; }
    }

    public string Last
    {
        get { return _last; }
    }

    public string Name
    {
        get { return _name; }
    }

    public string Domain
    {
        get { return _domain; }
    }

    public string Email
    {
        get
        {
            return Name + "@" + Domain;
        }
    }

    public override string ToString()
    {
        return Email;
    }

    public static IEnumerable<ParsedEmail> SplitEmailList(string delimList)
    {
        delimList = delimList.Replace("\"", string.Empty);

        Regex re = new Regex(
                    @"((?<last>\w*), (?<first>\w*) <(?<name>[a-zA-Z_0-9\.\+\-]+)@(?<domain>\w*\.\w*)>)|" +
                    @"((?<first>\w*) (?<last>\w*) <(?<name>[a-zA-Z_0-9\.\+\-]+)@(?<domain>\w*\.\w*)>)|" +
                    @"((?<name>[a-zA-Z_0-9\.\+\-]+)@(?<domain>\w*\.\w*))");


        MatchCollection matches = re.Matches(delimList);

        var parsedEmails =
                   (from Match match in matches
                    select new ParsedEmail(
                            match.Groups["first"].Value,
                            match.Groups["last"].Value,
                            match.Groups["name"].Value,
                            match.Groups["domain"].Value)).ToList();

        return parsedEmails;

    }


}

#10


I decided that I was going to draw a line in the sand at two restrictions:

  1. The To and Cc headers have to be CSV-parseable strings.

  2. Anything MailAddress couldn't parse, I'm just not going to worry about.

I also decided I'm just interested in email addresses and not display name, since display name is so problematic and hard to define, whereas email address I can validate. So I used MailAddress to validate my parsing.

I treated the To and Cc headers like a csv string, and again, anything not parseable in that way I don't worry about it.

private string GetProperlyFormattedEmailString(string emailString)
    {
        var emailStringParts = CSVProcessor.GetFieldsFromString(emailString);

        string emailStringProcessed = "";

        foreach (var part in emailStringParts)
        {
            try
            {
                var address = new MailAddress(part);
                emailStringProcessed += address.Address + ",";
            }
            catch (Exception)
            {
                //wasn't an email address
                throw;
            }
        }

        return emailStringProcessed.TrimEnd((','));
    }

EDIT

Further research has shown me that my assumptions are good. Reading through the spec, RFC 2822, pretty much shows that the To, Cc, and Bcc fields are CSV-parseable fields. So yeah, it's hard and there are a lot of gotchas, as with any CSV parsing, but if you have a reliable way to parse CSV fields (which TextFieldParser in the Microsoft.VisualBasic.FileIO namespace is, and is what I used for this), then you are golden.

Edit 2

Apparently they don't need to be valid CSV strings... the quotes really mess things up. So your CSV parser has to be fault tolerant. I made it try to parse the string; if that fails, it strips all quotes and tries again:

public static string[] GetFieldsFromString(string csvString)
    {
        using (var stringAsReader = new StringReader(csvString))
        {
            using (var textFieldParser = new TextFieldParser(stringAsReader))
            {
                SetUpTextFieldParser(textFieldParser, FieldType.Delimited, new[] {","}, false, true);

                try
                {
                    return textFieldParser.ReadFields();
                }
                catch (MalformedLineException ex1)
                {
                    //assume it's not parseable due to double quotes, so we strip them all out and take what we have
                    var sanitizedString = csvString.Replace("\"", "");

                    using (var sanitizedStringAsReader = new StringReader(sanitizedString))
                    {
                        using (var textFieldParser2 = new TextFieldParser(sanitizedStringAsReader))
                        {
                            SetUpTextFieldParser(textFieldParser2, FieldType.Delimited, new[] {","}, false, true);

                            try
                            {
                                return textFieldParser2.ReadFields().Select(part => part.Trim()).ToArray();
                            }
                            catch (MalformedLineException ex2)
                            {
                                return new string[] {csvString};
                            }
                        }
                    }
                }
            }
        }
    }

The one thing it won't handle is quoted accounts in an email i.e. "Monkey Header"@stupidemailaddresses.com.

And here's the test:

[Subject(typeof(CSVProcessor))]
public class when_processing_an_email_recipient_header
{
    static string recipientHeaderToParse1 = @"""Lastname, Firstname"" <firstname_lastname@domain.com>" + "," +
                                           @"<testto@domain.com>, testto1@domain.com, testto2@domain.com" + "," +
                                           @"<testcc@domain.com>, test3@domain.com" + "," +
                                           @"""""Yes, this is valid""""@[emails are hard to parse!]" + "," +
                                           @"First, Last <name@domain.com>, name@domain.com, First Last <name@domain.com>"
                                           ;

    static string[] results1;
    static string[] expectedResults1;

    Establish context = () =>
    {
        expectedResults1 = new string[]
        {
            @"Lastname",
            @"Firstname <firstname_lastname@domain.com>",
            @"<testto@domain.com>",
            @"testto1@domain.com",
            @"testto2@domain.com",
            @"<testcc@domain.com>",
            @"test3@domain.com",
            @"Yes",
            @"this is valid@[emails are hard to parse!]",
            @"First",
            @"Last <name@domain.com>",
            @"name@domain.com",
            @"First Last <name@domain.com>"
        };
    };

    Because of = () =>
    {
        results1 = CSVProcessor.GetFieldsFromString(recipientHeaderToParse1);
    };

    It should_parse_the_email_parts_properly = () => results1.ShouldBeLike(expectedResults1);
}

#11


Here's what I came up with. It assumes that a valid email address must have one and only one '@' sign in it:

    public List<MailAddress> ParseAddresses(string field)
    {
        var tokens = field.Split(',');
        var addresses = new List<string>();

        var tokenBuffer = new List<string>();

        foreach (var token in tokens)
        {
            tokenBuffer.Add(token);

            if (token.IndexOf("@", StringComparison.Ordinal) > -1)
            {
                addresses.Add( string.Join( ",", tokenBuffer));
                tokenBuffer.Clear();
            }
        }

        return addresses.Select(t => new MailAddress(t)).ToList();
    }

#12


I use the following regular expression in Java to extract the email string from an RFC-compliant email address:

[A-Za-z0-9]+[A-Za-z0-9._-]+@[A-Za-z0-9]+[A-Za-z0-9._-]+[.][A-Za-z0-9]{2,3}
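
Since the rest of the thread is in C#, here is a small sketch of how the same pattern could be applied with .NET's Regex class to pull every address out of a header. The EmailExtractor class is a hypothetical name, and the pattern is copied unchanged, so it keeps its limits: the local part needs at least two characters, and a top-level domain longer than three characters is cut short.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: collects every substring that matches the pattern above.
public static class EmailExtractor
{
    private static readonly Regex Pattern = new Regex(
        @"[A-Za-z0-9]+[A-Za-z0-9._-]+@[A-Za-z0-9]+[A-Za-z0-9._-]+[.][A-Za-z0-9]{2,3}");

    public static List<string> Extract(string header)
    {
        var found = new List<string>();
        foreach (Match m in Pattern.Matches(header))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

This only recovers the bare addresses; display names such as "First Last" are discarded, which may or may not be what you want.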

#1


There isn't really an easy solution to this. I would recommend making a little state machine that reads char-by-char and do the work that way. Like you said, splitting by comma won't always work.

对此没有一个简单的解决方案。我建议制作一个小型的状态机来读取char-by-char并以这种方式完成工作。就像你说的,用逗号分割并不总是有效。

A state machine will allow you to cover all possibilities. I'm sure there are many others you haven't seen yet. For example: "First Last"

状态机将允许您涵盖所有可能性。我相信还有很多其他你还没见过的人。例如:“First Last”

Look for the RFC about this to discover what all the possibilities are. Sorry, I don't know the number. There are probably multiple as this is the kind of things that evolves.

寻找关于此的RFC以发现所有可能性。对不起,我不知道这个号码。可能有多种,因为这是一种发展的东西。

#2


At the risk of creating two problems, you could create a regular expression that matches any of your email formats. Use "|" to separate the formats within this one regex. Then you can run it over your input string and pull out all of the matches.

冒着创建两个问题的风险,您可以创建一个与您的任何电子邮件格式匹配的正则表达式。使用“|”分离这一个正则表达式中的格式。然后,您可以在输入字符串上运行它并拉出所有匹配项。

public class Address
{
    private string _first;
    private string _last;
    private string _name;
    private string _domain;

    public Address(string first, string last, string name, string domain)
    {
        _first = first;
        _last = last;
        _name = name;
        _domain = domain;
    }

    public string First
    {
        get { return _first; }
    }

    public string Last
    {
        get { return _last; }
    }

    public string Name
    {
        get { return _name; }
    }

    public string Domain
    {
        get { return _domain; }
    }
}

[TestFixture]
public class RegexEmailTest
{
    [Test]
    public void TestThreeEmailAddresses()
    {
        Regex emailAddress = new Regex(
            @"((?<last>\w*), (?<first>\w*) <(?<name>\w*)@(?<domain>\w*\.\w*)>)|" +
            @"((?<first>\w*) (?<last>\w*) <(?<name>\w*)@(?<domain>\w*\.\w*)>)|" +
            @"((?<name>\w*)@(?<domain>\w*\.\w*))");
        string input = "First, Last <name@domain.com>, name@domain.com, First Last <name@domain.com>";

        MatchCollection matches = emailAddress.Matches(input);
        List<Address> addresses =
            (from Match match in matches
             select new Address(
                 match.Groups["first"].Value,
                 match.Groups["last"].Value,
                 match.Groups["name"].Value,
                 match.Groups["domain"].Value)).ToList();
        Assert.AreEqual(3, addresses.Count);

        Assert.AreEqual("Last", addresses[0].First);
        Assert.AreEqual("First", addresses[0].Last);
        Assert.AreEqual("name", addresses[0].Name);
        Assert.AreEqual("domain.com", addresses[0].Domain);

        Assert.AreEqual("", addresses[1].First);
        Assert.AreEqual("", addresses[1].Last);
        Assert.AreEqual("name", addresses[1].Name);
        Assert.AreEqual("domain.com", addresses[1].Domain);

        Assert.AreEqual("First", addresses[2].First);
        Assert.AreEqual("Last", addresses[2].Last);
        Assert.AreEqual("name", addresses[2].Name);
        Assert.AreEqual("domain.com", addresses[2].Domain);
    }
}

There are several down sides to this approach. One is that it doesn't validate the string. If you have any characters in the string that don't fit one of your chosen formats, then those characters are just ignored. Another is that the accepted formats are all expressed in one place. You cannot add new formats without changing the monolithic regex.

这种方法有几个缺点。一个是它不验证字符串。如果字符串中的任何字符不符合您选择的格式,则只会忽略这些字符。另一个是所接受的格式都在一个地方表达。如果不更改单片正则表达式,则无法添加新格式。

#3


There is internal System.Net.Mail.MailAddressParser class which has method ParseMultipleAddresses which does exactly what you want. You can access it directly through reflection or by calling MailMessage.To.Add method, which accepts email list string.

有一个内部的System.Net.Mail.MailAddressParser类,它有一个方法ParseMultipleAddresses,可以完全按照你的意愿执行。您可以通过反射或通过调用MailMessage.To.Add方法直接访问它,该方法接受电子邮件列表字符串。

private static IEnumerable<MailAddress> ParseAddress(string addresses)
{
    var mailAddressParserClass = Type.GetType("System.Net.Mail.MailAddressParser");
    var parseMultipleAddressesMethod = mailAddressParserClass.GetMethod("ParseMultipleAddresses", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Static);
    return (IList<MailAddress>)parseMultipleAddressesMethod.Invoke(null, new object[0]);
}


    private static IEnumerable<MailAddress> ParseAddress(string addresses)
    {
        MailMessage message = new MailMessage();
        message.To.Add(addresses);
        return new List<MailAddress>(message.To); //new List, because we don't want to hold reference on Disposable object
    }

#4


Your 2nd email example is not a valid address as it contains a comma which is not within a quoted string. To be valid it should be like: "Last, First"<name@domain.com>.

您的第二个电子邮件示例不是有效地址,因为它包含的逗号不在带引号的字符串中。为了有效,它应该是:“Last,First” @domain.com>

As for parsing, if you want something that is quite strict, you could use System.Net.Mail.MailAddressCollection.

至于解析,如果你想要一些非常严格的东西,你可以使用System.Net.Mail.MailAddressCollection。

If you just want to your input split into separate email strings, then the following code should work. It is not very strict but will handle commas within quoted strings and throw an exception if the input contains an unclosed quote.

如果您只想将输入拆分为单独的电子邮件字符串,则以下代码应该可以正常工作。它不是很严格,但会在引用的字符串中处理逗号,如果输入包含未闭合的引号则抛出异常。

public List<string> SplitAddresses(string addresses)
{
    var result = new List<string>();

    var startIndex = 0;
    var currentIndex = 0;
    var inQuotedString = false;

    while (currentIndex < addresses.Length)
    {
        if (addresses[currentIndex] == QUOTE)
        {
            inQuotedString = !inQuotedString;
        }
        // Split if a comma is found, unless inside a quoted string
        else if (addresses[currentIndex] == COMMA && !inQuotedString)
        {
            var address = GetAndCleanSubstring(addresses, startIndex, currentIndex);
            if (address.Length > 0)
            {
                result.Add(address);
            }
            startIndex = currentIndex + 1;
        }
        currentIndex++;
    }

    if (currentIndex > startIndex)
    {
        var address = GetAndCleanSubstring(addresses, startIndex, currentIndex);
        if (address.Length > 0)
        {
            result.Add(address);
        }
    }

    if (inQuotedString)
        throw new FormatException("Unclosed quote in email addresses");

    return result;
}

private string GetAndCleanSubstring(string addresses, int startIndex, int currentIndex)
{
    var address = addresses.Substring(startIndex, currentIndex - startIndex);
    address = address.Trim();
    return address;
}

#5


There is no generic simple solution to this. The RFC you want is RFC2822, which describes all of the possible configurations of an email address. The best you are going to get that will be correct is to implement a state-based tokenizer that follows the rules specified in the RFC.

对此没有通用的简单解决方案。您想要的RFC是RFC2822,它描述了电子邮件地址的所有可能配置。您将获得的最佳方法是实现遵循RFC中指定的规则的基于状态的标记生成器。

#6


Here is the solution i came up with to accomplish this:

这是我想出的解决方案:

String str = "Last, First <name@domain.com>, name@domain.com, First Last <name@domain.com>, \"First Last\" <name@domain.com>";

List<string> addresses = new List<string>();
int atIdx = 0;
int commaIdx = 0;
int lastComma = 0;
for (int c = 0; c < str.Length; c++)
{
if (str[c] == '@')
    atIdx = c;

if (str[c] == ',')
    commaIdx = c;

if (commaIdx > atIdx && atIdx > 0)
{
    string temp = str.Substring(lastComma, commaIdx - lastComma);
    addresses.Add(temp);
    lastComma = commaIdx;
    atIdx = commaIdx;
}

if (c == str.Length -1)
{
    string temp = str.Substring(lastComma, str.Legth - lastComma);
    addresses.Add(temp);
}
}

if (commaIdx < 2)
{
    // if we get here we can assume either there was no comma, or there was only one comma as part of the last, first combo
    addresses.Add(str);
}

#7


Here is how I would do it:

我是这样做的:

  • You can try to standardize the data as much as possible i.e. get rid of such things as the < and > symbols and all of the commas after the '.com.' You will need the commas that separate the first and last names.
  • 您可以尝试尽可能地标准化数据,即除去 <和> 符号以及'.com'之后的所有逗号。您将需要用于分隔名字和姓氏的逗号。

  • After getting rid of the extra symbols, put every grouped email record in a list as a string. You can use the .com to determine where to split the string if need be.
  • 在删除额外符号后,将每个分组的电子邮件记录作为字符串放在列表中。如果需要,您可以使用.com来确定拆分字符串的位置。

  • After you have the list of email addresses in the list of strings, you can then further split the email addresses using only whitespace as the delimeter.
  • 在字符串列表中有电子邮件地址列表后,您可以使用空格作为分隔符进一步拆分电子邮件地址。

  • The final step is to determine what is the first name, what is the last name, etc. This would be done by checking the 3 components for: a comma, which would indicate that it is the last name; a . which would indicate the actual address; and whatever is left is the first name. If there is no comma, then the first name is first, last name is second, etc.

    I don't know if this is the most concise solution, but it would work and does not require any advanced programming techniques
  • 最后一步是确定名字是什么,姓氏是什么等等。这可以通过检查3个组件来完成:逗号,表示它是姓氏;一个 。这表示实际地址;剩下的就是名字。如果没有逗号,那么第一个名字是第一个,最后一个名字是第二个,等等。我不知道这是否是最简洁的解决方案,但它可以工作,不需要任何高级编程技术

#8


You could use regular expressions to try to separate this out, try this guy:

您可以使用正则表达式尝试将其分开,试试这个人:

^(?<name1>[a-zA-Z0-9]+?),? (?<name2>[a-zA-Z0-9]+?),? (?<address1>[a-zA-Z0-9.-_<>]+?)$

will match: Last, First test@test.com; Last, First <test@test.com>; First last test@test.com; First Last <test@test.com>. You can add another optional match in the regex at the end to pick up the last segment of First, Last <name@domain.com>, name@domain.com after the email address enclosed in angled braces.

将匹配:Last, First test@test.com;Last, First <test@test.com>;First last test@test.com;First Last <test@test.com>。您可以在正则表达式末尾添加另一个可选匹配项,以便选取 First, Last <name@domain.com>, name@domain.com 的最后一段,即尖括号括起的电子邮件地址之后的那个地址。
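A minimal sketch of applying such a pattern with .NET's `Regex` class. The wrapper class `NameAddressRegexDemo` is hypothetical, and the address character class is written out explicitly (including '@') as an assumption about what the pattern is meant to accept.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern; for illustration only.
public static class NameAddressRegexDemo
{
    private static readonly Regex Pattern = new Regex(
        @"^(?<name1>[a-zA-Z0-9]+?),? (?<name2>[a-zA-Z0-9]+?),? (?<address1>[a-zA-Z0-9._<>@-]+?)$");

    // Returns { name1, name2, address } or null when the entry does not fit the pattern.
    public static string[] Split(string entry)
    {
        Match m = Pattern.Match(entry);
        if (!m.Success) return null;

        return new[]
        {
            m.Groups["name1"].Value,
            m.Groups["name2"].Value,
            // The angle brackets are part of the match, so strip them here.
            m.Groups["address1"].Value.Trim('<', '>')
        };
    }
}
```

Note that the pattern expects exactly one entry per string and always two name words, so a bare address such as test@test.com does not match and yields null here.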

Hope this helps somewhat!

希望这有点帮助!

EDIT:

And of course you can add more characters to each of the sections to accept quotation marks etc. for whatever format is being read in. As sjbotha mentioned, this could be difficult, as the string that is submitted is not necessarily in a set format.

当然,您可以为每个部分添加更多字符以接受引用等任何格式正在读取。正如sjbotha所提到的,这可能很难,因为提交的字符串不一定是设置格式。

This link can give you more information about matching AND validating email addresses using regular expressions.

此链接可以为您提供有关使用正则表达式匹配和验证电子邮件地址的更多信息。

#9


// Based on Michael Perry's answer
// needs to handle first.last@domain.com, first_last@domain.com and related syntaxes
// also looks for first and last name within those email syntaxes

//基于Michael Perry的回答* //需要处理first.last@domain.com,first_last@domain.com和相关语法//还会查找这些电子邮件语法中的名字和姓氏

public class ParsedEmail
{
    private string _first;
    private string _last;
    private string _name;
    private string _domain;

    public ParsedEmail(string first, string last, string name, string domain)
    {
        // keep the display-name parts captured by the regex (empty when absent)
        _first = first;
        _last = last;
        _name = name;
        _domain = domain;

        // first.last@domain.com, first_last@domain.com etc. syntax
        char[] chars = { '.', '_', '+', '-' };
        var pos = _name.IndexOfAny(chars);

        // only guess the names from the local part when no display name was supplied
        if (string.IsNullOrWhiteSpace(_first) && string.IsNullOrWhiteSpace(_last) && pos > -1)
        {
            _first = _name.Substring(0, pos);
            _last = _name.Substring(pos+1);
        }
    }

    public string First
    {
        get { return _first; }
    }

    public string Last
    {
        get { return _last; }
    }

    public string Name
    {
        get { return _name; }
    }

    public string Domain
    {
        get { return _domain; }
    }

    public string Email
    {
        get
        {
            return Name + "@" + Domain;
        }
    }

    public override string ToString()
    {
        return Email;
    }

    public static IEnumerable<ParsedEmail> SplitEmailList(string delimList)
    {
        delimList = delimList.Replace("\"", string.Empty);

        Regex re = new Regex(
                    @"((?<last>\w*), (?<first>\w*) <(?<name>[a-zA-Z_0-9\.\+\-]+)@(?<domain>\w*\.\w*)>)|" +
                    @"((?<first>\w*) (?<last>\w*) <(?<name>[a-zA-Z_0-9\.\+\-]+)@(?<domain>\w*\.\w*)>)|" +
                    @"((?<name>[a-zA-Z_0-9\.\+\-]+)@(?<domain>\w*\.\w*))");


        MatchCollection matches = re.Matches(delimList);

        var parsedEmails =
                   (from Match match in matches
                    select new ParsedEmail(
                            match.Groups["first"].Value,
                            match.Groups["last"].Value,
                            match.Groups["name"].Value,
                            match.Groups["domain"].Value)).ToList();

        return parsedEmails;

    }


}
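The fallback used in the constructor, guessing a first and last name from the part before the '@', can be looked at in isolation. The following is a small sketch of that one step only; `LocalPartNames.Guess` is a made-up name and not part of the class above.

```csharp
using System;

// Hypothetical stand-alone version of the local-part fallback.
public static class LocalPartNames
{
    private static readonly char[] Separators = { '.', '_', '+', '-' };

    // Splits "first.last" style local parts at the first separator.
    // Returns two empty strings when there is no separator to split on.
    public static (string First, string Last) Guess(string localPart)
    {
        int pos = localPart.IndexOfAny(Separators);
        if (pos < 0) return ("", "");

        return (localPart.Substring(0, pos), localPart.Substring(pos + 1));
    }
}
```

Because only the first separator is used, a local part like first.middle.last gives "first" and "middle.last"; whether that is acceptable depends on the data.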

#10


I decided that I was going to draw a line in the sand at two restrictions:

我决定在两个限制条件下在沙滩上划一条线:

  1. The To and Cc headers have to be csv parseable strings.
     To和Cc头必须是csv可解析字符串。

  2. Anything MailAddress couldn't parse, I'm just not going to worry about it.
     任何MailAddress都无法解析,我只是不担心它。

I also decided I'm just interested in email addresses and not display name, since display name is so problematic and hard to define, whereas email address I can validate. So I used MailAddress to validate my parsing.

我还决定我只对电子邮件地址感兴趣,而不是显示名称,因为显示名称是如此有问题且难以定义,而电子邮件地址我可以验证。所以我使用MailAddress来验证我的解析。

I treated the To and Cc headers like a csv string, and again, I don't worry about anything that can't be parsed that way.

我把To和Cc标题视为csv字符串,再次,任何不可解析的东西我都不担心。

private string GetProperlyFormattedEmailString(string emailString)
    {
        var emailStringParts = CSVProcessor.GetFieldsFromString(emailString);

        string emailStringProcessed = "";

        foreach (var part in emailStringParts)
        {
            try
            {
                var address = new MailAddress(part);
                emailStringProcessed += address.Address + ",";
            }
            catch (Exception)
            {
                // wasn't an email address; rethrow so the caller sees the failure
                throw;
            }
        }

        return emailStringProcessed.TrimEnd(',');
    }

EDIT

Further research has shown me that my assumptions are good. Reading through RFC 2822 pretty much shows that the To, Cc, and Bcc fields are csv-parseable fields. So yeah, it's hard and there are a lot of gotchas, as with any csv parsing, but if you have a reliable way to parse csv fields (TextFieldParser in the Microsoft.VisualBasic.FileIO namespace is one, and is what I used for this), then you are golden.

进一步的研究表明我的假设是好的。阅读规范RFC 2822几乎可以看出To,Cc和Bcc字段是csv-parseable字段。所以是的,它很难,并且有很多陷阱,就像任何csv解析一样,但是如果你有一个可靠的方法来解析csv字段(Microsoft.VisualBasic.FileIO命名空间中的TextFieldParser是,我就是这个用的)那你就是金色的。

Edit 2

Apparently they don't need to be valid CSV strings... the quotes really mess things up. So your csv parser has to be fault tolerant. I made it try to parse the string; if that fails, it strips all quotes and tries again:

显然他们不需要是有效的CSV字符串...引号真的搞砸了。所以你的csv解析器必须是容错的。我试图解析字符串,如果失败,它会删除所有引号并再次尝试:

public static string[] GetFieldsFromString(string csvString)
    {
        using (var stringAsReader = new StringReader(csvString))
        {
            using (var textFieldParser = new TextFieldParser(stringAsReader))
            {
                SetUpTextFieldParser(textFieldParser, FieldType.Delimited, new[] {","}, false, true);

                try
                {
                    return textFieldParser.ReadFields();
                }
                catch (MalformedLineException ex1)
                {
                    //assume it's not parseable due to double quotes, so we strip them all out and take what we have
                    var sanitizedString = csvString.Replace("\"", "");

                    using (var sanitizedStringAsReader = new StringReader(sanitizedString))
                    {
                        using (var textFieldParser2 = new TextFieldParser(sanitizedStringAsReader))
                        {
                            SetUpTextFieldParser(textFieldParser2, FieldType.Delimited, new[] {","}, false, true);

                            try
                            {
                                return textFieldParser2.ReadFields().Select(part => part.Trim()).ToArray();
                            }
                            catch (MalformedLineException ex2)
                            {
                                // still not parseable: give up and return the whole string as one field
                                return new string[] {csvString};
                            }
                        }
                    }
                }
            }
        }
    }
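The SetUpTextFieldParser helper called above is not shown in the answer. A plausible shape for it, purely as an assumption, is sketched below: the parameter names and the meaning of the two booleans (trim whitespace, fields enclosed in quotes) are guesses, while the TextFieldParser members used are the standard ones from Microsoft.VisualBasic.FileIO.

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

// Hypothetical reconstruction of the helper; the real one may differ.
public static class CsvParserSetup
{
    public static void SetUpTextFieldParser(
        TextFieldParser parser,
        FieldType fieldType,
        string[] delimiters,
        bool trimWhiteSpace,            // assumed meaning of the first bool
        bool hasFieldsEnclosedInQuotes) // assumed meaning of the second bool
    {
        parser.TextFieldType = fieldType;
        parser.SetDelimiters(delimiters);
        parser.TrimWhiteSpace = trimWhiteSpace;
        parser.HasFieldsEnclosedInQuotes = hasFieldsEnclosedInQuotes;
    }

    // Reads one comma-delimited line using the same settings as the answer's calls.
    public static string[] ReadSingleLine(string csv)
    {
        using (var reader = new StringReader(csv))
        using (var parser = new TextFieldParser(reader))
        {
            SetUpTextFieldParser(parser, FieldType.Delimited, new[] { "," }, false, true);
            return parser.ReadFields();
        }
    }
}
```

With quote handling switched on, a comma inside a fully quoted field such as "Last, First" is kept as part of that field, which is the behaviour the csv approach relies on.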

The one thing it won't handle is a quoted local part in an address, e.g. "Monkey Header"@stupidemailaddresses.com.

它不会处理的一件事是在电子邮件中引用帐户,即“Monkey Header”@ stupidemailaddresses.com。

And here's the test:

这是测试:

[Subject(typeof(CSVProcessor))]
public class when_processing_an_email_recipient_header
{
    static string recipientHeaderToParse1 = @"""Lastname, Firstname"" <firstname_lastname@domain.com>" + "," +
                                           @"<testto@domain.com>, testto1@domain.com, testto2@domain.com" + "," +
                                           @"<testcc@domain.com>, test3@domain.com" + "," +
                                           @"""""Yes, this is valid""""@[emails are hard to parse!]" + "," +
                                           @"First, Last <name@domain.com>, name@domain.com, First Last <name@domain.com>"
                                           ;

    static string[] results1;
    static string[] expectedResults1;

    Establish context = () =>
    {
        expectedResults1 = new string[]
        {
            @"Lastname",
            @"Firstname <firstname_lastname@domain.com>",
            @"<testto@domain.com>",
            @"testto1@domain.com",
            @"testto2@domain.com",
            @"<testcc@domain.com>",
            @"test3@domain.com",
            @"Yes",
            @"this is valid@[emails are hard to parse!]",
            @"First",
            @"Last <name@domain.com>",
            @"name@domain.com",
            @"First Last <name@domain.com>"
        };
    };

    Because of = () =>
    {
        results1 = CSVProcessor.GetFieldsFromString(recipientHeaderToParse1);
    };

    It should_parse_the_email_parts_properly = () => results1.ShouldBeLike(expectedResults1);
}

#11


Here's what I came up with. It assumes that a valid email address must have one and only one '@' sign in it:

这就是我想出的。它假定有效的电子邮件地址必须只有一个“@”符号:

    public List<MailAddress> ParseAddresses(string field)
    {
        var tokens = field.Split(',');
        var addresses = new List<string>();

        var tokenBuffer = new List<string>();

        foreach (var token in tokens)
        {
            tokenBuffer.Add(token);

            if (token.IndexOf("@", StringComparison.Ordinal) > -1)
            {
                addresses.Add( string.Join( ",", tokenBuffer));
                tokenBuffer.Clear();
            }
        }

        return addresses.Select(t => new MailAddress(t)).ToList();
    }
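To see what the grouping step produces before MailAddress gets involved, here is a sketch of the same buffering idea that returns plain strings. The class name `AtSignGrouping` is made up for this example, and the Trim call is an addition so the results are easier to compare.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical string-only variant of the token-buffer approach.
public static class AtSignGrouping
{
    public static List<string> Group(string field)
    {
        var groups = new List<string>();
        var buffer = new List<string>();

        foreach (string token in field.Split(','))
        {
            buffer.Add(token);

            // A token containing '@' closes the current address;
            // earlier buffered tokens are name parts such as "Last".
            if (token.Contains("@"))
            {
                groups.Add(string.Join(",", buffer).Trim());
                buffer.Clear();
            }
        }

        // Tokens left in the buffer never saw an '@' and are dropped.
        return groups;
    }
}
```

For the input "Last, First <name@domain.com>, name@domain.com" this gives two entries, "Last, First <name@domain.com>" and "name@domain.com". A display name that itself contains an '@' would be cut in the wrong place.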

#12


I use the following regular expression in Java to extract the email string from an RFC-compliant email address:

我在Java中使用以下正则表达式从RFC兼容的电子邮件地址中获取电子邮件字符串:

[A-Za-z0-9]+[A-Za-z0-9._-]+@[A-Za-z0-9]+[A-Za-z0-9._-]+[.][A-Za-z0-9]{2,3}
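The answer uses this pattern from Java; since the rest of the thread is C#, here is a rough sketch of the same pattern with .NET's `Regex`, pulling every match out of a header string. The class name `EmailExtractor` is made up for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the pattern itself is copied unchanged from the answer.
public static class EmailExtractor
{
    private static readonly Regex EmailPattern = new Regex(
        @"[A-Za-z0-9]+[A-Za-z0-9._-]+@[A-Za-z0-9]+[A-Za-z0-9._-]+[.][A-Za-z0-9]{2,3}");

    // Returns every substring of the header that the pattern accepts.
    public static List<string> Extract(string header)
    {
        return EmailPattern.Matches(header)
                           .Cast<Match>()
                           .Select(m => m.Value)
                           .ToList();
    }
}
```

As written, the pattern needs at least two characters before the '@' and at least two before the final dot, so very short addresses such as a@b.com are not picked up.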