A simple way to parse a person's name into its component parts?

Time: 2022-09-13 10:53:52

A lot of contact management programs do this - you type in a name (e.g., "John W. Smith") and it automatically breaks it up internally into:

First name: John
Middle name: W.
Last name: Smith

Likewise, it figures out things like "Mrs. Jane W. Smith" and "Dr. John Doe, Jr." correctly as well (assuming you allow for fields like "prefix" and "suffix" in names).

I assume this is a fairly common thing that people would want to do... so the question is... how would you do it? Is there a simple algorithm for this? Maybe a regular expression?

I'm after a .NET solution, but I'm not picky.

Update: I appreciate that there is no simple solution for this that covers ALL edge cases and cultures... but let's say for the sake of argument that you need the name in pieces (filling out forms - as in, say, tax or other government forms - is one case where you are bound to enter the name into fixed fields, whether you like it or not), but you don't necessarily want to force the user to enter their name into discrete fields (less typing = easier for novice users).

You'd want to have the program "guess" (as best it can) on what's first, middle, last, etc. If you can, look at how Microsoft Outlook does this for contacts - it lets you type in the name, but if you need to clarify, there's an extra little window you can open. I'd do the same thing - give the user the window in case they want to enter the name in discrete pieces - but allow for entering the name in one box and doing a "best guess" that covers most common names.

22 answers

#1


10  

There is no simple solution for this. Name construction varies from culture to culture, and even in the English-speaking world there are prefixes and suffixes that aren't necessarily part of the name.

A basic approach is to look for honorifics at the beginning of the string (e.g., "Hon. John Doe") and numbers or some other strings at the end (e.g., "John Doe IV", "John Doe Jr."), but really all you can do is apply a set of heuristics and hope for the best.
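
For what it's worth, here is a minimal Python sketch of that basic approach (the asker said they weren't picky about language). The `HONORIFICS` and `SUFFIXES` lists and the `strip_affixes` name are illustrative assumptions, not anything standard, and the lists are far too short for real data.

```python
import re

# Hypothetical, deliberately short lists; extend them for real data.
HONORIFICS = {"mr", "mrs", "ms", "miss", "dr", "hon", "prof"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv", "esq"}


def _key(token):
    """Normalize a token for list lookup: drop periods/commas, lowercase."""
    return re.sub(r"[.,]", "", token).lower()


def strip_affixes(full_name):
    """Return (prefix, core_tokens, suffix) using the heuristics above."""
    tokens = full_name.split()
    prefix = suffix = ""
    if tokens and _key(tokens[0]) in HONORIFICS:
        prefix = tokens.pop(0)
    if len(tokens) > 1 and _key(tokens[-1]) in SUFFIXES:
        suffix = tokens.pop()
    # A comma before the suffix ("Doe, Jr.") belongs to neither part.
    core = [t.rstrip(",") for t in tokens]
    return prefix, core, suffix


print(strip_affixes("Hon. John Doe"))      # ('Hon.', ['John', 'Doe'], '')
print(strip_affixes("Dr. John Doe, Jr."))  # ('Dr.', ['John', 'Doe'], 'Jr.')
```

What is left in the middle still has to be divided into first, middle and last, which is where the guessing really starts.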

It might be useful to find a list of unprocessed names and test your algorithm against it. I don't know that there's anything prepackaged out there, though.

#2


31  

If you must do this parsing, I'm sure you'll get lots of good suggestions here.

My suggestion is - don't do this parsing.

Instead, create your input fields so that the information is already separated out. Have separate fields for title, first name, middle initial, last name, suffix, etc.

#3


14  

I know this is old and might already be answered somewhere I couldn't find, but since I couldn't find anything that works for me, this is what I came up with. I think it works a lot like Google Contacts and Microsoft Outlook. It doesn't handle edge cases well, but in a good CRM-type app the user can always be asked to resolve those (in my app I actually have separate fields all the time, but I need this for data import from another app that only has one field):

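    // Needs: using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
    // As an extension method, this must be declared inside a static class.
    public static void ParseName(this string s, out string prefix, out string first, out string middle, out string last, out string suffix)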
    {
        prefix = "";
        first = "";
        middle = "";
        last = "";
        suffix = "";

        // Split on period, commas or spaces, but don't remove from results.
        List<string> parts = Regex.Split(s, @"(?<=[., ])").ToList();

        // Remove any empty parts
        for (int x = parts.Count - 1; x >= 0; x--)
            if (parts[x].Trim() == "")
                parts.RemoveAt(x);

        if (parts.Count > 0)
        {
            // Might want to add more to this list
            string[] prefixes = { "mr", "mrs", "ms", "dr", "miss", "sir", "madam", "mayor", "president" };

            // If first part is a prefix, set prefix and remove part
            string normalizedPart = parts.First().Replace(".", "").Replace(",", "").Trim().ToLower();
            if (prefixes.Contains(normalizedPart))
            {
                prefix = parts[0].Trim();
                parts.RemoveAt(0);
            }
        }

        if (parts.Count > 0)
        {
            // Might want to add more to this list, or use code/regex for roman-numeral detection
            string[] suffixes = { "jr", "sr", "i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x", "xi", "xii", "xiii", "xiv", "xv" };

            // If last part is a suffix, set suffix and remove part
            string normalizedPart = parts.Last().Replace(".", "").Replace(",", "").Trim().ToLower();
            if (suffixes.Contains(normalizedPart))
            {
                suffix = parts.Last().Replace(",", "").Trim();
                parts.RemoveAt(parts.Count - 1);
            }
        }

        // Done, if no more parts
        if (parts.Count == 0)
            return;

        // If only one part left...
        if (parts.Count == 1)
        {
            // If no prefix, assume first name, otherwise last
            // i.e.- "Dr Jones", "Ms Jones" -- likely to be last
            if(prefix == "")
                first = parts.First().Replace(",", "").Trim();
            else
                last = parts.First().Replace(",", "").Trim();
        }

        // If first part ends with a comma, assume format:
        //   Last, First [...First...]
        else if (parts.First().EndsWith(","))
        {
            last = parts.First().Replace(",", "").Trim();
            for (int x = 1; x < parts.Count; x++)
                first += parts[x].Replace(",", "").Trim() + " ";
            first = first.Trim();
        }

        // Otherwise assume format:
        // First [...Middle...] Last

        else
        {
            first = parts.First().Replace(",", "").Trim();
            last = parts.Last().Replace(",", "").Trim();
            for (int x = 1; x < parts.Count - 1; x++)
                middle += parts[x].Replace(",", "").Trim() + " ";
            middle = middle.Trim();
        }
    }

Sorry that the code is long and ugly, I haven't gotten around to cleaning it up. It is a C# extension, so you would use it like:

    string name = "Miss Jessica Dark-Angel Alba";
    string prefix, first, middle, last, suffix;
    name.ParseName(out prefix, out first, out middle, out last, out suffix);

#4


5  

You probably don't need to do anything fancy really. Something like this should work.

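    // Assumes Name, GivenName, MiddleName and FamilyName are strings declared elsewhere.
    Name = Name.Trim();

    // Split on spaces; RemoveEmptyEntries ignores runs of repeated spaces.
    string[] arrNames = Name.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    if (arrNames.Length > 0) {
        GivenName = arrNames[0];
    }
    if (arrNames.Length > 1) {
        FamilyName = arrNames[arrNames.Length - 1];
    }
    if (arrNames.Length > 2) {
        // Everything between the first and last token is treated as the middle name.
        MiddleName = string.Join(" ", arrNames, 1, arrNames.Length - 2);
    }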

You may also want to check for titles first.

#5


4  

I had to do this. Actually, something much harder than this, because sometimes the "name" would be "Smith, John" or "Smith John" instead of "John Smith", or not a person's name at all but instead a name of a company. And it had to do it automatically with no opportunity for the user to correct it.

What I ended up doing was coming up with a finite list of patterns that the name could be in, like:
Last, First Middle-Initial
First Last
First Middle-Initial Last
Last, First Middle
First Middle Last
First Last

Throw your Mr's and Jr's in there too. Let's say you end up with a dozen or so patterns.

My application had dictionaries of common first names and common last names (you can find these on the web), common titles, and common suffixes (jr, sr, md), and using those it was able to make really good guesses about which pattern applied. I'm not that smart, my logic wasn't that fancy, and yet it wasn't that hard to create some logic that guessed right more than 99% of the time.
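
A rough sketch of how dictionary lookups can pick between two of those patterns, in Python. The two tiny dictionaries and the `guess_order` function are made up for illustration; this is not the poster's actual code, and it only handles two-word names.

```python
# Tiny, made-up sample dictionaries; a real system would load large lists.
FIRST_NAMES = {"john", "jane", "mary", "robert"}
LAST_NAMES = {"smith", "doe", "jones"}


def guess_order(name):
    """Guess (first, last) for a two-word name that may be in either order."""
    if "," in name:
        # "Last, First" is unambiguous thanks to the comma.
        last, first = [part.strip() for part in name.split(",", 1)]
        return first, last
    a, b = name.split()
    # Score each reading by how many words land in the matching dictionary.
    forward = (a.lower() in FIRST_NAMES) + (b.lower() in LAST_NAMES)
    reverse = (b.lower() in FIRST_NAMES) + (a.lower() in LAST_NAMES)
    # Ties fall back to the more common "First Last" order.
    return (b, a) if reverse > forward else (a, b)


print(guess_order("Smith John"))   # ('John', 'Smith')
print(guess_order("Smith, John"))  # ('John', 'Smith')
```

The same scoring idea extends to the longer patterns: try each pattern, count dictionary hits, keep the best.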

#6


3  

I appreciate that this is hard to do right - but if you provide the user a way to edit the results (say, a pop-up window to edit the name if it didn't guess right) and still guess "right" for most cases... of course it's the guessing that's tough.

It's easy to say "don't do it" when looking at the problem theoretically, but sometimes circumstances dictate otherwise. Having fields for all the parts of a name (title, first, middle, last, suffix, just to name a few) can take up a lot of screen real estate - and combined with the problem of the address (a topic for another day) can really clutter up what should be a clean, simple UI.

I guess the answer should be "don't do it unless you absolutely have to, and if you do, keep it simple (some methods for this have been posted here) and provide the user the means to edit the results if needed."

#7


3  

Understanding that this is a bad idea, I wrote this regex in Perl - here's what worked best for me. I had already filtered out company names.
Output in vCard format: (hon_prefix, given_name, additional_name, family_name, hon_suffix)

/^ \s*
    (?: ( Dr\.? | Mr\.? | Mr?s\.? | Miss | 2nd\sLt\.? | Sen\.? ) \s+ )?   # prefix
    ( \w+ | \w\. )                                                       # first name
    (?: \s+ ( \w\.? | \w\w+ ) )?                                         # middle name or initial
    (?: \s+ ( (?:[OD]['’]\s?)? [-\w]+ ) )                                # last name
    (?: ,? \s+ ( [JS]r\.? | Esq\.? | (?:M|Ph|Ed)\.?\s*D\.?
               | R\.?N\.? | I+ ) )?                                      # suffix
\s* $/x

Notes:

  • Doesn't handle IV, V, VI
  • Hard-coded lists of prefixes and suffixes, evolved from a dataset of ~2K names
  • Doesn't handle multiple suffixes (e.g. MD, PhD)
  • Designed for American names - will not work properly on romanized Japanese names or other naming systems

#8


2  

There are a few add-ins we have used in our company to accomplish this. I ended up creating a way to specify the name format for each of our different imports for different clients. There is a company with a tool that, in my experience, is well worth the price and is really incredible at tackling this subject. It's at: http://www.softwarecompany.com/ and works great. The most efficient way to do this without using any statistical approach is to split the string by commas or spaces, then:

  1. Strip titles and prefixes out.
  2. Strip suffixes out.
  3. Parse the name according to the number of parts (2 names = F & L, 3 names = F M L or L M F), depending on the order of the string.

#9


2  

The real solution here does not answer the question. The portent of the information must be observed. A name is not just a name; it is how we are known.

The problem here is not knowing exactly which parts are labeled what, and what they are used for. Honorific prefixes should be granted only in personal correspondence; Doctor is an honorific that is derived from a title. All information about a person is relevant to their identity; it is just a matter of determining which information is relevant. You need a first and last name for reasons of administration; phone numbers, email addresses, land descriptions and mailing addresses all go to the portent of identity, knowing who you are dealing with.

The real problem here is that the person gets lost in the administration. All of a sudden, after only entering their personal information into a form and submitting it to an arbitrary program for processing, they are afforded all sorts of honorifics and pleasantries spewed out by a prefabricated template. This is wrong; honorable Sir or Madam, if personal interest is shown toward the reason for the correspondence, then a letter should never be written from a template. Personal correspondence requires a little knowledge about the recipient: male or female, whether they went to school to be a doctor or judge, the culture in which they were raised.

In other cultures, a name is made up of a variable number of characters. To us, the person's name can only be interpreted as a string of numbers where the spaces are actually determined by character width instead of the space character. Honorifics in these cases are instead one or many characters prefixing and suffixing the actual name. The polite thing to do is use the string you are given; if you know the honorific then by all means use it, but this again implies some sort of personal knowledge of the recipient. Calling Sensei anything other than Sensei is wrong. Not in the sense of a logic error, but in that you have just insulted your caller, and now you should find a template that helps you apologize.

For the purposes of automated, impersonal correspondence, a template may be devised for such things as daily articles, weekly issues or whatever, but the problem becomes important when the correspondence is instigated by the recipient with an automated service.

What happens is an error. Missing information. Unknown or missing information will always generate an Exception. The real problem is not how you separate a person's name into its separate components with an expression, but what you call them.

The solution is to create an extra field, make it optional if there is already a first and last name, and call it "What may we call you" or "How should we refer to you as". A doctor and a judge will ensure you address them properly. These are not programming issues, they are issues of communication.

OK, that's a bad way to put it, but in my opinion Username, Tagname, and ID are worse. So my solution is the missing question: "What should we call you?"

This is only a solution where you can afford to ask a new question. Tact prevails. Create a new field on your user forms, call it Alias, and label it for the user as "What should we call you?"; then you have a means to communicate with them. Use the first and last name unless the recipient has given an Alias; if the recipient is personally familiar with the sender, then first and middle is acceptable.

To Me, _______________________ (standard subscribed correspondence)
To Me ( Myself | I ), ________ (standard recipient-instigated correspondence)
To Me Myself I, ______________ (look out, it's your mother, and you're in big trouble;
                                nobody addresses a person by their actual full name)

Dear *(Mr./Mrs./Ms./Dr./Hon./Sen.) Me M. I *(I),
To Whom it may Concern;

Otherwise you are looking for something standard: hello, greetings, you may be a winner.

Where you have data that is a person's name all in one string, you don't have a problem because you already have their alias. If what you need is the first and last name, then something like Left(name, InStr(name, " ") - 1) & " " & Mid(name, InStrRev(name, " ") + 1) should do it (VB-style string functions; I'm a bit out of practice, so check the offsets). Compare the left and right parts with known prefixes and suffixes and eliminate them from your matches. Generally the middle name is rarely used except for confirming an identity, and an address or phone number tells you a lot more there. Watching for hyphenation, one can determine that if the last name is not used, then one of the middle ones would be used instead.

For searching lists of first and last names, one must consider the possibility that one of the middle ones was instead used; this would require four searches: one to filter for first & last, then another to filter first & middle, then another to filter middle & last, and then another to filter middle & middle. Ultimately, the first name is always first, and the last is always last, and there can be any number of middle names; less is more, and where zero is likely, but improbable.

Sometimes people prefer to be called Bill, Harry, Jim, Bob, Doug, Beth, Sue, or Madonna rather than their actual names; the names are similar, but it is unrealistic to expect anyone to fathom all the different possibilities.

The most polite thing you could do is ask: What can we call you?

#10


1  

You can do the obvious things: look for Jr., II, III, etc. as suffixes, and Mr., Mrs., Dr., etc. as prefixes and remove them, then first word is first name, last word is last name, everything in between are middle names. Other than that, there's no foolproof solution for this.

A perfect example is David Lee Roth (last name: Roth) and Eddie Van Halen (last name: Van Halen). If Ann Marie Smith's first name is "Ann Marie", there's no way to distinguish that from Ann having a middle name of Marie.

#11


1  

If you simply have to do this, add the guesses to the UI as an optional selection. This way, you could tell the user how you parsed the name and let them pick a different parsing from a list you provide.
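
One possible way to build that list of guesses, sketched in Python. The `candidate_parsings` name and the choice of three candidate readings are assumptions for illustration; a real UI would probably rank or filter them.

```python
def candidate_parsings(name):
    """List plausible (first, middle, last) readings, best guess first."""
    tokens = name.split()
    if len(tokens) < 3:
        # Zero, one or two words: there is only one sensible reading.
        first = tokens[0] if tokens else ""
        last = tokens[1] if len(tokens) == 2 else ""
        return [(first, "", last)]
    first, last = tokens[0], tokens[-1]
    inner = tokens[1:-1]
    return [
        (first, " ".join(inner), last),      # First Middle Last
        (" ".join(tokens[:-1]), "", last),   # multi-part first name
        (first, "", " ".join(tokens[1:])),   # multi-part last name
    ]


for option in candidate_parsings("Eddie Van Halen"):
    print(option)
```

The first entry can be pre-selected as the default, with the others offered in a dropdown or the extra little window the question describes.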

#12


1  

There is a third-party tool for this kind of thing called NetGender that works surprisingly well. I used it to parse a massive number of really malformed names in unpredictable formats. Take a look at the examples on their page; you can download and try it as well.

http://www.softwarecompany.com/dotnet/netgender.htm

I came up with these statistics based on a sampling of 4.2 million names. Name Parts means the number of distinct parts separated by spaces. A very high percentage were correct for most names in the database. The correctness went down as the parts went up, but there were very few names with >3 parts and fewer with >4. This was good enough for our case. Where the software fell down was recognizing not-well-known multi-part last names, even when separated by a comma. If it was able to decipher this, then the number of mistakes in total would have been less than 1% for all data.

Name Parts | Correct | Percent of Names in DB
2          | 100%    | 48%
3          | 98%     | 42%
4          | 70%     | 9%
5          | 45%     | 0.25%
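
As a quick sanity check on the table, the per-row accuracy can be weighted by each row's share of the database. This Python snippet just does that arithmetic on the figures above; it assumes the rounded percentages are close enough, and the variable names are arbitrary.

```python
# (name parts, share parsed correctly, share of names in DB), from the table.
rows = [
    (2, 1.00, 0.48),
    (3, 0.98, 0.42),
    (4, 0.70, 0.09),
    (5, 0.45, 0.0025),
]

# Fraction of the database that the table accounts for.
covered = sum(share for _, _, share in rows)
# Fraction of all names that were parsed correctly.
correct = sum(acc * share for _, acc, share in rows)

print(f"names covered by the table: {covered:.2%}")
print(f"parsed correctly (of all names): {correct:.2%}")
```

On these figures the overall hit rate works out to roughly 95.6% of all names, with most of the misses coming from the 4-part row.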

#13


1  

I already do this server-side on page load. I wrote a ColdFusion CFC that gets two params passed to it: the actual user data (first name, middle, last name) and the data type (first, middle, last). The routine then checks for hyphens, apostrophes and spaces and formats accordingly, e.g. MacDonald, McMurray, O'Neill, Rodham-Clinton, Eric von Dutch, G. W. Bush, Jack Burton Jr., Dr. Paul Okin, Chris di Santos. For the case where users have only one name, only the first name field is required; middle and last names are optional.

All info is stored lowercase, except Prefix, Suffix and Custom. The formatting is done on page render, not during store to the db, though there is validation filtering when the user inputs data. Sorry, I cannot post the code. I started out using regex, but it became too confusing and unreliable for all scenarios, so I used standard logic blocks (if/else, switch/case), which are easier to read and debug. MAKE EVERY INPUT/DB FIELD SEPARATE! Yes, this will take some coding, but after you are finished it should account for 99% of combinations. It is based only on English names so far, with no internationalization; that's another ball of wax.

Here are some things to consider:

  • Hyphens (e.g. Rodham-Clinton; could be in first, middle or last)
  • Apostrophes (e.g. O'Neill; could be in first, middle or last)
  • Spaces
  • Mc and Mac (e.g. McDonald, MacMurray; could be in first, middle or last)
  • First names: multiple first names (e.g. Joe Bob Briggs)
  • Last names: de, di, et, der, den, van, von, af should be lowercase (e.g. Eric von Dander, Mary di Carlo)
  • Prefix: Dr., Prof., etc.
  • Suffix: Jr., Sr., Esq., II, III, etc.

When user enters info, field schema in db is like so:

  • Prefix/Title (Dr., etc., using a dropdown)
  • Prefix/Title Custom (user can enter a custom value, e.g. Capt., using a text field)
  • First Name
  • Middle
  • Last Name
  • Suffix (Jr., III, Prof., Ret., etc., using a dropdown)
  • Suffix Custom (user can enter a custom value, e.g. CPA)

Here's the one regex I do use, to make the first letter of each name uppercase. I run this first, then the following routines format according to the rules (it's in ColdFusion format, but you get the idea):

<cfset var nameString = REReplace(LCase(nameString), "(^[[:alpha:]]|[[:blank:]][[:alpha:]])", "\U\1\E", "ALL")>
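
For readers who don't use ColdFusion, a roughly equivalent Python sketch follows. The `capitalize_name` function and the particle handling are assumptions about what one of the follow-up routines might look like, not the poster's code; the hyphen, apostrophe and Mc/Mac rules listed earlier are not covered.

```python
import re

# Particles the answer says should stay lowercase in last names.
PARTICLES = {"de", "di", "et", "der", "den", "van", "von", "af"}


def capitalize_name(raw):
    """Lowercase everything, then uppercase the first letter of each word."""
    titled = re.sub(
        r"(^|[ \t])([^\W\d_])",                     # a letter at the start or after a blank
        lambda m: m.group(1) + m.group(2).upper(),
        raw.lower(),
    )
    # Follow-up rule: put particles back in lowercase, except in first position.
    words = titled.split(" ")
    return " ".join(
        w.lower() if i > 0 and w.lower() in PARTICLES else w
        for i, w in enumerate(words)
    )


print(capitalize_name("eric VON dander"))  # Eric von Dander
```

Leaving the first word alone keeps a name like "Van Morrison" capitalized, which is a guess at the intended behavior rather than a rule from the answer.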

You could also do this client-side using JavaScript and CSS - might even be easier - but I prefer to do server-side since I need the variables set before page loads client-side.

#14


0  

I would say strip out salutations using a list, then split by space, taking list.first() as the first name and list.last() as the last name, then join the remainder with spaces and use that as the middle name. And ABOVE ALL display your results and let the user modify them!

#15


0  

Sure, there is a simple solution - split the string by spaces and count the number of tokens. If there are 2, interpret them as FIRST and LAST name; if there are 3, interpret them as FIRST, MIDDLE, and LAST.

The problem is that the simple solution will not be a 100% correct solution - someone could always enter a name with many more tokens, or could include titles, last names with a space in them (is this possible?), etc. You can come up with a solution that works for most names most of the time, but not an absolute solution.

I would follow Shad's recommendation to split the input fields.

#16


0  

You don't want to do this, unless you are only going to be contacting people from one culture.

For example:

Guido van Rossum's last name is van Rossum.

MIYAZAKI Hayao's first name is Hayao.

The most you can do is strip off common titles and salutations, and try some heuristics.

Even so, the easiest solution is to just store the full name, or ask for given and family name separately.

#17


0  

This is a fool's errand. There are too many exceptions to be able to do this deterministically. If you were doing this to pre-process a list for further review, I would contend that less would certainly be more.

  1. Strip out salutations, titles and generational suffixes (one big regex, or several small ones).
  2. If there is only one name, it is 'last'.
  3. If there are only two names, split them first, last.
  4. If there are three tokens and the middle one is an initial, split them first, middle, last.
  5. Sort the rest by hand.
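
The steps above could be sketched in Python roughly like this. The `AFFIX` pattern and the `triage` function are hypothetical, the affix list is only a sample, and returning `None` stands in for "sort by hand".

```python
import re

# Short illustrative pattern for salutations, titles and generational suffixes.
AFFIX = re.compile(r"^(mr|mrs|ms|miss|dr|prof|jr|sr|ii|iii|iv)\.?$", re.IGNORECASE)


def triage(name):
    """Return a dict of name parts, or None when the name needs manual review."""
    tokens = [t.strip(",") for t in name.split()]
    tokens = [t for t in tokens if t and not AFFIX.match(t)]   # step 1
    if len(tokens) == 1:                                        # step 2
        return {"first": "", "middle": "", "last": tokens[0]}
    if len(tokens) == 2:                                        # step 3
        return {"first": tokens[0], "middle": "", "last": tokens[1]}
    if len(tokens) == 3 and re.fullmatch(r"[A-Za-z]\.?", tokens[1]):  # step 4
        return {"first": tokens[0], "middle": tokens[1], "last": tokens[2]}
    return None                                                 # step 5


print(triage("Dr. John W. Smith, Jr."))
print(triage("John Wayne Olson"))  # None: left for a human
```

Anything the function declines to split stays intact, so nothing has to be recombined later.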

Any further processing is almost guaranteed to create more work, as you have to go through recombining what your processing split up.

#18


0  

I agree, there's no simple solution for this. But I found an awful approach in a Microsoft KB article for VB 5.0 that is an actual implementation of much of what is discussed here: http://support.microsoft.com/kb/168799

Something like this could be used in a pinch.

#19


0  

There is no 100% way to do this.

You can split on spaces, and try to understand the name all you want, but when it comes down to it, you will get it wrong sometimes. If that is good enough, go for any of the answers here that give you ways to split.

But some people will have a name like "John Wayne Olson", where "John Wayne" is the first name, and someone else will have a name like "John Wayne Olson" where "Wayne" is their middle name. There is nothing present in that name that will tell you which way to interpret it.

That's just the way it is. It's an analogue world.

My rules are pretty simple.

Take the last part --> Last Name
If there are multiple parts left, take the last part --> Middle name
What is left --> First name
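
Those three rules translate almost directly into code. A minimal Python sketch, with `split_name` as a made-up name:

```python
def split_name(name):
    """Apply the three rules above to a whitespace-separated name."""
    tokens = name.split()
    last = tokens.pop() if tokens else ""              # rule 1: last part
    middle = tokens.pop() if len(tokens) > 1 else ""   # rule 2: last of what remains
    first = " ".join(tokens)                           # rule 3: whatever is left
    return first, middle, last


print(split_name("John Wayne Olson"))    # ('John', 'Wayne', 'Olson')
print(split_name("Mary Ann Lee Smith"))  # ('Mary Ann', 'Lee', 'Smith')
```

Note that extra words end up in the first name rather than the middle name, which is the opposite of what most of the other answers do.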

But don't assume this will be 100% accurate, nor will any other hardcoded solution be. You will need to give users the ability to edit the result themselves.

#20


0  

I did something similar. The main problem I had was when people entered stuff like "Richard R. Smith, Jr." I posted my code at http://www.blackbeltcoder.com/Articles/strings/splitting-a-name-into-first-and-last-names. It's in C# but could easily be converted to VB.

#21


-1  

I agree with not doing this. The name Rick Van DenBoer would end up with a middle name of Van, but it's part of the last name.

#22


-1  

SEE MORE DISCUSSION (almost exactly 1 year ago):
http://discuss.joelonsoftware.com/default.asp?design.4.551889.41

#1


10  

There is no simple solution for this. Name construction varies from culture to culture, and even in the English-speaking world there's prefixes and suffixes that aren't necessarily part of the name.

对此没有简单的解决办法。不同的文化有不同的名称结构,甚至在英语世界里,前缀和后缀都不一定是名字的一部分。

A basic approach is to look for honorifics at the beginning of the string (e.g., "Hon. John Doe") and numbers or some other strings at the end (e.g., "John Doe IV", "John Doe Jr."), but really all you can do is apply a set of heuristics and hope for the best.

基本的方法是寻找敬称初的字符串(例如,“John Doe阁下”)和数字或其他字符串结束时(如“John Doe IV”、“John Doe Jr .)”),但实际上你所能做的就是用一组启发式和最好的希望。

It might be useful to find a list of unprocessed names and test your algorithm against it. I don't know that there's anything prepackaged out there, though.

找到一个未处理的名称列表并针对它测试算法可能会有用。但我不知道有什么是预先包装好的。

#2


31  

If you must do this parsing, I'm sure you'll get lots of good suggestions here.

如果您必须进行这种解析,我相信您将在这里得到许多很好的建议。

My suggestion is - don't do this parsing.

我的建议是——不要做这个解析。

Instead, create your input fields so that the information is already separated out. Have separate fields for title, first name, middle initial, last name, suffix, etc.

相反,创建输入字段,以便信息已经被分离出来。有独立的字段用于标题、名、中间名、姓、后缀等。

#3


14  

I know this is old and might be answers somewhere I couldn't find already, but since I couldn't find anything that works for me, this is what I came up with which I think works a lot like Google Contacts and Microsoft Outlook. It doesn't handle edge cases well, but for a good CRM type app, the user can always be asked to resolve those (in my app I actually have separate fields all the time, but I need this for data import from another app that only has one field):

我知道这是旧的,可能是我已经找不到的答案,但因为我找不到任何适合我的东西,这就是我想到的,我认为它很适合谷歌联系人和Microsoft Outlook。它不能很好地处理边界情况,但是对于一个良好的CRM类型应用程序,用户总是可以被要求解决这些问题(在我的应用中,我实际上一直都有不同的字段,但是我需要从另一个只有一个字段的应用程序中导入数据):

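    // Requires: using System.Collections.Generic; using System.Linq;
    // using System.Text.RegularExpressions;
    // Extension methods must be declared inside a static class.
    public static void ParseName(this string s, out string prefix, out string first, out string middle, out string last, out string suffix)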
    {
        prefix = "";
        first = "";
        middle = "";
        last = "";
        suffix = "";

        // Split on period, commas or spaces, but don't remove from results.
        List<string> parts = Regex.Split(s, @"(?<=[., ])").ToList();

        // Remove any empty parts
        for (int x = parts.Count - 1; x >= 0; x--)
            if (parts[x].Trim() == "")
                parts.RemoveAt(x);

        if (parts.Count > 0)
        {
            // Might want to add more to this list
            string[] prefixes = { "mr", "mrs", "ms", "dr", "miss", "sir", "madam", "mayor", "president" };

            // If first part is a prefix, set prefix and remove part
            string normalizedPart = parts.First().Replace(".", "").Replace(",", "").Trim().ToLower();
            if (prefixes.Contains(normalizedPart))
            {
                prefix = parts[0].Trim();
                parts.RemoveAt(0);
            }
        }

        if (parts.Count > 0)
        {
            // Might want to add more to this list, or use code/regex for roman-numeral detection
            string[] suffixes = { "jr", "sr", "i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x", "xi", "xii", "xiii", "xiv", "xv" };

            // If last part is a suffix, set suffix and remove part
            string normalizedPart = parts.Last().Replace(".", "").Replace(",", "").Trim().ToLower();
            if (suffixes.Contains(normalizedPart))
            {
                suffix = parts.Last().Replace(",", "").Trim();
                parts.RemoveAt(parts.Count - 1);
            }
        }

        // Done, if no more parts
        if (parts.Count == 0)
            return;

        // If only one part left...
        if (parts.Count == 1)
        {
            // If no prefix, assume first name, otherwise last
            // i.e.- "Dr Jones", "Ms Jones" -- likely to be last
            if(prefix == "")
                first = parts.First().Replace(",", "").Trim();
            else
                last = parts.First().Replace(",", "").Trim();
        }

        // If first part ends with a comma, assume format:
        //   Last, First [...First...]
        else if (parts.First().EndsWith(","))
        {
            last = parts.First().Replace(",", "").Trim();
            for (int x = 1; x < parts.Count; x++)
                first += parts[x].Replace(",", "").Trim() + " ";
            first = first.Trim();
        }

        // Otherwise assume format:
        // First [...Middle...] Last

        else
        {
            first = parts.First().Replace(",", "").Trim();
            last = parts.Last().Replace(",", "").Trim();
            for (int x = 1; x < parts.Count - 1; x++)
                middle += parts[x].Replace(",", "").Trim() + " ";
            middle = middle.Trim();
        }
    }

Sorry that the code is long and ugly; I haven't gotten around to cleaning it up. It is a C# extension method, so you would use it like:

    string name = "Miss Jessica Dark-Angel Alba";
    string prefix, first, middle, last, suffix;
    name.ParseName(out prefix, out first, out middle, out last, out suffix);

#4


5  

You probably don't need to do anything fancy really. Something like this should work.


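    // Assumes Name, GivenName, MiddleName and FamilyName are string variables
    // declared elsewhere. Requires: using System;
    Name = Name.Trim();

    // Split on spaces, ignoring runs of repeated spaces.
    string[] arrNames = Name.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    if (arrNames.Length > 0) {
        GivenName = arrNames[0];
    }
    if (arrNames.Length > 1) {
        FamilyName = arrNames[arrNames.Length - 1];
    }
    if (arrNames.Length > 2) {
        // Everything between the first and last token becomes the middle name.
        MiddleName = string.Join(" ", arrNames, 1, arrNames.Length - 2);
    }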

You may also want to check for titles first.


#5


4  

I had to do this. Actually, something much harder than this, because sometimes the "name" would be "Smith, John" or "Smith John" instead of "John Smith", or not a person's name at all but instead a name of a company. And it had to do it automatically with no opportunity for the user to correct it.


What I ended up doing was coming up with a finite list of patterns that the name could be in, like:

  • Last, First Middle-Initial
  • First Last
  • First Middle-Initial Last
  • Last, First Middle
  • First Middle Last

Throw in your Mr's, Jr's, there too. Let's say you end up with a dozen or so patterns.


My application had a dictionary of common first names, common last names (you can find these on the web), common titles, and common suffixes (jr, sr, md), and using that it was able to make real good guesses about the patterns. I'm not that smart, my logic wasn't that fancy, and yet still, it wasn't that hard to create some logic that guessed right more than 99% of the time.
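To make the pattern-plus-dictionary idea a bit more concrete, here is a minimal Python sketch. It is not the code from the application described above: the tiny `FIRST_NAMES` and `LAST_NAMES` sets, the `PATTERNS` list, and the scoring rule are all assumptions chosen for illustration. Each candidate pattern is scored by how many tokens agree with the dictionaries, and the best one wins.

```python
# Tiny stand-in dictionaries; a real system would load large lists (assumption).
FIRST_NAMES = {"john", "jane", "mary", "robert"}
LAST_NAMES = {"smith", "doe", "jones"}

# Candidate layouts, each a tuple of field names (an illustrative subset).
PATTERNS = [
    ("first", "last"),
    ("last", "first"),
    ("first", "middle", "last"),
    ("last", "first", "middle"),
]


def score(pattern, tokens):
    """Count tokens whose dictionary membership agrees with the pattern."""
    points = 0
    for field, token in zip(pattern, tokens):
        word = token.lower()
        if field == "first" and word in FIRST_NAMES:
            points += 1
        elif field == "last" and word in LAST_NAMES:
            points += 1
    return points


def guess(full_name):
    """Return a field->token dict for the best pattern, or None if none fits."""
    parts = full_name.split()
    # A comma right after the first token signals a "Last, First ..." layout.
    comma_first = bool(parts) and parts[0].endswith(",")
    tokens = full_name.replace(",", " ").split()
    candidates = [p for p in PATTERNS if len(p) == len(tokens)]
    if comma_first:
        candidates = [p for p in candidates if p[0] == "last"]
    if not candidates:
        return None  # no known pattern: leave for manual review
    best = max(candidates, key=lambda p: score(p, tokens))
    return dict(zip(best, tokens))
```

With these sample dictionaries, `guess("Smith John")` picks the last-first layout because that is the only pattern both tokens agree with; ties fall back to the earliest pattern in the list.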

#6


3  

I appreciate that this is hard to do right - but if you provide the user a way to edit the results (say, a pop-up window to edit the name if it didn't guess right) and still guess "right" for most cases... of course it's the guessing that's tough.


It's easy to say "don't do it" when looking at the problem theoretically, but sometimes circumstances dictate otherwise. Having fields for all the parts of a name (title, first, middle, last, suffix, just to name a few) can take up a lot of screen real estate - and combined with the problem of the address (a topic for another day) can really clutter up what should be a clean, simple UI.


I guess the answer should be "don't do it unless you absolutely have to, and if you do, keep it simple (some methods for this have been posted here) and provide the user the means to edit the results if needed."


#7


3  

Understanding that this is a bad idea, I wrote this regex in Perl - here's what worked best for me. I had already filtered out company names.
Output in vCard format: (hon_prefix, given_name, additional_name, family_name, hon_suffix)

    /^ \s*
        (?: ( Dr\. | Mr\. | Mr?s\. | Miss | 2nd\sLt\. | Sen\.? ) \s+ )?   # prefix (dots escaped so they match literally)
        ( \w+ | \w\. )                                                    # first name
        (?: \s+ ( \w\.? | \w\w+ ) )?                                      # middle name or initial
        (?: \s+ ( (?:[OD]['’]\s?)? [-\w]+ ) )                             # last name
        (?: ,? \s+ ( [JS]r\.? | Esq\.? | (?:M|Ph|Ed)\.?\s*D\.? |
                     R\.?N\.? | I+ ) )?                                   # suffix (M.D., Ph.D., Ed.D. grouped)
    \s* $/x

Notes:

  • Doesn't handle IV, V, VI
  • Hard-coded lists of prefixes and suffixes, evolved from a dataset of ~2K names
  • Doesn't handle multiple suffixes (e.g. MD, PhD)
  • Designed for American names - will not work properly on romanized Japanese names or other naming systems
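For readers who don't use Perl, the same idea can be sketched with Python's `re.VERBOSE` flag. This is an adaptation rather than a literal port: the dots after the prefixes are escaped, the degree alternation is grouped, and the named groups (`prefix`, `given`, `additional`, `family`, `suffix`) are labels chosen here to mirror the vCard-style fields mentioned above.

```python
import re

NAME_RE = re.compile(r"""
    ^\s*
    (?:(?P<prefix>Dr\.|Mr\.|Mr?s\.|Miss|2nd\sLt\.|Sen\.?)\s+)?   # honorific prefix
    (?P<given>\w\.|\w+)                                          # given name or initial
    (?:\s+(?P<additional>\w\w+|\w\.?))?                          # middle name or initial
    \s+(?P<family>(?:[OD]['’]\s?)?[-\w]+)                        # family name
    (?:,?\s+(?P<suffix>[JS]r\.?|Esq\.?|(?:M|Ph|Ed)\.?\s*D\.?|R\.?N\.?|I+))?  # honorific suffix
    \s*$
""", re.VERBOSE)


def parse(full_name):
    """Return a dict of the five name fields, or None if the pattern does not match."""
    match = NAME_RE.match(full_name)
    return match.groupdict() if match else None
```

One limitation worth knowing: a suffix with neither a preceding comma nor a trailing period is absorbed as the family name, so `parse("John Smith III")` reports `Smith` as the additional name and `III` as the family name.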

#8


2  

There are a few add-ins we have used in our company to accomplish this. I ended up creating a way to specify the name format for each of our different imports for different clients. There is a company with a tool that, in my experience, is well worth the price and is really incredible at tackling this subject. It's at http://www.softwarecompany.com/ and works great.

The most efficient way to do this without using any statistical approach is to split the string by commas or spaces, then:

  1. Strip titles and prefixes out.
  2. Strip suffixes out.
  3. Parse the name by the number of remaining parts (2 names = F & L; 3 names = F M L or L M F), depending on the order of the string.
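A per-import format setting of the kind described above could look roughly like the following Python sketch. The `order` parameter, the function name `parse_with_order`, and the two small word lists are assumptions for illustration; this is not the commercial tool's behavior.

```python
import re

TITLES = {"mr", "mrs", "ms", "dr"}      # assumed sample list
SUFFIXES = {"jr", "sr", "ii", "iii"}    # assumed sample list


def parse_with_order(raw, order="FML"):
    """Split on commas/spaces, drop titles and suffixes, then assign by position.

    `order` is "FML" (first, middle, last) or "LMF" (last, middle, first).
    """
    tokens = [t for t in re.split(r"[,\s]+", raw) if t]
    ignored = TITLES | SUFFIXES
    tokens = [t for t in tokens if t.rstrip(".").lower() not in ignored]
    if order == "LMF":
        tokens.reverse()  # now the first name is at index 0, as in "FML"
    result = {"first": "", "middle": "", "last": ""}
    if tokens:
        result["first"] = tokens[0]
    if len(tokens) > 1:
        result["last"] = tokens[-1]
    if len(tokens) > 2:
        middle = tokens[1:-1]
        if order == "LMF":
            middle.reverse()  # restore the original order of the middle names
        result["middle"] = " ".join(middle)
    return result
```

Making the order an explicit argument keeps the guesswork out of the parser: whoever configures an import for a given client states once which layout that client's data uses.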

#9


2  

The real solution here does not answer the question. The portent of the information must be observed. A name is not just a name; it is how we are known.


The problem here is not knowing exactly what parts are labeled what, and what they are used for. Honorable prefixes should be granted only in personal correspondence; Doctor is an honorific that is derived from a title. All information about a person is relevant to their identity; it is just a matter of determining what is relevant information. You need a first and last name for reasons of administration; phone number, email addresses, land descriptions and mailing addresses; all to the portent of identity, knowing who you are dealing with.

The real problem here is that the person gets lost in the administration. All of a sudden, after only entering their personal information into a form and submitting it to an arbitrary program for processing, they are afforded all sorts of honorifics and pleasantries spewed out by a prefabricated template. This is wrong; honorable Sir or Madam, if personal interest is shown toward the reason for correspondence, then a letter should never be written from a template. Personal correspondence requires a little knowledge about the recipient: male or female, went to school to be a doctor or judge, the culture in which they were raised.

In other cultures, a name is made up of a variable number of characters. The person's name to us can only be interpreted as a string of numbers where the spaces are actually determined by character width instead of the space character. Honorifics in these cases are instead one or many characters prefixing and suffixing the actual name. The polite thing to do is use the string you are given; if you know the honorific then by all means use it, but this again implies some sort of personal knowledge of the recipient. Calling Sensei anything other than Sensei is wrong. Not in the sense of a logic error, but in that you have just insulted your caller, and now you should find a template that helps you apologize.

For the purposes of automated, impersonal correspondence, a template may be devised for such things as daily articles, weekly issues or whatever, but the problem becomes important when correspondence is instigated by the recipient to an automated service.

What happens is an error. Missing information. Unknown or missing information will always generate an Exception. The real problem is not how you separate a person's name into its separate components with an expression, but what you call them.

The solution is to create an extra field, make it optional if there is already a first and last name, and call it "What may we call you?" or "How should we refer to you?". A doctor and a judge will ensure you address them properly. These are not programming issues, they are issues of communication.

Ok, bad way to put it, but in my opinion, Username, Tagname, and ID are worse. So my solution is the missing question: "What should we call you?"

This is only a solution where you can afford to ask a new question. Tact prevails. Create a new field on your user forms, call it Alias, label it for the user "What should we call you?", and then you have a means to communicate with them. Use the first and last name unless the recipient has given an Alias; if the recipient is personally familiar with the sender, then first and middle is acceptable.
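In code, the Alias idea amounts to one optional field and a fallback rule. The sketch below is only an illustration; the `Contact` class and its `salutation_name` method are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    first: str
    last: str
    alias: str = ""  # optional answer to "What should we call you?"

    def salutation_name(self):
        """Prefer the alias the person chose; otherwise fall back to first and last name."""
        return self.alias.strip() or f"{self.first} {self.last}".strip()
```

For example, `Contact("William", "Gates", alias="Bill").salutation_name()` gives `"Bill"`, while a contact without an alias is addressed by first and last name.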

To Me, _______________________ (standard subscribed correspondence)
To Me ( Myself | I ), ________ (standard recipient-instigated correspondence)
To Me Myself I, ______________ (look out, it's your mother, and you're in big trouble;
                                nobody addresses a person by their actual full name)

Dear *(Mr./Mrs./Ms./Dr./Hon./Sen.) Me M. I *(I),
To Whom it may Concern;

Otherwise you are looking for something standard: hello, greetings, you may be a winner.


Where you have data that is a person's name all in one string, you don't have a problem because you already have their alias. If what you need is the first and last name, then just use Left(name, InStr(name, " ") - 1) & " " & Right(name, Len(name) - InStrRev(name, " ")); my math may still be off, I'm a bit out of practice. Compare left and right with known prefixes and suffixes and eliminate them from your matches. Generally the middle name is rarely used except for confirming an identity, for which an address or phone number tells you a lot more. Watching for hyphenation, one can determine that if the last name is not used, then one of the middle ones would be instead.

For searching lists of first and last names, one must consider the possibility that one of the middle ones was instead used; this would require four searches: one to filter for first & last, then another to filter first & middle, then another to filter middle & last, and then another to filter middle & middle. Ultimately, the first name is always first, and the last is always last, and there can be any number of middle names; less is more, and where zero is likely, but improbable.


Sometimes people prefer to be called Bill, Harry, Jim, Bob, Doug, Beth, Sue, or Madonna rather than their actual names; it is unrealistic to expect anyone to fathom all the different possibilities.

The most polite thing you could do is ask: What can we call you?

#10


1  

You can do the obvious things: look for Jr., II, III, etc. as suffixes, and Mr., Mrs., Dr., etc. as prefixes and remove them; then the first word is the first name, the last word is the last name, and everything in between is a middle name. Other than that, there's no foolproof solution for this.

A perfect example is David Lee Roth (last name: Roth) and Eddie Van Halen (last name: Van Halen). If Ann Marie Smith's first name is "Ann Marie", there's no way to distinguish that from Ann having a middle name of Marie.

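One partial mitigation for the Van Halen case is to treat a short list of surname particles as part of the last name. The Python sketch below assumes such a list (`PARTICLES`, invented here and incomplete); it does nothing for the "Ann Marie" ambiguity, which cannot be resolved from the string alone.

```python
PARTICLES = {"van", "von", "de", "di", "der", "den", "la", "le"}  # assumed sample


def split_with_particles(full_name):
    """Return (first, middle, last), folding particles such as "Van" into the last name."""
    tokens = full_name.split()
    if not tokens:
        return "", "", ""
    if len(tokens) == 1:
        return tokens[0], "", ""
    last_start = len(tokens) - 1
    # Walk left while the preceding token is a particle, but never consume the first token.
    while last_start > 1 and tokens[last_start - 1].lower() in PARTICLES:
        last_start -= 1
    first = tokens[0]
    middle = " ".join(tokens[1:last_start])
    last = " ".join(tokens[last_start:])
    return first, middle, last
```

Because the first token is never absorbed, a name such as "Van Morrison" still comes out as first name "Van" and last name "Morrison".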

#11


1  

If you simply have to do this, add the guesses to the UI as an optional selection. This way, you could tell the user how you parsed the name and let them pick a different parsing from a list you provide.

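A list of alternative parsings for the user to pick from can be generated mechanically. The following Python sketch is one possible way to do it, with the function name `candidate_parsings` and the ranking rule (prefer single-word first and last names) being assumptions of this example.

```python
def candidate_parsings(full_name):
    """List plausible (first, middle, last) splits, most conventional first."""
    tokens = full_name.split()
    if len(tokens) < 2:
        return [(full_name.strip(), "", "")]
    candidates = []
    # Every way of cutting the tokens into first | middle | last,
    # with a non-empty first and last name.
    for first_end in range(1, len(tokens)):
        for last_start in range(first_end, len(tokens)):
            candidates.append((
                " ".join(tokens[:first_end]),
                " ".join(tokens[first_end:last_start]),
                " ".join(tokens[last_start:]),
            ))
    # Rank splits with fewer words in the first and last names higher.
    candidates.sort(
        key=lambda c: (len(c[0].split()) + len(c[2].split()), len(c[0].split()))
    )
    return candidates
```

The first entry can be shown as the default guess, with the remaining entries offered in a dropdown or pop-up.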

#12


1  

There is a 3rd party tool for this kind of thing called NetGender that works surprisingly well. I used it to parse a massive amount of really malformed names in unpredictable formats. Take a look at the examples on their page; you can download and try it as well.

http://www.softwarecompany.com/dotnet/netgender.htm


I came up with these statistics based on a sampling of 4.2 million names. Name Parts means the number of distinct parts separated by spaces. A very high percentage were correct for most names in the database. The correctness went down as the number of parts went up, but there were very few names with >3 parts and fewer with >4. This was good enough for our case. Where the software fell down was in recognizing not-well-known multi-part last names, even when separated by a comma. Had it been able to decipher these, the total number of mistakes would have been less than 1% for all data.

Name Parts | Correct | Percent of Names in DB
2          | 100%    | 48%
3          | 98%     | 42%
4          | 70%     | 9%
5          | 45%     | 0.25%

#13


1  

I already do this server-side on page load. I wrote a ColdFusion CFC that gets two params passed to it: the actual user data (first name, middle, last name) and the data type (first, middle, last). The routine then checks for hyphens, apostrophes, and spaces and formats accordingly, e.g. MacDonald, McMurray, O'Neill, Rodham-Clinton, Eric von Dutch, G. W. Bush, Jack Burton Jr., Dr. Paul Okin, Chris di Santos. For the case where users only have one name, only the first name field is required; middle and last names are optional.

All info is stored lowercase, except Prefix, Suffix and Custom. The formatting is done on page render, not during store to db, though there is validation filtering when the user inputs data. Sorry, cannot post code. I started out using Regex, but it became too confusing and unreliable for all scenarios, so I used standard logic blocks (if/else, switch/case), which are easier to read and debug. MAKE EVERY INPUT/DB FIELD SEPARATE! Yes, this will take some coding, but after you are finished it should account for 99% of combinations. It is only based on English names so far, with no internationalization; that's another ball of wax.

Here are some things to consider:

  • Hyphens (e.g. Rodham-Clinton; could be in first, middle or last)
  • Apostrophes (e.g. O'Neill; could be in first, middle or last)
  • Spaces
  • Mc and Mac (e.g. McDonald, MacMurray; could be in first, middle or last)
  • First names: multiple first names (e.g. Joe Bob Briggs)
  • Last names: de, di, et, der, den, van, von, af should be lowercase (e.g. Eric von Dander, Mary di Carlo)
  • Prefix: Dr., Prof., etc.
  • Suffix: Jr., Sr., Esq., II, III, etc.

When user enters info, field schema in db is like so:


  • Prefix/Title (Dr., etc., using a dropdown)
  • Prefix/Title Custom (user can enter a custom value, e.g. Capt., using a text field)
  • First Name
  • Middle
  • Last Name
  • Suffix (Jr., III, Prof., Ret., etc., using a dropdown)
  • Suffix Custom (user can enter a custom value, e.g. CPA)

Here's the one Regex I do use, to make the first letter of each name uppercase. I run this first, then the following routines format according to the rules (it's in ColdFusion format, but you get the idea):

<cfset var nameString = REReplace(LCase(nameString), "(^[[:alpha:]]|[[:blank:]][[:alpha:]])", "\U\1\E", "ALL")>

You could also do this client-side using JavaScript and CSS - might even be easier - but I prefer to do server-side since I need the variables set before page loads client-side.

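Since the ColdFusion code itself was not posted, here is a hedged Python sketch of what display-time formatting along these lines might look like. The function name `format_name_part`, the `is_last_name` flag, and the order of the steps are assumptions; only the particle list and the capitalize-each-word step come from the answer. "Mac" is deliberately left out, because a naive rule would also turn "Macy" into "MacY".

```python
import re

LOWERCASE_PARTICLES = {"de", "di", "et", "der", "den", "van", "von", "af"}


def format_name_part(stored, is_last_name=False):
    """Format one lowercase-stored name part for display (English names only)."""
    # Uppercase the first letter of each space-separated word,
    # the same effect as the REReplace call above.
    text = re.sub(r"(^[a-z]|\s[a-z])", lambda m: m.group(1).upper(), stored.lower())
    # Uppercase the letter after a hyphen or apostrophe (Rodham-Clinton, O'Neill).
    text = re.sub(r"([-'])([a-z])", lambda m: m.group(1) + m.group(2).upper(), text)
    # Uppercase the letter after a leading "Mc" (McMurray).
    text = re.sub(r"\bMc([a-z])", lambda m: "Mc" + m.group(1).upper(), text)
    if is_last_name:
        # Known particles in last names stay lowercase (von Dutch, di Santos).
        words = [w.lower() if w.lower() in LOWERCASE_PARTICLES else w
                 for w in text.split(" ")]
        text = " ".join(words)
    return text
```

Storing the raw value and formatting only on output, as the answer describes, means the rules can be corrected later without touching the stored data.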

#14


0  

I would say: strip out salutations from a list, then split by space, placing list.first() as the first name and list.last() as the last name, then join the remainder by a space and have that as the middle name. And ABOVE ALL display your results and let the user modify them!

#15


0  

Sure, there is a simple solution - split the string by spaces and count the number of tokens; if there are 2, interpret them as FIRST and LAST name, and if there are 3, interpret them as FIRST, MIDDLE, and LAST.

The problem is that the simple solution will not be a 100% correct solution - someone could always enter a name with many more tokens, or could include titles, last names with a space in them (is this possible?), etc. You can come up with a solution that works for most names most of the time, but not an absolute solution.

I would follow Shad's recommendation to split the input fields.


#16


0  

You don't want to do this, unless you are only going to be contacting people from one culture.


For example:


Guido van Rossum's last name is van Rossum.


MIYAZAKI Hayao's first name is Hayao.


The most you can do is strip off common titles and salutations, and try some heuristics.

Even so, the easiest solution is to just store the full name, or ask for given and family name separately.

#17


0  

This is a fool's errand. There are too many exceptions to be able to do this deterministically. If you were doing this to pre-process a list for further review, I would contend that less would certainly be more.

  1. Strip out salutations, titles and generational suffixes (big regex, or several small ones).
  2. If only one name, it is 'last'.
  3. If only two names, split them first, last.
  4. If three tokens and the middle is an initial, split them first, middle, last.
  5. Sort the rest by hand.

Any further processing is almost guaranteed to create more work, as you have to go through recombining what your processing split up.
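The rules listed above translate almost directly into code. This Python sketch assumes the salutations, titles and suffixes have already been stripped and the name split into tokens; the function name `triage` and the use of `None` to mean "sort by hand" are choices made for this example.

```python
def triage(tokens):
    """Apply the conservative rules to stripped tokens; None means manual review."""
    if len(tokens) == 1:
        return {"first": "", "middle": "", "last": tokens[0]}
    if len(tokens) == 2:
        return {"first": tokens[0], "middle": "", "last": tokens[1]}
    # Three tokens are only split automatically when the middle one is an initial.
    if len(tokens) == 3 and len(tokens[1].rstrip(".")) == 1:
        return {"first": tokens[0], "middle": tokens[1], "last": tokens[2]}
    return None  # everything else is sorted by hand
```

Note that a three-token name with a full middle word, such as "David Lee Roth", is intentionally left for manual review rather than guessed.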

#18


0  

I agree, there's no simple solution for this. But I found an awful approach in a Microsoft KB article for VB 5.0 that is an actual implementation of much of what is discussed here: http://support.microsoft.com/kb/168799

Something like this could be used in a pinch.


#19


0  

There is no 100% way to do this.


You can split on spaces, and try to understand the name all you want, but when it comes down to it, you will get it wrong sometimes. If that is good enough, go for any of the answers here that give you ways to split.


But some people will have a name like "John Wayne Olson", where "John Wayne" is the first name, and someone else will have a name like "John Wayne Olson" where "Wayne" is their middle name. There is nothing present in that name that will tell you which way to interpret it.


That's just the way it is. It's an analogue world.


My rules are pretty simple.


Take the last part --> Last Name
If there are multiple parts left, take the last part --> Middle name
What is left --> First name

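The three rules above fit in a few lines. Here is a minimal Python sketch (the function name `split_name` is made up for this example); unlike the first-word-is-first-name approaches, any extra leading words stay with the first name.

```python
def split_name(full_name):
    """Return (first, middle, last) by peeling parts off the end of the name."""
    parts = full_name.split()
    # The last part is always the last name.
    last = parts.pop() if parts else ""
    # If more than one part remains, the last of those is the middle name.
    middle = parts.pop() if len(parts) > 1 else ""
    # Whatever is left is the first name.
    first = " ".join(parts)
    return first, middle, last
```

So "John Wayne Olson" is read as first "John", middle "Wayne", last "Olson", which is exactly the guess that will be wrong for the person whose first name is "John Wayne".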

But don't assume this will be 100% accurate, nor will any other hardcoded solution. You will need to let the user edit the result themselves.

#20


0  

I did something similar. The main problem I had was when people entered stuff like "Richard R. Smith, Jr." I posted my code at http://www.blackbeltcoder.com/Articles/strings/splitting-a-name-into-first-and-last-names. It's in C# but could easily be converted to VB.


#21


-1  

I agree: don't do this. The name Rick Van DenBoer would end up with a middle name of Van, but it's part of the last name.

#22


-1  

SEE MORE DISCUSSION (almost exactly 1 year ago):
http://discuss.joelonsoftware.com/default.asp?design.4.551889.41
