Is normalizing a person's name going too far?

Time: 2022-09-13 10:58:37

You usually normalize a database to avoid data redundancy. It's easy to see in a table full of names that there is plenty of redundancy. If your goal is to create a catalog of the names of every person on the planet (good luck), I can see how normalizing names could be beneficial. But in the context of the average business database is it overkill?

(Of course I know you could take anything to an extreme... say if you normalized down to syllables... or even adjacent character pairs. I can't see a benefit in going that far)

Update:

One possible justification for this is a random name generator. That's all I could come up with off the top of my head.

19 answers

#1


35  

Database normalization usually refers to normalizing the fields, not their content. In other words, you would make sure there is only one first-name field in the database. That is generally worthwhile. However, the data content should not be normalized, since it is individual to that person: you are not picking from a list, and you are not changing a list in one place to affect everybody. That would be a bug, not a feature.
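
To make the "bug, not a feature" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `first_name` and `person` tables are hypothetical and exist only for illustration; the point is that editing a shared name row renames everyone who references it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical over-normalized schema: people point at a shared name row.
    CREATE TABLE first_name (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        first_name_id INTEGER REFERENCES first_name(id)
    );
    INSERT INTO first_name VALUES (1, 'Bill');
    INSERT INTO person VALUES (1, 1), (2, 1), (3, 1);
""")

# One person asks to be called Joe; editing the shared row renames all three.
conn.execute("UPDATE first_name SET name = 'Joe' WHERE id = 1")

names = [
    row[0]
    for row in conn.execute(
        "SELECT f.name FROM person AS p "
        "JOIN first_name AS f ON f.id = p.first_name_id ORDER BY p.id"
    )
]
```

To rename a single person correctly in this design, you would have to find or create a separate 'Joe' row and repoint only that person's foreign key.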

#2


53  

Yes, it's overkill.

People don't change their names from Bill to Joe all at once.

#3


5  

How do you normalize a name? Not all names have the same structure. Not all countries or cultures use the same rules for names. A first name is not necessarily just a first name. People have variable numbers of names. Some countries don't have the simple firstname/lastname pair. What if my first name just so happens to be your last name? Should they be considered the same in your database? If not, then you get into the problem that a last name might mean different things in different countries. In most countries I know of, it is a family name: your last name is the same as the last name of at least one of your parents. In Iceland, it is your father's first name, followed by "son" or "daughter". So the same last name will mean completely different things depending on whether you encounter it in Iceland or the US.

In some cultures it is common for the woman to take her husband's last name when getting married. In other cultures, that's completely optional, or might even work the opposite way.

How can you normalize this? What information would it gain you? If you find someone in your database who has "Smith" as the last word making up their name, what does that tell you? It might not be their family name. It might only be part of the family name. It might be an honorific in some language that, according to their culture, should be considered part of the name.

You can only normalize data if it follows a common structure.

#4


2  

Yes, definitely overkill. What's a few dozen bytes between friends?

#5


2  

Maybe if you work in the Census office it might make sense. Otherwise, see every other answer :)

#6


1  

I would say yes, it is going too far in 95%+ of the cases.

#7


1  

Yes. I cannot think of an instance where the benefits outweigh the problems and query complications.

#8


1  

No, but you might want to normalise to a canonical record for a customer (so you don't get 5 different entries for 'Bloggs & Co.' in your database). This is a data cleansing issue that often bites on MIS projects.

#9


1  

You often don't go beyond fourth normal form in a database. Going to something like a seventh normal form is therefore quite a bit overboard. The only place this might even be a remotely plausible idea is in some kind of massive data warehouse.

#10


1  

Generally yes. Normalizing to that level would be going too far. Depending on the queries (such as phone books, where searches by last name are common) it might be worthwhile. I expect that to be rare.

#11


1  

If you needed to perform queries based on diminutive names, I could see a need for normalizing the names. For example, a search for "Betty" may need to return results for "Betty", "Beth", and "Elizabeth".
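
A rough sketch of what such a lookup could look like in Python. The `CANONICAL` mapping and both function names are made up for illustration, and the handful of entries is nowhere near a real nickname list; it only shows the idea of resolving a search term to every variant that shares a canonical name.

```python
# Hypothetical nickname table: each variant maps to one canonical given name.
CANONICAL = {
    "betty": "elizabeth",
    "beth": "elizabeth",
    "elizabeth": "elizabeth",
    "bill": "william",
    "william": "william",
}


def expand_name(term: str) -> set:
    """Return every known variant that shares a canonical name with `term`."""
    key = term.strip().lower()
    canonical = CANONICAL.get(key)
    if canonical is None:
        # Unknown names only match themselves.
        return {key}
    return {variant for variant, canon in CANONICAL.items() if canon == canonical}


def search(people: list, term: str) -> list:
    """Return the people whose first word is any variant of `term`.

    Assumes every entry in `people` is a non-empty "First Last" string.
    """
    variants = expand_name(term)
    return [p for p in people if p.split()[0].lower() in variants]
```

Note that this keeps each person's name stored as entered; only the search side knows about the variants.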

#12


0  

I generally haven't seen a need to normalize the name, mainly because that adds the cost of a join that will be needed on every query, and doesn't give any benefit.

If you have so many similar names, and have a storage problem then it may be worth it, but there will be a performance hit that would need to be considered.

#13


0  

I would say it is absolutely overkill. In most applications you display people's names so often that every query involving them is going to look that much more complex and be harder to read.
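
For a sense of the difference, here is a small sketch with Python's built-in sqlite3 module. All table and column names are hypothetical. Both queries return the same rows, but the normalized one needs two joins just to print a name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Flat design: names stored directly on the person row.
    CREATE TABLE person_flat (
        id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT
    );
    INSERT INTO person_flat VALUES (1, 'Bill', 'Smith'), (2, 'Joe', 'Smith');

    -- Over-normalized design: each distinct name lives in a lookup table.
    CREATE TABLE first_name (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE last_name (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE person_norm (
        id INTEGER PRIMARY KEY,
        first_name_id INTEGER REFERENCES first_name(id),
        last_name_id INTEGER REFERENCES last_name(id)
    );
    INSERT INTO first_name VALUES (1, 'Bill'), (2, 'Joe');
    INSERT INTO last_name VALUES (1, 'Smith');
    INSERT INTO person_norm VALUES (1, 1, 1), (2, 2, 1);
""")

FLAT_QUERY = "SELECT first_name, last_name FROM person_flat ORDER BY id"

NORM_QUERY = """
    SELECT f.name, l.name
    FROM person_norm AS p
    JOIN first_name AS f ON f.id = p.first_name_id
    JOIN last_name AS l ON l.id = p.last_name_id
    ORDER BY p.id
"""

flat_rows = conn.execute(FLAT_QUERY).fetchall()
norm_rows = conn.execute(NORM_QUERY).fetchall()
```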

#14


0  

Yes, it is. It is commonly recognized that just applying all of the Rules of Normalization can cause you to go way too far and end up with an overnormalized database. For example, it would be possible to normalize every instance of every character to a reference to a character enumeration table. It's easy to see that that's ridiculous.

Normalization needs to be performed at a level that is appropriate for your problem domain. Overnormalization is as much a problem as undernormalization (although, of course, for different reasons).

#15


0  

There might be a case where being able to link married/maiden names would be useful.
I recently had a case where I had to rename thousands of emails in Exchange because somebody got divorced and didn't want any emails listing her as married_name@company.com

#16


0  

No need to normalize to that level unless the names make up a composite primary key and you have data that is dependent on one of the names (e.g. anyone with the surname Plummer knows nothing about databases). In which case, by not normalizing, you would violate second normal form.

#17


0  

I agree with the general response: you wouldn't do that.

One thing comes to mind though: compression. If you had a billion people and you found that 60% of first names were pulled from 5 very common names, you could use some tricky bit manipulation to reduce the size very significantly. It would also require very customized database software.

But this isn't for the purpose of normalization, just compression.
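
As a rough illustration of the compression idea, here is plain dictionary encoding in Python rather than the bit-level packing described above. The function names and the `top_n` parameter are invented for this sketch: the most common names are replaced by small integer codes, and rare names are stored as-is.

```python
from collections import Counter


def dictionary_encode(names, top_n=5):
    """Replace the `top_n` most common names with small integer codes."""
    common = [name for name, _ in Counter(names).most_common(top_n)]
    code_of = {name: i for i, name in enumerate(common)}
    # Common names become ints; everything else stays a string.
    encoded = [code_of.get(name, name) for name in names]
    return common, encoded


def dictionary_decode(common, encoded):
    """Reverse dictionary_encode using the same `common` list."""
    return [common[v] if isinstance(v, int) else v for v in encoded]
```

A real implementation would pack the codes into a few bits each, which is where the custom database software comes in.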

#18


0  

You should normalize it out if you need to avoid the delete anomaly that comes with not breaking it out. That is, if you ever need to answer the question "has my database ever had a person named 'Joejimbobjake' in it?", you need to avoid the anomaly. Soft deletes are probably a much better way than having a comprehensive first name table (for example), but you get my point.
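
A minimal sketch of the soft-delete alternative, using Python's built-in sqlite3 module. The `person` table, the `deleted_at` column, and the helper names are assumptions made for this example. Rows are marked as deleted instead of being removed, so the "has this name ever existed" question can still be answered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    "id INTEGER PRIMARY KEY, first_name TEXT, deleted_at TEXT)"
)
conn.execute("INSERT INTO person (first_name) VALUES ('Joejimbobjake')")


def soft_delete(conn, person_id):
    """Mark a row as deleted instead of removing it."""
    conn.execute(
        "UPDATE person SET deleted_at = datetime('now') WHERE id = ?",
        (person_id,),
    )


def ever_had(conn, first_name):
    """True if the name appears on any row, deleted or not."""
    row = conn.execute(
        "SELECT 1 FROM person WHERE first_name = ? LIMIT 1", (first_name,)
    ).fetchone()
    return row is not None


def currently_has(conn, first_name):
    """True only if a live (not soft-deleted) row has the name."""
    row = conn.execute(
        "SELECT 1 FROM person "
        "WHERE first_name = ? AND deleted_at IS NULL LIMIT 1",
        (first_name,),
    ).fetchone()
    return row is not None


soft_delete(conn, 1)
```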

#19


0  

In addition to all the points everyone else has made, consider that if you were implementing a data entry operation (for example) and were to insert a new contact, you would have to search your first name and last name tables to locate the correct IDs and then use those values. This is further complicated when the name is not in the FN and/or LN tables: then you have to insert the new first/last name and use the new ID(s).
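
The extra work on every insert could be sketched like this with Python's built-in sqlite3 module. The schema and the `get_or_create_id` and `insert_contact` helpers are hypothetical, written only to show the look-up-or-insert step that a flat table would not need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE first_name (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE last_name (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE contact (
        id INTEGER PRIMARY KEY,
        first_name_id INTEGER NOT NULL REFERENCES first_name(id),
        last_name_id INTEGER NOT NULL REFERENCES last_name(id)
    );
""")


def get_or_create_id(conn, table, name):
    """Return the id for `name`, inserting it into `table` first if missing."""
    # `table` is checked against a fixed whitelist, never taken from user input.
    if table not in ("first_name", "last_name"):
        raise ValueError(f"unexpected table: {table}")
    conn.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    row = conn.execute(
        f"SELECT id FROM {table} WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


def insert_contact(conn, first, last):
    """Insert a contact, resolving both name IDs along the way."""
    first_id = get_or_create_id(conn, "first_name", first)
    last_id = get_or_create_id(conn, "last_name", last)
    cur = conn.execute(
        "INSERT INTO contact (first_name_id, last_name_id) VALUES (?, ?)",
        (first_id, last_id),
    )
    return cur.lastrowid
```

With a flat contact table, the same operation is a single INSERT.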

And if you think that you have a comprehensive list of names, think again. I work with a list of over 200k unique first names and I'd guess it represents 99.9% of the US population. But that remaining 0.1% is still a lot of people. And don't forget the foreign names and misspellings...
