The best way to keep a TEXT field unique in a MySQL database

Date: 2021-11-25 22:18:21

I want to make the value of a TEXT field unique in my MySQL table.

After a little research I found that everybody discourages using a UNIQUE index on TEXT fields because of performance issues. What I want to do now is:

1) create another field to hold a hash of the TEXT value (md5(text_value))

2) put a UNIQUE index on this hash field

3) use INSERT IGNORE in queries

Is this solution complete, secure, and optimal? (I found it on SO.)

Is there a better way of achieving this goal?
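
For concreteness, the three steps above might look like the sketch below. It is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for a MySQL connection so that the example runs on its own, and SQLite spells the statement INSERT OR IGNORE where MySQL uses INSERT IGNORE. The table name notes and the columns body and body_md5 are made up for the example.

```python
import hashlib
import sqlite3


def md5_hex(text: str) -> str:
    """Return the MD5 digest of the UTF-8 encoded text as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def create_schema(conn: sqlite3.Connection) -> None:
    # Steps 1 and 2: a separate hash column with a UNIQUE constraint on it.
    # "notes", "body" and "body_md5" are hypothetical names.
    conn.execute(
        "CREATE TABLE notes ("
        " id INTEGER PRIMARY KEY,"
        " body TEXT NOT NULL,"
        " body_md5 CHAR(32) NOT NULL UNIQUE)"
    )


def insert_ignore(conn: sqlite3.Connection, text: str) -> bool:
    # Step 3: SQLite's INSERT OR IGNORE plays the role of MySQL's INSERT IGNORE.
    before = conn.total_changes
    conn.execute(
        "INSERT OR IGNORE INTO notes (body, body_md5) VALUES (?, ?)",
        (text, md5_hex(text)),
    )
    return conn.total_changes > before  # True if a row was actually inserted


conn = sqlite3.connect(":memory:")
create_schema(conn)
first = insert_ignore(conn, "hello world")
second = insert_ignore(conn, "hello world")  # same text, silently skipped
row_count = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
```

The second insert is dropped without raising an error, so the caller only learns that the hash was already present, not whether the stored text is actually the same.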

2 answers

#1


3  

Since I was asked in the comments how I would solve this, I'll write it up as an answer.

Being in this situation suggests a mistake in the application design. Consider what it means.

You have text whose length you cannot specify in advance, which can be extremely long (up to 64 KB), and which you want to keep unique. Imagine that amount of data split into separate keys and combined into a composite index to enforce uniqueness. That is what you are trying to do. In terms of integers, this would be an index of roughly 16,000 4-byte integers joined into one composite index.

Consider further that character-type fields (CHAR, VARCHAR, TEXT) are subject to interpretation by encoding, which complicates the issue further.
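
One way to see the encoding point: the same visible string maps to different bytes under different encodings, so any byte-level comparison or hash of it differs too. A small sketch in Python (the sample string is arbitrary, and the two encodings are picked only for illustration):

```python
import hashlib

text = "café"

utf8_bytes = text.encode("utf-8")      # 'é' takes two bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # 'é' takes one byte in Latin-1

utf8_md5 = hashlib.md5(utf8_bytes).hexdigest()
latin1_md5 = hashlib.md5(latin1_bytes).hexdigest()

print(len(utf8_bytes), len(latin1_bytes))  # 5 4
print(utf8_md5 == latin1_md5)              # False
```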

I'd highly recommend splitting the data up somehow. This not only frees the DBMS from having to index variable-length character blocks, but may also make it possible to build composite keys over parts of the data. You might even find a better storage solution for your data.

If you have questions, I'd suggest posting the table and/or database structure and explaining what logical data the TEXT field contains and why you think it needs to be unique.

#2


1  

It’s almost complete. There is a chance (see the Birthday Paradox) of a hash collision, so a UNIQUE index on the hash alone isn’t enough.
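
To put a rough number on that chance: for an ideal b-bit hash and n distinct values, the birthday approximation gives a collision probability of about 1 - exp(-n(n-1)/2^(b+1)). The sketch below evaluates it for MD5's 128 bits. It assumes the hash output behaves like uniformly random values, which does not hold for inputs deliberately crafted by an attacker.

```python
import math


def collision_probability(n: int, bits: int = 128) -> float:
    """Birthday approximation: chance that n random values of `bits` bits collide."""
    exponent = -n * (n - 1) / 2 ** (bits + 1)
    # -expm1(x) equals 1 - exp(x) but stays accurate when x is tiny
    return -math.expm1(exponent)


for n in (10**6, 10**9, 10**12):
    print(f"{n:>16,d} rows -> p = {collision_probability(n):.3e}")
```

For a billion rows the result is on the order of 10^-21, so accidental collisions are extremely unlikely, though not impossible.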

You’re better off using the hash together with a comparison check to be completely safe.

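-- my_table and @new_text are placeholders for the table name and the candidate value
SELECT COUNT(*) FROM my_table
WHERE md5hash = MD5(@new_text)
  AND textvalue = @new_text;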

This could be wrapped in an INSERT or UPDATE trigger, or maybe even a stored procedure, for easy checking.
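
A sketch of the same check done in application code rather than in a trigger, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs on its own. The table texts and the functions add_if_new and always_same are hypothetical; the columns md5hash and textvalue follow the query above. The hash function is a parameter only so that a collision can be simulated.

```python
import hashlib
import sqlite3
from typing import Callable


def md5_hex(text: str) -> str:
    """Return the MD5 digest of the UTF-8 encoded text as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def add_if_new(conn: sqlite3.Connection, text: str,
               hash_fn: Callable[[str], str] = md5_hex) -> bool:
    """Insert text unless an identical text is already stored."""
    digest = hash_fn(text)
    # The indexed hash narrows the search; comparing the full text
    # rules out a false match caused by a hash collision.
    (matches,) = conn.execute(
        "SELECT COUNT(*) FROM texts WHERE md5hash = ? AND textvalue = ?",
        (digest, text),
    ).fetchone()
    if matches:
        return False
    conn.execute(
        "INSERT INTO texts (md5hash, textvalue) VALUES (?, ?)",
        (digest, text),
    )
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE texts (md5hash CHAR(32) NOT NULL, textvalue TEXT NOT NULL)")
# Plain (non-unique) index, so two different texts may share a hash value.
conn.execute("CREATE INDEX idx_texts_md5hash ON texts (md5hash)")

added_first = add_if_new(conn, "alpha")
added_again = add_if_new(conn, "alpha")  # identical text: rejected


def always_same(_text: str) -> str:
    # Deliberately useless hash, used only to force a collision.
    return "0" * 32


added_a = add_if_new(conn, "beta", hash_fn=always_same)
added_b = add_if_new(conn, "gamma", hash_fn=always_same)  # same hash, different text: kept
```

The check and the insert are two separate statements here, so concurrent writers would need a transaction or lock around them; that part is left out of the sketch.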

Have a look at this Stack Overflow question for an example of a hash collision.
