Multilingual database encoding in a search engine

Date: 2022-03-15 12:56:09

I have a MySQL database in which I store more than 100,000 keywords, each in several languages. For example, I have three columns: [id] [turkish (utf8_turkish_ci)] [german (utf8)]

Users can enter a German or a Turkish word in the search box. If the user enters a German word, everything works and the Turkish word is printed. But how do I handle the case where the user enters a Turkish word? I ask because each language has its own additional characters, such as ä, ü, ö, ş, etc.

So should I use

mb_convert_encoding

to convert the string? But then I would have to check whether it is a German or a Turkish string, which I think would be too complex. Or is the encoding of the tables wrong?

I'm stuck now. How can I implement this so that the user can enter a keyword in either language?

1 solution

#1


You have several issues to solve to make this work correctly.

First, you've chosen the utf8 character set to hold all your text. That is a good choice. If this is a new-in-2016 application, you might choose the utf8mb4 character set instead: MySQL's utf8 stores at most three bytes per character, while utf8mb4 covers all of Unicode. Once you have chosen a character set, and your connection uses it as well, your users should be able to read your text.
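
To see which character sets and collations are actually in play, something like the following may help. This is a minimal sketch; the table name `keywords` is an assumption and does not come from the question.

```sql
-- Make the client connection use utf8mb4, so text is not converted
-- to some other character set on the way in or out.
SET NAMES utf8mb4;

-- Show the character sets and collations the current session uses.
SHOW VARIABLES LIKE 'character_set_%';
SHOW VARIABLES LIKE 'collation_%';

-- Show the collation of every column (`keywords` is a hypothetical table name).
SHOW FULL COLUMNS FROM keywords;
```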

Second, for the sake of searching and sorting (WHERE and ORDER BY) you need to choose an appropriate collation for each language. For modern German, utf8_general_ci will work tolerably well. utf8_unicode_ci works a little better if you need standard lexical ordering. See http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html

For modern Spanish, you should use utf8_spanish_ci. That's because in Spanish the N and Ñ characters are not considered the same. I don't know whether the general collation works for Turkish.
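
One way to find out how a given collation treats Turkish letters is to ask the server directly. The sketch below assumes the connection uses utf8 and that the server provides the utf8_turkish_ci collation named in the question. Each expression returns 1 when the collation considers the two strings equal and 0 otherwise.

```sql
SET NAMES utf8;

-- List the Turkish collations this server knows about.
SHOW COLLATION LIKE 'utf8%turkish%';

-- Compare the same letter pairs under the general and the Turkish collation.
SELECT
    _utf8 'ş' = _utf8 's' COLLATE utf8_general_ci AS s_general,
    _utf8 'ş' = _utf8 's' COLLATE utf8_turkish_ci AS s_turkish,
    _utf8 'ı' = _utf8 'i' COLLATE utf8_general_ci AS i_general,
    _utf8 'ı' = _utf8 'i' COLLATE utf8_turkish_ci AS i_turkish;
```

If the two collations disagree on a pair, a search through the general collation will match keywords that a Turkish speaker would regard as different words.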

Notice that you seem to have confused the notions of character set and collation in your question. You've mentioned a collation with your Turkish column and a character set with your German column.

You can explicitly specify character set and collation in queries. For example, you can write

    WHERE _utf8 'München' COLLATE utf8_unicode_ci = mytable.name;

In this expression, _utf8 'München' is a character constant with an explicit character set introducer, and

    constant COLLATE utf8_unicode_ci = mytable.name

is a comparison that applies an explicit collation. Here mytable stands in for your own table name. See http://dev.mysql.com/doc/refman/5.7/en/charset-collate.html
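
Applied to the table from the question, a lookup that accepts a keyword in either language might look like the sketch below. The table name `keywords` is an assumption; the column names `turkish` and `german` follow the question. Each comparison uses the collation of the column it touches, so there is no need to detect the language of the input first.

```sql
SET NAMES utf8;

-- Search both language columns with the same user-supplied keyword.
-- Each comparison is evaluated with the collation of its own column.
SELECT id, turkish, german
FROM keywords
WHERE turkish = 'şeker'
   OR german  = 'şeker';

-- Override the column's collation for a single comparison.
SELECT id, turkish, german
FROM keywords
WHERE german = _utf8 'München' COLLATE utf8_unicode_ci;
```

In application code the literal would normally be a bound parameter of a prepared statement. Note that an explicit COLLATE that differs from the column's own collation generally prevents MySQL from using an index on that column.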

Third, you may want to assign a default collation to each language-specific column. A column's default collation is baked into its indexes, so comparisons that use that collation can be served from the index, which helps accelerate searching.
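
A table definition along these lines is one way to do that. It is a sketch, not the asker's actual schema: the table name, column lengths, and index names are assumptions.

```sql
-- Hypothetical schema: one column per language, each with its own collation.
CREATE TABLE keywords (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    turkish VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_turkish_ci NOT NULL,
    german  VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
    PRIMARY KEY (id),
    KEY idx_turkish (turkish),   -- index ordered by the Turkish collation
    KEY idx_german  (german)     -- index ordered by the Unicode collation
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- For an existing table, change the collation of a single column instead.
ALTER TABLE keywords
    MODIFY german VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL;
```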

Fourth, your users will need to use an appropriate input method (keyboard mapping, etc) to present data to your application. Turkish-language users hopefully know how to type Turkish words.
