Can anyone tell me why 'аnd' == 'and' is false?

Time: 2022-06-06 16:56:37

I tagged this character-encoding and text because I know that if you type 'and' == 'and' into the Rails console, or almost any other programming language, you will get true. However, when one of my users pastes his text into my website, I can't spell-check it properly or verify its originality via Copyscape because of some issue with the text (or maybe with my understanding of text encoding?).

EXAMPLE:

If you copy and paste the following line into the rails console you will get false.

'аnd' == 'and' #=> false

If you copy and paste the following line into the rails console you will get true even though they appear exactly the same in the browser.

'and' == 'and' #=> true

The difference is, in the first example, the first 'аnd' is copied and pasted from my user's text that is causing the issues. All the other instances of 'and' are typed into the browser.

Is this an encoding issue? Any ideas on how to fix my issue?

Thanks!

3 solutions

#1


5  

This isn’t really an encoding problem. In the first case the strings compare as false simply because they are different.

The first character of the first string isn’t a “normal” a. It is actually U+0430 CYRILLIC SMALL LETTER A; the first two bytes (208 and 176, or 0xD0 and 0xB0 in hex) are the UTF-8 encoding of this character. It just happens to look exactly like a “normal” Latin a, which is U+0061 LATIN SMALL LETTER A.

Here’s the “normal” a: a, and this is the Cyrillic a: а. They appear pretty much identical.
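To see the difference without relying on the font, you can compare the two letters directly. This is only a small sketch; the variable names are mine, and it assumes a UTF-8 source file, which is the default in current Ruby versions.

```ruby
latin    = "a"        # U+0061 LATIN SMALL LETTER A
cyrillic = "\u0430"   # U+0430 CYRILLIC SMALL LETTER A

puts latin == cyrillic                # false
puts latin.bytes.inspect              # [97]
puts cyrillic.bytes.inspect           # [208, 176]
puts format("U+%04X", cyrillic.ord)   # U+0430
```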

The fix for this really depends on what you want your application to do. Ideally you would want to handle all languages, and so you might want to just leave it and rely on users to provide reasonable input.

You could replace the character in question with a Latin a using e.g. gsub. The problem with that is that many other characters also look similar to more familiar ones. If you choose this route you would be better off looking for a library/gem that does it for you, and you might find you’re being too strict about conversions.
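A rough sketch of the gsub approach, assuming you only care about a handful of Cyrillic lookalikes. The HOMOGLYPHS table and the replace_lookalikes helper are hypothetical names made up for this example, and the table is nowhere near complete.

```ruby
# Hypothetical, deliberately tiny lookalike table: Cyrillic letter => Latin letter.
HOMOGLYPHS = {
  "\u0430" => "a", # CYRILLIC SMALL LETTER A
  "\u0435" => "e", # CYRILLIC SMALL LETTER IE
  "\u043E" => "o", # CYRILLIC SMALL LETTER O
  "\u0440" => "p", # CYRILLIC SMALL LETTER ER
  "\u0441" => "c", # CYRILLIC SMALL LETTER ES
}.freeze

# One regexp that matches any key of the table.
HOMOGLYPH_PATTERN = Regexp.union(HOMOGLYPHS.keys)

# Replace every listed lookalike with its Latin counterpart.
def replace_lookalikes(text)
  text.gsub(HOMOGLYPH_PATTERN, HOMOGLYPHS)
end

pasted = "\u0430nd"                        # starts with a Cyrillic a
puts replace_lookalikes(pasted) == "and"   # true
```

Note that this would also mangle genuine Russian text, since it rewrites those letters wherever they occur.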

Another option could be to choose a set of Unicode scripts that your application supports and refuse any characters outside those scripts. You can check for this fairly easily with Ruby’s regular expression script support, e.g. /\p{Cyrillic}/ will match any Cyrillic character.
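As a sketch of the script check, assuming the application only accepts Latin-script text plus characters in the Common script (digits, punctuation, spaces). The DISALLOWED constant and the unsupported_chars helper are names invented for this example.

```ruby
# Assumption: only Latin letters and Common characters are acceptable input.
DISALLOWED = /[^\p{Latin}\p{Common}]/

# Return the distinct characters that fall outside the supported scripts.
def unsupported_chars(text)
  text.scan(DISALLOWED).uniq
end

typed  = "and"
pasted = "\u0430nd"   # starts with U+0430 CYRILLIC SMALL LETTER A

puts unsupported_chars(typed).empty?    # true
puts unsupported_chars(pasted).inspect  # lists the Cyrillic a
puts pasted.match?(/\p{Cyrillic}/)      # true
```

You could then reject the submission, or show the user which characters were refused.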

#2


5  

The problem is not with encodings. A single file or a single terminal can only have a single encoding. If you copy and paste both strings into the same source file or the same terminal window, they will get inserted with the same encoding.

The problem is also not with normalization or folding.

The first string has 4 octets: 0xD0 0xB0 0x6E 0x64. The first two octets are a two-octet UTF-8 encoding of a single Unicode code point; the third and fourth octets are one-octet UTF-8 encodings of one code point each.

So, the string consists of three Unicode codepoints: U+0430 U+006E U+0064.

These three codepoints resolve to the following three characters:

  1. CYRILLIC SMALL LETTER A
  2. LATIN SMALL LETTER N
  3. LATIN SMALL LETTER D
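The octet-to-code-point breakdown above can be reproduced in Ruby. This is just an illustration; the variable names are mine.

```ruby
# The four octets of the pasted string, as listed above.
octets = [0xD0, 0xB0, 0x6E, 0x64]

# Build a byte string and tell Ruby to interpret it as UTF-8.
string = octets.pack("C*").force_encoding(Encoding::UTF_8)

puts string.valid_encoding?   # true
puts string.bytesize          # 4
puts string.length            # 3
puts string.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
# U+0430 U+006E U+0064
```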

The second string has 3 octets: 0x61 0x6E 0x64. All three octets are one-octet UTF-8 encodings of Unicode code points.

So, the string consists of three Unicode codepoints: U+0061 U+006E U+0064.

These three codepoints resolve to the following three characters:

  1. LATIN SMALL LETTER A
  2. LATIN SMALL LETTER N
  3. LATIN SMALL LETTER D

Really, there is no problem at all! The two strings are different. With the font you are using, a Cyrillic a looks the same as a Latin a, but as far as Unicode is concerned, they are two different characters. (And in a different font, they might even look different!) There's really nothing you can do from an encoding or Unicode perspective, because the problem is not with encodings or Unicode.

These are called homoglyphs: two characters that are different but have the same (or very similar) glyphs.
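If you want to find out where such characters turn up in user text, one possible heuristic (my assumption, not a standard) is to flag words that mix Latin and Cyrillic letters, since a genuine word is normally written in a single script. The mixed_script_words helper is a made-up name for this sketch.

```ruby
# Heuristic: a word containing both Latin and Cyrillic letters is suspicious.
def mixed_script_words(text)
  text.scan(/\p{L}+/).select do |word|
    word.match?(/\p{Latin}/) && word.match?(/\p{Cyrillic}/)
  end
end

sample = "rock \u0430nd roll"              # the middle word starts with a Cyrillic a
puts mixed_script_words(sample).length     # 1
puts mixed_script_words("rock and roll").empty?   # true
```

This will not catch a word made up entirely of lookalikes from one script, so treat it as a hint rather than a guarantee.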

What you could try to do is transliterate all strings into Latin (provided that you can guarantee that nobody ever wants to enter non-Latin characters), but really, the questions are:

  1. Where does that Cyrillic a come from?
  2. Maybe it was meant to be a Cyrillic a and really should be treated as not equal to a Latin a?

And depending on the answers to those questions, you might either want to fix the source, or just do nothing at all.

This is a very hot topic for browser vendors, BTW, because nowadays someone could register the domain google.com (with one of the letters switched out for a homoglyph) and you wouldn't be able to spot the difference in the address bar. This is called a homograph attack. That's why browsers display the Punycode form of a domain name instead of the Unicode form when the name looks suspicious, for example when it mixes scripts.

#3


1  

I think it is an encoding issue. You can try this:

irb(main):010:0> 'and'.each_byte {|b| puts b}
97
110
100
=> "and"

irb(main):011:0> 'аnd'.each_byte {|b| puts b} #copied and
208
176
110
100
=> "аnd"
