What are Unicode, UTF-8, and UTF-16?

What's the basis for Unicode and why the need for UTF-8 or UTF-16? I have researched this on Google and searched here as well but it's not clear to me.

In VSS when doing a file comparison, sometimes there is a message saying the two files have differing UTF's. Why would this be the case?

Please explain in simple terms.

8 Answers

#1

412

Why do we need Unicode?

In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ in the same document (I hope I didn't break any old browsers).

But for argument's sake, let's say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but it is not fine for Joe the software developer. Approximately half the world uses non-Latin characters, and using ASCII is arguably inconsiderate to these people. On top of that, he is closing off his software to a large and growing economy.

Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence also ASCII. In addition, the vast majority of commonly used characters are representable by only two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set, and as the question asks, I will concentrate on UTF-8 and UTF-16.

Memory considerations

So how many bytes give access to what characters in these encodings?

UTF-8:
- 1 byte: Standard ASCII
- 2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
- 3 bytes: BMP
- 4 bytes: All Unicode characters

UTF-16:
- 2 bytes: BMP
- 4 bytes: All Unicode characters

It's worth mentioning now that characters not in the BMP include ancient scripts, mathematical symbols, musical symbols, and rarer Chinese/Japanese/Korean (CJK) characters.

If you'll be working mostly with ASCII characters, then UTF-8 is certainly more memory efficient. However, if you're working mostly with non-European scripts, UTF-8 text could take up to 1.5 times as much memory as UTF-16 (3 bytes per BMP character instead of 2). When dealing with large amounts of text, such as large web pages or lengthy word-processor documents, this could impact performance.
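
To make the byte counts concrete, here is a minimal Python sketch that measures how many bytes a few sample characters take in each encoding. The sample characters and the helper name `encoded_sizes` are illustrative choices, not part of the original answer, and `utf-16-le` is used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    # "utf-16-le" fixes the byte order, so no BOM is added to the count.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Hypothetical sample set: one character per size class.
samples = {
    "A": "ASCII",
    "\u00e9": "Latin-1 supplement (e with acute)",
    "\u0628": "Arabic (beh)",
    "\u10d0": "Georgian (an)",
    "\u6c49": "CJK ideograph in the BMP",
    "\U0001d11e": "musical symbol outside the BMP",
}

for char, label in samples.items():
    utf8_len, utf16_len = encoded_sizes(char)
    print(f"U+{ord(char):04X} {label}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")

# Mostly-ASCII text is smaller in UTF-8; BMP CJK text is smaller in UTF-16.
english = "hello world" * 100
chinese = "\u6c49\u8bed" * 100
print("English:", encoded_sizes(english))
print("Chinese:", encoded_sizes(chinese))
```

For the CJK sample the UTF-8 size is 1.5 times the UTF-16 size, which is the worst case for BMP text mentioned above.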

Encoding basics

Note: If you know how UTF-8 and UTF-16 are encoded, skip to the next section for practical applications.

UTF-8: For the standard ASCII (0-127) characters, the UTF-8 codes are identical. This makes UTF-8 ideal if backwards compatibility is required with existing ASCII text. Other characters require anywhere from 2 to 4 bytes. This is done by reserving some bits in each of these bytes to indicate that it is part of a multi-byte character. In particular, the first bit of each byte is 1 to avoid clashing with the ASCII characters.

UTF-16: For valid BMP characters, the UTF-16 representation is simply its code point. However, for non-BMP characters UTF-16 introduces surrogate pairs. In this case a combination of two two-byte portions maps to a non-BMP character. These two-byte portions come from the BMP numeric range, but are guaranteed by the Unicode standard to be invalid as BMP characters. In addition, since UTF-16 has two bytes as its basic unit, it is affected by endianness. To compensate, a reserved byte order mark can be placed at the beginning of a data stream to indicate endianness. Thus, if you are reading UTF-16 input and no endianness is specified, you must check for this.
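
The surrogate-pair and byte-order-mark ideas can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the function names `to_surrogate_pair` and `detect_utf16_byte_order` are made up for this example, and the BOM check only looks at the two UTF-16 marks.

```python
import codecs

def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a non-BMP code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000      # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogate_pair(0x1D11E)  # U+1D11E MUSICAL SYMBOL G CLEF
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own big-endian UTF-16 encoder.
assert chr(0x1D11E).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")

def detect_utf16_byte_order(data: bytes) -> str:
    """Report the byte order announced by a leading UTF-16 BOM, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little-endian"
    return "unknown (no BOM)"

print(detect_utf16_byte_order(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))
```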

As can be seen, UTF-8 and UTF-16 are nowhere near compatible with each other. So if you're doing I/O, make sure you know which encoding you are using! For further details on these encodings, please see the UTF FAQ.

Practical programming considerations

Character and String data types: How are they encoded in the programming language? If they are raw bytes, the minute you try to output non-ASCII characters, you may run into a few problems. Also, even if the character type is based on a UTF, that doesn't mean the strings are proper UTF. They may allow byte sequences that are illegal. Generally, you'll have to use a library that supports UTF, such as ICU for C, C++ and Java. In any case, if you want to input/output something other than the default encoding, you will have to convert it first.

Recommended/default/dominant encodings: When given a choice of which UTF to use, it is usually best to follow recommended standards for the environment you are working in. For example, UTF-8 is dominant on the web, and since HTML5, it has been the recommended encoding. Conversely, both .NET and Java environments are founded on a UTF-16 character type. Confusingly (and incorrectly), references are often made to the "Unicode encoding", which usually refers to the dominant UTF encoding in a given environment.

Library support: Which encodings do the libraries you are using support? Do they support the corner cases? Since necessity is the mother of invention, UTF-8 libraries will generally support 4-byte characters properly, since 1-, 2-, and even 3-byte characters occur frequently. However, not all purported UTF-16 libraries support surrogate pairs properly, since they occur very rarely.

Counting characters: There exist combining characters in Unicode. For example, the code points U+006E (n) and U+0303 (a combining tilde) together form ñ, while the single code point U+00F1 also forms ñ. They should look identical, but a simple counting algorithm will return 2 for the first example and 1 for the latter. This isn't necessarily wrong, but may not be the desired outcome either.
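
A short Python sketch of the counting issue, using the standard `unicodedata` module. Normalizing to NFC or NFD before counting or comparing is one common approach, offered here as an assumption about what the desired outcome might be rather than as the only correct answer.

```python
import unicodedata

decomposed = "n\u0303"   # U+006E followed by U+0303 COMBINING TILDE
precomposed = "\u00f1"   # U+00F1 LATIN SMALL LETTER N WITH TILDE

# The code point counts differ, and naive equality fails.
print(len(decomposed), len(precomposed))
print(decomposed == precomposed)

# NFC composes where possible; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)
```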

Comparing for equality: A, А, and Α look the same, but they're Latin, Cyrillic, and Greek respectively. You also have cases like C and Ⅽ, one is a letter, the other a Roman numeral. In addition, we have the combining characters to consider as well. For more info see Duplicate characters in Unicode.
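
The lookalike problem can be inspected in Python with `unicodedata.name`. The sketch below also tries NFKC normalization, which is an assumption about one possible mitigation: it folds the Roman numeral into the plain letter but does not unify letters across scripts.

```python
import unicodedata

# Latin A, Cyrillic A, Greek Alpha, Latin C, Roman numeral one hundred.
lookalikes = ["\u0041", "\u0410", "\u0391", "\u0043", "\u216d"]
for char in lookalikes:
    print(f"U+{ord(char):04X} {unicodedata.name(char)}")

# The letters are distinct code points, so equality fails.
print("\u0041" == "\u0410")

# Compatibility normalization maps the Roman numeral to the letter C,
# but leaves the Cyrillic letter untouched.
print(unicodedata.normalize("NFKC", "\u216d") == "C")
print(unicodedata.normalize("NFKC", "\u0410") == "A")
```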

Surrogate pairs: These come up often enough on SO, so I'll just provide some example links:

- Getting string length
- Removing surrogate pairs
- Palindrome checking

Others?:

#2

Unicode
- is a set of characters used around the world

UTF-8
- a character encoding capable of encoding all possible characters (called code points) in Unicode
- code unit is 8 bits
- uses one to four code units to encode a code point
- 00100100 for "$" (one 8-bit unit); 11000010 10100010 for "¢" (two 8-bit units); 11100010 10000010 10101100 for "€" (three 8-bit units)

UTF-16
- another character encoding
- code unit is 16 bits
- uses one or two code units to encode a code point
- 00000000 00100100 for "$" (one 16-bit unit); 11011000 01010010 11011111 01100010 for "𤭢" (two 16-bit units)
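
The bit patterns listed above can be reproduced with a small Python sketch. The helper names `utf8_bits` and `utf16_bits` are invented for this example, big-endian byte order is assumed for UTF-16, and the two-unit pattern D852 DF62 corresponds to the code point U+24B62.

```python
def utf8_bits(char: str) -> str:
    """Render the UTF-8 encoding of char as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in char.encode("utf-8"))

def utf16_bits(char: str) -> str:
    """Render the big-endian UTF-16 encoding of char as 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in char.encode("utf-16-be"))

for char in ["$", "\u00a2", "\u20ac"]:   # dollar, cent, and euro signs
    print(f"U+{ord(char):04X} UTF-8:  {utf8_bits(char)}")

for char in ["$", "\U00024b62"]:         # dollar sign and a CJK ideograph outside the BMP
    print(f"U+{ord(char):04X} UTF-16: {utf16_bits(char)}")
```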

#3

Unicode is a fairly complex standard. Don’t be too afraid, but be prepared for some work! [2]

Because a credible resource is always needed, but the official report is massive, I suggest reading the following:

- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!): an introduction by Joel Spolsky, Stack Exchange CEO.
- To the BMP and beyond!: a tutorial by Eric Muller, Technical Director then, Vice President later, at The Unicode Consortium (the first 20 slides and you are done).

A brief explanation:

Computers read bytes and people read characters, so we use encoding standards to map characters to bytes. ASCII was the first widely used standard, but it covers only basic Latin (7 bits per character can represent 128 different characters). Unicode is a standard whose goal is to cover all possible characters in the world (it has room for up to 1,114,112 code points, meaning 21 bits per character at most; Unicode 8.0, current at the time of writing, specifies 120,737 characters in total).

The main difference is that an ASCII character fits in a byte (8 bits), but most Unicode characters do not. So encoding forms/schemes (like UTF-8 and UTF-16) are used, and the character model goes like this:

- Every character holds an enumerated position from 0 to 1,114,111 (hex: 0-10FFFF) called a code point.
- An encoding form maps a code point to a code unit sequence. A code unit is the way you want characters to be organized in memory: 8-bit units, 16-bit units and so on. UTF-8 uses 1 to 4 units of 8 bits, and UTF-16 uses 1 or 2 units of 16 bits, to cover the entire Unicode range of 21 bits max. Units use prefixes so that character boundaries can be spotted, and more units mean more prefixes that occupy bits. So, although UTF-8 uses 1 byte for ASCII, it needs 3 bytes for the later scripts inside the Basic Multilingual Plane, while UTF-16 uses 2 bytes for all of these. And that's their main difference.
- Lastly, an encoding scheme (like UTF-16BE or UTF-16LE) maps (serializes) a code unit sequence to a byte sequence.

character: π
code point: U+03C0
encoding forms (code units):
UTF-8: CF 80
UTF-16: 03C0
encoding schemes (bytes):
UTF-8: CF 80
UTF-16BE: 03 C0
UTF-16LE: C0 03
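
The same table can be reproduced in Python as a quick check. This sketch uses the built-in codec names `utf-8`, `utf-16-be` and `utf-16-le`; the last line shows that the plain `utf-16` codec also writes a byte order mark.

```python
char = "\u03c0"   # GREEK SMALL LETTER PI

print(f"code point: U+{ord(char):04X}")
for scheme in ("utf-8", "utf-16-be", "utf-16-le"):
    encoded = char.encode(scheme)
    print(f"{scheme}: {encoded.hex(' ').upper()}")

# Plain "utf-16" prepends a 2-byte BOM, so the result is 4 bytes long.
print(len(char.encode("utf-16")))
```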

Tip: a hex digit represents 4 bits, so a two-digit hex number represents a byte
Also take a look at Plane maps in Wikipedia to get a feeling of the character set layout

#4

Originally, Unicode was intended to have a fixed-width 16-bit encoding (UCS-2). Early adopters of Unicode, like Java and Windows NT, built their libraries around 16-bit strings.

Later, the scope of Unicode was expanded to include historical characters, which would require more than the 65,536 code points a 16-bit encoding would support. To allow the additional characters to be represented on platforms that had used UCS-2, the UTF-16 encoding was introduced. It uses "surrogate pairs" to represent characters in the supplementary planes.

Meanwhile, a lot of older software and network protocols were using 8-bit strings. UTF-8 was made so these systems could support Unicode without having to use wide characters. It's backwards-compatible with 7-bit ASCII.

#5

Why Unicode? Because ASCII has just 128 characters (0 to 127). The values from 128 to 255 mean different things in different countries, which is why code pages exist. So Unicode allows code points up to 1114111 (0x10FFFF).

How do you store the highest code point? It needs 21 bits, so you can use a DWORD: 32 bits with 11 bits wasted. This is the easiest way, because the value in the DWORD matches the code point exactly. That is UTF-32. But DWORD arrays are of course larger than WORD arrays, and larger still than BYTE arrays. That's why there is not only UTF-32, but also UTF-16.

UTF-16 is a stream of WORDs, and a WORD has 16 bits, so how can the highest code point 1114111 fit into a WORD? It cannot! So everything higher than 65535 is put into a DWORD called a surrogate pair. A surrogate pair is two WORDs, each of which can be detected by looking at its first 6 bits.

So what about UTF-8? It is a byte array or byte stream, but how can the highest code point 1114111 fit into a byte? It cannot! Okay, so it also uses a DWORD, right? Or possibly a WORD? Almost right! UTF-8 uses sequences: every code point higher than 127 is encoded as a 2-byte, 3-byte or 4-byte sequence.

How can we detect such sequences? Everything up to 127 is ASCII and is a single byte. A byte that starts with 110 begins a two-byte sequence, one that starts with 1110 begins a three-byte sequence, and one that starts with 11110 begins a four-byte sequence. The remaining bits of these so-called "start bytes" belong to the code point. Depending on the sequence, one to three following bytes must follow. A following byte starts with 10; its remaining 6 bits are payload and belong to the code point. Concatenate the payload bits of the start byte and the following bytes and you have the code point. That's all the magic of UTF-8.
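
The start-byte and following-byte rules described above can be turned into a small hand-written decoder. This is a teaching sketch, not a production decoder: the function name `decode_utf8_sequence` is made up, and it deliberately skips the checks for overlong forms and surrogate values that a real decoder needs.

```python
def decode_utf8_sequence(data: bytes) -> int:
    """Decode the first UTF-8 sequence in data and return its code point."""
    first = data[0]
    if first < 0x80:                 # 0xxxxxxx: single-byte ASCII
        return first
    if first >> 5 == 0b110:          # 110xxxxx: start of a 2-byte sequence
        length, code_point = 2, first & 0x1F
    elif first >> 4 == 0b1110:       # 1110xxxx: start of a 3-byte sequence
        length, code_point = 3, first & 0x0F
    elif first >> 3 == 0b11110:      # 11110xxx: start of a 4-byte sequence
        length, code_point = 4, first & 0x07
    else:
        raise ValueError("not a valid start byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:        # following bytes must look like 10xxxxxx
            raise ValueError("expected a following byte")
        code_point = (code_point << 6) | (byte & 0x3F)   # append 6 payload bits
    return code_point

# Compare the hand-written decoder with Python's own view of each character.
for char in ["A", "\u00f1", "\u20ac", "\U0001d11e"]:
    decoded = decode_utf8_sequence(char.encode("utf-8"))
    print(f"U+{decoded:04X}", decoded == ord(char))
```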

#6

This article explains all the details http://kunststube.net/encoding/

WRITING TO BUFFER

If you write the symbol あ to a 4-byte buffer using UTF-8 encoding, the binary looks like this:

00000000 11100011 10000001 10000010

If you write the symbol あ to a 4-byte buffer using UTF-16 encoding, the binary looks like this:

00000000 00000000 00110000 01000010

As you can see, depending on which language your content uses, the choice of encoding affects your memory use accordingly.

e.g. For this particular symbol, あ, UTF-16 encoding is more efficient, since we have 2 spare bytes to use for the next symbol. But that doesn't mean you must use UTF-16 for Japanese text.

READING FROM BUFFER

Now if you want to read the above bytes, you have to know which encoding they were written in and decode them accordingly.

e.g. If you decode the UTF-8 bytes above (00000000 11100011 10000001 10000010) as UTF-16, you will not get あ back: read as a little-endian 16-bit unit, the bytes 11100011 10000001 give 臣 (U+81E3).

Note: Encoding and Unicode are two different things. Unicode is the big table that maps each symbol to a unique code point, e.g. the symbol (letter) あ has the code point U+3042 (30 42 in hex). An encoding, on the other hand, is an algorithm that converts code points into a form suitable for storage on hardware.

30 42 (hex) -> UTF-8 encoding -> E3 81 82 (hex), which is the binary result shown above.

30 42 (hex) -> UTF-16 encoding -> 30 42 (hex), which is the binary result shown above.
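
The write-then-read round trip, including the mismatch case, can be sketched in Python. The byte order choices here are assumptions made for illustration: the UTF-16 bytes are written big-endian so they match the buffer shown above, and the mis-decode reads the first two UTF-8 bytes as one little-endian unit.

```python
char = "\u3042"                       # HIRAGANA LETTER A

utf8_bytes = char.encode("utf-8")
utf16_bytes = char.encode("utf-16-be")
print(utf8_bytes.hex(" ").upper())    # E3 81 82
print(utf16_bytes.hex(" ").upper())   # 30 42

# Decoding with the encoding that was used for writing round-trips cleanly.
print(utf8_bytes.decode("utf-8") == char)

# Reading the first two UTF-8 bytes as one little-endian UTF-16 unit
# silently produces a different, valid-looking character.
wrong = utf8_bytes[:2].decode("utf-16-le")
print(f"U+{ord(wrong):04X}")
```

Note that the wrong decode raises no error, which is why the encoding has to be known rather than guessed.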

#7

Unicode is a standard that maps the characters of all languages to particular numeric values called code points. It does this so that different encodings can be built on the same set of code points.

UTF-8 and UTF-16 are two such encodings. They take code points as input and encode them using a well-defined formula to produce the encoded string.

Choosing a particular encoding depends upon your requirements. Different encodings have different memory requirements, and depending upon the characters that you will be dealing with, you should choose the encoding which uses the least sequences of bytes to encode those characters.

For more in-depth details about Unicode, UTF-8 and UTF-16, you can check out this article:

What every programmer should know about Unicode

#8

UTF stands for Unicode Transformation Format. In today's world there are scripts written in hundreds of other languages, which are not covered by the basic ASCII used earlier. Hence, UTF came into existence.

UTF-8 is a character encoding whose code unit is 8 bits, while the code unit for UTF-16 is 16 bits.
