What is the difference between UTF-8 and ISO-8859-1?

Time: 2023-01-06 15:48:27

6 answers

#1


242  

UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.


#2


102  

Wikipedia explains both reasonably well: UTF-8 vs Latin-1 (ISO-8859-1). The former is a variable-length encoding, the latter a single-byte, fixed-length encoding. Latin-1 encodes just the first 256 code points of the Unicode character set, whereas UTF-8 can be used to encode all code points. At the physical encoding level, only code points 0 - 127 are encoded identically; code points 128 - 255 differ, becoming 2-byte sequences in UTF-8 whereas they are single bytes in Latin-1.
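To make the byte-level difference concrete, here is a minimal Python sketch. The helper name `encode_both` and the sample characters are my own choices for illustration, not something from the answer above.

```python
# Compare how the same characters are encoded in Latin-1 and in UTF-8.
samples = ["A", "é", "ÿ"]  # U+0041 (ASCII), U+00E9 and U+00FF (128 - 255 range)


def encode_both(ch):
    """Return (latin1_bytes, utf8_bytes) for a single character."""
    return ch.encode("latin-1"), ch.encode("utf-8")


for ch in samples:
    latin1, utf8 = encode_both(ch)
    # ASCII gives the same single byte in both; 128 - 255 gives 1 byte vs 2 bytes.
    print(f"U+{ord(ch):04X}  latin-1={latin1.hex()}  utf-8={utf8.hex()}")
```

For "A" both encodings produce the byte 41; for "é" Latin-1 produces e9 while UTF-8 produces c3 a9.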

#3


38  

UTF

UTF is a family of encoding schemes for Unicode code points. UTF-8 is a flexible, variable-length encoding that uses between 1 and 4 bytes to represent every Unicode code point, U+0000 through U+10FFFF [roughly 1.1 million]. The original design allowed sequences of up to 6 bytes covering 2^31 [roughly 2 billion] values, but RFC 3629 restricted UTF-8 to 4 bytes to match the Unicode range.

Long story short: any character with a code point/ordinal of 127 or below, aka 7-bit-safe ASCII, is represented by the same 1-byte sequence as in most other single-byte encodings. Any character with a code point above 127 is represented by a sequence of two or more bytes, with the particulars of the encoding best explained here.
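A quick way to see where the sequence length changes is to encode the code points on either side of each boundary. This is only a sketch; `utf8_length` is a hypothetical helper name, and the boundary values are the ones defined for UTF-8 in RFC 3629.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses to encode one code point."""
    return len(chr(code_point).encode("utf-8"))


# Last code point of each length class and the first of the next one.
boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

for cp in boundaries:
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```

Code points up to U+007F take 1 byte, up to U+07FF take 2, up to U+FFFF take 3, and everything up to U+10FFFF takes 4.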

ISO-8859

ISO-8859 is a family of single-byte encoding schemes in which each member represents one alphabet in the byte range 128 to 255. These various alphabets are defined as "parts" in the format ISO-8859-n, the most familiar of these likely being ISO-8859-1, aka 'Latin-1'. As with UTF-8, 7-bit-safe ASCII remains unaffected regardless of which part is used.

The drawback to this encoding scheme is its inability to accommodate languages that need more than 128 symbols beyond ASCII, or to safely display more than one family of symbols at one time. As well, ISO-8859 encodings have fallen out of favor with the rise of UTF. The ISO "Working Group" in charge of it disbanded in 2004, leaving maintenance up to its parent subcommittee.
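The "one family of symbols at a time" problem can be sketched in Python by decoding the same byte with several parts of the standard. The byte 0xE9 and the three parts chosen here are just illustrative picks.

```python
raw = b"\xe9"  # one byte from the upper half of the range

# The same byte means a different character in each ISO-8859 part,
# so a reader has to know which part the writer used.
decoded = {
    name: raw.decode(name)
    for name in ("iso8859_1", "iso8859_5", "iso8859_7")  # Latin-1, Cyrillic, Greek
}

for name, ch in decoded.items():
    print(f"{name}: {ch!r} (U+{ord(ch):04X})")
```

Nothing in the byte itself says which alphabet was meant, which is why mixing, say, Cyrillic and Greek in one ISO-8859 file does not work.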

#4


13  

ISO-8859-1 is a legacy standard from back in the 1980s. It can only represent 256 characters, so it is only suitable for some languages of the Western world. Even for many supported languages, some characters are missing. If you create a text file in this encoding and try to copy/paste some Chinese characters into it, you will see weird results. So in other words, don't use it. Unicode has taken over the world and UTF-8 is pretty much the standard these days, unless you have some legacy reasons (like HTTP headers, which need to be compatible with everything).
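What those "weird results" look like depends on the tool, but in Python the failure can be sketched as follows. The sample string and the helper name `to_latin1` are made up for the example.

```python
text = "café 中文"


def to_latin1(s, errors="strict"):
    """Encode to ISO-8859-1; 'strict' raises on characters above U+00FF."""
    return s.encode("latin-1", errors=errors)


try:
    to_latin1(text)
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)

# With errors="replace" every unencodable character silently becomes "?".
lossy = to_latin1(text, errors="replace")
print(lossy)

# UTF-8 has no such limit and round-trips the text unchanged.
print(text.encode("utf-8").decode("utf-8") == text)
```

An editor that saves in Latin-1 typically does the equivalent of the `replace` case, which is how question marks end up in the file.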

#5


3  

My reason for researching this question was to find out in what way they are compatible. Latin1 charset (iso-8859-1) text is 100% compatible to be stored in a utf8 datastore once it is converted. All ascii chars will be stored as a single byte; the extended-ascii chars (128 - 255) will take two bytes each.

Going the other way, from utf8 to Latin1 charset, may or may not work. If there are any chars beyond extended-ascii 255, they cannot be stored in a Latin1 datastore.
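Both directions can be sketched in a few lines of Python. `fits_in_latin1` is a hypothetical helper, and the strings are arbitrary examples; a real datastore will have its own conversion and error handling.

```python
def fits_in_latin1(s):
    """True if every character is in the Latin-1 range U+0000 - U+00FF."""
    return all(ord(ch) <= 0xFF for ch in s)


# Latin-1 -> UTF-8 always works, but the non-ASCII characters grow to 2 bytes.
latin1_bytes = "résumé".encode("latin-1")
as_utf8 = latin1_bytes.decode("latin-1").encode("utf-8")
print(len(latin1_bytes), len(as_utf8))

# UTF-8 -> Latin-1 only works when every character is at or below U+00FF.
for candidate in ("résumé", "price: 5 €"):
    print(repr(candidate), "fits in Latin-1:", fits_in_latin1(candidate))
```

Checking with something like `fits_in_latin1` before converting avoids finding out from an exception or from silently replaced characters.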

#6


0  

From another perspective, files that both UTF-8 and ASCII decoders fail to read because they have a byte 0xc0 in them seem to get read by iso-8859-1 properly. That is because 0xc0 is never valid in UTF-8 and is outside the ASCII range, whereas iso-8859-1 assigns a character to every byte value. The caveat, of course, is that the file shouldn't actually contain UTF-8 encoded text, or it will decode to the wrong characters.
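The behaviour is easy to reproduce in Python. This is a small sketch with made-up sample bytes; `try_decode` is a hypothetical helper that returns None instead of raising.

```python
def try_decode(raw, encoding):
    """Decode raw bytes, returning None if the bytes are invalid for the encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


data = b"caf\xc0"  # contains the byte 0xC0

for encoding in ("ascii", "utf-8", "latin-1"):
    print(encoding, "->", try_decode(data, encoding))

# The caveat: real UTF-8 text also decodes "successfully" as Latin-1,
# but each multibyte sequence turns into several wrong characters.
mojibake = "é".encode("utf-8").decode("latin-1")
print(repr(mojibake))
```

So a Latin-1 decoder never fails, which makes it a tempting fallback, but success does not prove the file was Latin-1 in the first place.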
