Character Encoding in .NET

时间:2021-12-29 08:06:24

https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding#Encodings

Characters are abstract entities that can be represented in many different ways. A character encoding is a system that pairs each character in a supported character set with some value that represents that character. For example, Morse code is a character encoding that pairs each character in the Roman alphabet with a pattern of dots and dashes that are suitable for transmission over telegraph电报 lines. A character encoding for computers pairs each character in a supported character set with a numeric value that represents that character. A character encoding has two distinct components:

  • An encoder, which translates a sequence of characters into a sequence of numeric values (bytes).

  • A decoder, which translates a sequence of bytes into a sequence of characters.

Character encoding describes the rules by which an encoder and a decoder operate.

For example, the UTF8Encoding class describes the rules for encoding to, and decoding from, 8-bit Unicode Transformation Format (UTF-8), which uses one to four bytes to represent a single Unicode character. Encoding and decoding can also include validation.

For example, the UnicodeEncoding class checks all surrogates to make sure they constitute valid surrogate pairs. (A surrogate pair consists of a character with a code point that ranges from U+D800 to U+DBFF followed by a character with a code point that ranges from U+DC00 to U+DFFF.) A fallback strategy determines how an encoder handles invalid characters or how a decoder handles invalid bytes.

.NET uses the UTF-16 encoding (represented by the UnicodeEncoding class) to represent characters and strings. Applications that target the common language runtime use encoders to map Unicode character representations supported by the common language runtime to other encoding schemes. They use decoders to map characters from non-Unicode encodings to Unicode.

This topic consists of the following sections:

Encodings in .NET

All character encoding classes in .NET inherit from the System.Text.Encoding class, which is an abstract class that defines the functionality common to all character encodings. To access the individual encoding objects implemented in .NET, do the following:

  • Use the static properties of the Encoding class, which return objects that represent the standard character encodings available in .NET (ASCII, UTF-7, UTF-8, UTF-16, and UTF-32). For example, the Encoding.Unicode property returns a UnicodeEncoding object. Each object uses replacement fallback to handle strings that it cannot encode and bytes that it cannot decode. (For more information, see the Replacement Fallback section.)

  • Call the encoding's class constructor. Objects for the ASCII, UTF-7, UTF-8, UTF-16, and UTF-32 encodings can be instantiated in this way. By default, each object uses replacement fallback to handle strings that it cannot encode and bytes that it cannot decode, but you can specify that an exception should be thrown instead. (For more information, see the Replacement Fallback and Exception Fallback sections.)

  • Call the Encoding.Encoding(Int32) constructor and pass it an integer that represents the encoding. Standard encoding objects use replacement fallback, and code page and double-byte character set (DBCS) encoding objects use best-fit fallback to handle strings that they cannot encode and bytes that they cannot decode. (For more information, see the Best-Fit Fallback section.)

  • Call the Encoding.GetEncoding method, which returns any standard, code page, or DBCS encoding available in .NET. Overloads let you specify a fallback object for both the encoder and the decoder.

You can retrieve information about all the encodings available in .NET by calling the Encoding.GetEncodings method. .NET supports the character encoding systems listed in the following table.

Encoding Class Description Advantages/disadvantages
ASCII ASCIIEncoding Encodes a limited range of characters by using the lower seven bits of a byte. Because this encoding only supports character values from U+0000 through U+007F, in most cases it is inadequate for internationalized applications.
UTF-7 UTF7Encoding Represents characters as sequences of 7-bit ASCII characters. Non-ASCII Unicode characters are represented by an escape sequence of ASCII characters. UTF-7 supports protocols such as email and newsgroup protocols. However, UTF-7 is not particularly secure or robust. In some cases, changing one bit can radically alter the interpretation of an entire UTF-7 string. In other cases, different UTF-7 strings can encode the same text. For sequences that include non-ASCII characters, UTF-7 requires more space than UTF-8, and encoding/decoding is slower. Consequently, you should use UTF-8 instead of UTF-7 if possible.
UTF-8 UTF8Encoding Represents each Unicode code point as a sequence of one to four bytes. UTF-8 supports 8-bit data sizes and works well with many existing operating systems. For the ASCII range of characters, UTF-8 is identical to ASCII encoding and allows a broader set of characters. However, for Chinese-Japanese-Korean (CJK) scripts, UTF-8 can require three bytes for each character, and can potentially cause larger data sizes than UTF-16. Note that sometimes the amount of ASCII data, such as HTML tags, justifies the increased size for the CJK range.
UTF-16 UnicodeEncoding Represents each Unicode code point as a sequence of one or two 16-bit integers. Most common Unicode characters require only one UTF-16 code point, although Unicode supplementary characters (U+10000 and greater) require two UTF-16 surrogate code points. Both little-endian and big-endian byte orders are supported. UTF-16 encoding is used by the common language runtime to represent Char and String values, and it is used by the Windows operating system to represent WCHAR values.
UTF-32 UTF32Encoding Represents each Unicode code point as a 32-bit integer. Both little-endian and big-endian byte orders are supported. UTF-32 encoding is used when applications want to avoid the surrogate code point behavior of UTF-16 encoding on operating systems for which encoded space is too important. Single glyphs rendered on a display can still be encoded with more than one UTF-32 character.
ANSI/ISO encodings   Provides support for a variety of code pages. On Windows operating systems, code pages are used to support a specific language or group of languages. For a table that lists the code pages supported by .NET, see the Encoding class. You can retrieve an encoding object for a particular code page by calling the Encoding.GetEncoding(Int32) method. A code page contains 256 code points and is zero-based. In most code pages, code points 0 through 127 represent the ASCII character set, and code points 128 through 255 differ significantly between code pages. For example, code page 1252 provides the characters for Latin writing systems, including English, German, and French. The last 128 code points in code page 1252 contain the accent characters. Code page 1253 provides character codes that are required in the Greek writing system. The last 128 code points in code page 1253 contain the Greek characters. As a result, an application that relies on ANSI code pages cannot store Greek and German in the same text stream unless it includes an identifier that indicates the referenced code page.
Double-byte character set (DBCS) encodings   Supports languages, such as Chinese, Japanese, and Korean, that contain more than 256 characters. In a DBCS, a pair of code points (a double byte) represents each character. The Encoding.IsSingleByte property returns false for DBCS encodings. You can retrieve an encoding object for a particular DBCS by calling the Encoding.GetEncoding(Int32) method. In a DBCS, a pair of code points (a double byte) represents each character. When an application handles DBCS data, the first byte of a DBCS character (the lead byte) is processed in combination with the trail byte that immediately follows it. Because a single pair of double-byte code points can represent different characters depending on the code page, this scheme still does not allow for the combination of two languages, such as Japanese and Chinese, in the same data stream.

These encodings enable you to work with Unicode characters as well as with encodings that are most commonly used in legacy applications. In addition, you can create a custom encoding by defining a class that derives from Encoding and overriding its members.

Character Encoding in .NET的更多相关文章

  1. 使用英文版eclipse保存代码,出现some characters cannot be mapped using "Cp1251" character encoding.

    some characters cannot be mapped using "Cp1251" character encoding. 解决办法:方案一: eclipse-> ...

  2. Character Encoding tomcat.

    default character encoding of the request or response body: If a character encoding is not specified ...

  3. A - Character Encoding HDU - 6397 - 方程整数解-容斥原理

    A - Character Encoding HDU - 6397 思路 : 隔板法就是在n个元素间的(n-1)个空中插入k-1个板,可以把n个元素分成k组的方法 普通隔板法 求方程 x+y+z=10 ...

  4. java.sql.SQLException: Unsupported character encoding 'utf8mb4'.

    四月 12, 2017 3:47:52 下午 org.apache.catalina.core.StandardWrapperValve invoke 严重: Servlet.service() fo ...

  5. hdu6397 Character Encoding 隔板法+容斥原理+线性逆元方程

    题目传送门 题意:给出n,m,k,用m个0到n-1的数字凑出k,问方案数,mod一个值. 题目思路: 首先如果去掉数字范围的限制,那么就是隔板法,先复习一下隔板法. ①k个相同的小球放入m个不同的盒子 ...

  6. some characters cannot be mapped using iso-8859-1 character encoding

    Eclipse中新建一个.properties文件,如果输入中文保存时就会提示错误 Reason:some characters cannot be mapped using "ISO-88 ...

  7. hdu 6397 Character Encoding (生成函数)

    Problem Description In computer science, a character is a letter, a digit, a punctuation mark or som ...

  8. zend studio打开文件提示unsupported character encoding

    zend studio打开文件提示unsupported character encoding,是文件的编码方式错误. 有可能是PHP代码中,charset={CHARSET} ,用了变量的形式调用编 ...

  9. 使用tomcat运行时提示some characters cannot be mapped using iso-8859-1 character encoding异常

    今天第一次使用java进行jsp项目搭建,也是第一次使用tomcat.tomcat是运行java web的一个小型服务器,属于Apache的一个开源免费的服务. 在运行web 的时候,我们就要先配置好 ...

随机推荐

  1. Chrome console命令整理

    console.dir (这个方法是我经常使用的 可不知道比for in方便了多少) 直接将该DOM结点以DOM树的结构进行输出,可以详细查对象的方法发展等等 在页面右击选择 审查元素 ,然后在弹出来 ...

  2. 第五百八十四天 how can I 坚持

    好累啊,不知道怎么搞的,也没干啥啊,学习学的..不会吧. 哎. 吃粉蒸吃的都快吃吐了,得休息段时间,哈哈 没想象中的那么简单啊,太难了,哎,怎么考,压力啊.哎,也没啥. 洗澡睡觉,先休息.

  3. servlet&jsp高级:第二部分

    声明:原创作品,转载时请注明文章来自SAP师太技术博客( 博/客/园www.cnblogs.com):www.cnblogs.com/jiangzhengjun,并以超链接形式标明文章原始出处,否则将 ...

  4. 水灾(sliker.cpp/c/pas) 1000MS 64MB

    大雨应经下了几天雨,却还是没有停的样子.土豪CCY刚从外地赚完1e元回来,知道不久除了自己别墅,其他的地方都将会被洪水淹没. CCY所在的城市可以用一个N*M(N,M<=50)的地图表示,地图上 ...

  5. logstash &lpar;&quest;m&rpar; 经典例子

    在和 codec/multiline 搭配使用的时候,需要注意一个问题,grok 正则和普通正则一样,默认是不支持匹配回车换行的.就像你需要 =~ //m 一样也需要单独指定,具体写法是在表达式开始位 ...

  6. Appium-Python-Windows环境搭建笔记

    Appium版本:1.11.0 操作系统:Windows7-64位 开发语言:Python 3.7.2 测试应用平台:安卓 5.1.1 Appium服务端 一.JDK 也许你会觉得很奇怪,我搭建Pyt ...

  7. C&num; 线程获取&sol;设置控件&lpar;TextBox&rpar;值

    线程读写控件需要用委托(delegate)与Invoke/BeginInvoke来进行 参考内容:http://www.cnblogs.com/runner/archive/2011/12/30/23 ...

  8. JEECG平台权限设计

    JEECG平台权限设计 链接存放位置:https://github.com/PlayTaoist/jeecg-lession/tree/master/%E6%9D%83%E9%99%90%E7%AE% ...

  9. 网易免费企业邮箱Foxmail设置方法

    网易免费企业邮箱Foxmail7.0设置方法 第一步:启动 Foxmail 邮件客户端,点击工具->账号管理,弹出如下页面. 点击新建,如下: 填写自己企业邮箱账号,然后下一步,邮箱类型选择PO ...

  10. CSS居中问题:块级元素和行级元素在水平方向以及垂直方向的居中问题

    元素的居中问题是每个初学者碰到的第一个大问题,在此我总结了下各种块级 行级 水平 垂直 的居中方法,并尽量给出代码实例. 首先请先明白块级元素和行级元素的区别 块级元素 块级元素水平居中 1:marg ...