How can I replace non-SGML characters in a string with PHP?

Time: 2022-09-13 16:02:44

I programmed a guestbook using PHP4 and HTML 4.01 (with the charset ISO-8859-15, i.e. latin-9). The data is saved in a MySQL database with the charset ISO-8859-1 (i.e. latin-1).

When somebody enters characters from a different charset, it seems that the browsers send the data encoded (actually I have not checked where it gets encoded, ...).

Anyway, in some cases it seems that characters are not saved encoded in the database. Thus, the validator returns an error message when I show the data within an HTML 4.01 document:

non SGML character number 146

You have used an illegal character in your text. HTML uses the standard UNICODE Consortium character repertoire, and it leaves undefined (among others) 65 character codes (0 to 31 inclusive and 127 to 159 inclusive) that are sometimes used for typographical quote marks and similar in proprietary character sets. The validator has found one of these undefined characters in your document. The character may appear on your browser as a curly quote, or a trademark symbol, or some other fancy glyph; on a different computer, however, it will likely appear as a completely different character, or nothing at all.

Your best bet is to replace the character with the nearest equivalent ASCII character, or to use an appropriate character entity. For more information on Character Encoding on the web, see Alan Flavell's excellent HTML Character Set Issues reference.

This error can also be triggered by formatting characters embedded in documents by some word processors. If you use a word processor to edit your HTML documents, be sure to use the "Save as ASCII" or similar command to save the document without formatting information.

I'm now using PHP 5.2.17 and played a bit with htmlspecialchars, but nothing worked. How can I encode those characters so that there are no more validation errors?

2 solutions

#1


3  

In both ISO-8859-1 and ISO-8859-15, character number 146 (0x92) is the control character PU2 (Private Use Two) from the C1 range.

SGML refers to ISO 8859-1 (mind the space between ISO and 8859-1; the character sets you use are written with a hyphen there). It allows only three control characters (here: SGML in HTML):

In the HTML document character set only three control characters are allowed: Horizontal Tab, Carriage Return, and Line Feed (code positions 9, 13, and 10).

You therefore did pass an illegal character. There is no SGML/HTML entity you could replace it with.

I suggest you validate the input that comes into your application so that control characters are not allowed. If you believe those characters originally represented something useful, like a letter that can actually be read (i.e. not a control character), the encoding is likely being broken at some point while you process the data.
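
The input check suggested above could look roughly like the following. This is a minimal sketch, assuming the text is handled as a single-byte (ISO-8859-1/-15) string; the function names `containsForbiddenControls` and `stripForbiddenControls` are made up for illustration and are not part of PHP.

```php
<?php
// Hypothetical helper: true if the string contains a control character that
// the HTML document character set does not allow.
// Allowed controls are TAB (0x09), LF (0x0A) and CR (0x0D) only.
function containsForbiddenControls($text)
{
    // C0 controls except TAB/LF/CR, then DEL (0x7F) and the C1 range (0x80-0x9F).
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/', $text) === 1;
}

// Hypothetical helper: removes those bytes instead of rejecting the input.
function stripForbiddenControls($text)
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/', '', $text);
}

var_dump(containsForbiddenControls("it\x92s"));          // byte 146 is in the C1 range
var_dump(containsForbiddenControls("line1\r\n\tline2")); // only allowed controls
```

Stripping is the blunt option: if the bytes are really Windows-1252 punctuation, mapping them is usually kinder to the visitor's text than dropping them.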

From the information given in your question it's hard to say where, because you only specify the input encoding and the encoding of the database field, and those two already don't match (which should not produce the issue you're asking about, but can produce other issues). Besides those two places, there are also the database client connection charset, the output encoding and the response content encoding, all of which are unspecified in your question.
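
One way to narrow down where the bytes change is to dump the suspicious ones at each stage (the raw form value, the value read back from the database, the final output) and compare. The helper below is only a sketch for that kind of inspection; `findC1Bytes` is a hypothetical name, and it assumes a single-byte encoding.

```php
<?php
// Hypothetical diagnostic helper: returns offset => hex code for every byte
// in the 0x80-0x9F range of a single-byte string.
function findC1Bytes($text)
{
    $found = array();
    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        $code = ord($text[$i]);
        if ($code >= 0x80 && $code <= 0x9F) {
            $found[$i] = sprintf('0x%02X', $code);
        }
    }
    return $found;
}

// Example: the third byte (offset 2) is 0x92, i.e. decimal 146.
print_r(findC1Bytes("it\x92s"));
```

If the form value already contains 0x92, the browser sent Windows-1252 data; if it only appears after the database round trip, the connection or column charset is the more likely culprit.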

It might make sense to change your overall encoding to UTF-8 to support a wider range of characters, but that is only a might.
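
If you do move to UTF-8, the existing rows have to be converted once. The following is a minimal sketch, assuming the stored bytes are in fact Windows-1252 and that PHP's iconv extension is available; `cp1252ToUtf8` is a name invented for this example.

```php
<?php
// Hypothetical helper: converts a Windows-1252 byte string to UTF-8.
// Assumption: the input really is Windows-1252. Bytes that Windows-1252
// leaves undefined (such as 0x81) may make iconv() return false.
function cp1252ToUtf8($text)
{
    return iconv('Windows-1252', 'UTF-8', $text);
}

$utf8 = cp1252ToUtf8("it\x92s");
echo bin2hex($utf8), "\n"; // 6974e2809973 (0x92 became the three bytes of U+2019)
```

The page, the database connection and the columns would all need to declare UTF-8 as well, otherwise the mismatch just moves somewhere else.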

Edit: The part above is a somewhat strict view. It occurred to me that the input you receive may not actually be ISO-8859-1(5) but something else, like a Windows code page. I'd guess it's Windows-1252 (cp1252; see Wikipedia). In the positions of the ISO-8859-1 C1 range (128-159) it has several non-control characters.

The Wikipedia page also notes that most browsers treat ISO-8859-1 as Windows-1252/CP1252/CP-1252. The PHP htmlentities() function cannot deal with these characters, because its translation table for HTML entities does not cover these code points (PHP 5.3, not tested against 5.4). You need to create your own translation table and use it with strtr to replace the Windows-1252 characters that are not available in ISO 8859-15:

/*
 * mappings of Windows-1252 (cp1252)  128 (0x80) - 159 (0x9F) characters:
 * @link http://en.wikipedia.org/wiki/Windows-1252
 * @link http://www.w3.org/TR/html4/sgml/entities.html
 */
$cp1252HTML401Entities = array(
    "\x80" => '&euro;',    # 128 -> euro sign, U+20AC NEW
    "\x82" => '&sbquo;',   # 130 -> single low-9 quotation mark, U+201A NEW
    "\x83" => '&fnof;',    # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
    "\x84" => '&bdquo;',   # 132 -> double low-9 quotation mark, U+201E NEW
    "\x85" => '&hellip;',  # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
    "\x86" => '&dagger;',  # 134 -> dagger, U+2020 ISOpub
    "\x87" => '&Dagger;',  # 135 -> double dagger, U+2021 ISOpub
    "\x88" => '&circ;',    # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
    "\x89" => '&permil;',  # 137 -> per mille sign, U+2030 ISOtech
    "\x8A" => '&Scaron;',  # 138 -> latin capital letter S with caron, U+0160 ISOlat2
    "\x8B" => '&lsaquo;',  # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
    "\x8C" => '&OElig;',   # 140 -> latin capital ligature OE, U+0152 ISOlat2
    "\x8E" => '&#381;',    # 142 -> U+017D
    "\x91" => '&lsquo;',   # 145 -> left single quotation mark, U+2018 ISOnum
    "\x92" => '&rsquo;',   # 146 -> right single quotation mark, U+2019 ISOnum
    "\x93" => '&ldquo;',   # 147 -> left double quotation mark, U+201C ISOnum
    "\x94" => '&rdquo;',   # 148 -> right double quotation mark, U+201D ISOnum
    "\x95" => '&bull;',    # 149 -> bullet = black small circle, U+2022 ISOpub
    "\x96" => '&ndash;',   # 150 -> en dash, U+2013 ISOpub
    "\x97" => '&mdash;',   # 151 -> em dash, U+2014 ISOpub
    "\x98" => '&tilde;',   # 152 -> small tilde, U+02DC ISOdia
    "\x99" => '&trade;',   # 153 -> trade mark sign, U+2122 ISOnum
    "\x9A" => '&scaron;',  # 154 -> latin small letter s with caron, U+0161 ISOlat2
    "\x9B" => '&rsaquo;',  # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
    "\x9C" => '&oelig;',   # 156 -> latin small ligature oe, U+0153 ISOlat2
    "\x9E" => '&#382;',    # 158 -> U+017E
    "\x9F" => '&Yuml;',    # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
);

$outputWithEntities = strtr($output, $cp1252HTML401Entities);

If you want to be even safer, you can skip the named entities and use only numeric ones, which should work in very old browsers as well:

$cp1252HTMLNumericEntities = array(
    "\x80" => '&#8364;',   # 128 -> euro sign, U+20AC NEW
    "\x82" => '&#8218;',   # 130 -> single low-9 quotation mark, U+201A NEW
    "\x83" => '&#402;',    # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
    "\x84" => '&#8222;',   # 132 -> double low-9 quotation mark, U+201E NEW
    "\x85" => '&#8230;',   # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
    "\x86" => '&#8224;',   # 134 -> dagger, U+2020 ISOpub
    "\x87" => '&#8225;',   # 135 -> double dagger, U+2021 ISOpub
    "\x88" => '&#710;',    # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
    "\x89" => '&#8240;',   # 137 -> per mille sign, U+2030 ISOtech
    "\x8A" => '&#352;',    # 138 -> latin capital letter S with caron, U+0160 ISOlat2
    "\x8B" => '&#8249;',   # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
    "\x8C" => '&#338;',    # 140 -> latin capital ligature OE, U+0152 ISOlat2
    "\x8E" => '&#381;',    # 142 -> U+017D
    "\x91" => '&#8216;',   # 145 -> left single quotation mark, U+2018 ISOnum
    "\x92" => '&#8217;',   # 146 -> right single quotation mark, U+2019 ISOnum
    "\x93" => '&#8220;',   # 147 -> left double quotation mark, U+201C ISOnum
    "\x94" => '&#8221;',   # 148 -> right double quotation mark, U+201D ISOnum
    "\x95" => '&#8226;',   # 149 -> bullet = black small circle, U+2022 ISOpub
    "\x96" => '&#8211;',   # 150 -> en dash, U+2013 ISOpub
    "\x97" => '&#8212;',   # 151 -> em dash, U+2014 ISOpub
    "\x98" => '&#732;',    # 152 -> small tilde, U+02DC ISOdia
    "\x99" => '&#8482;',   # 153 -> trade mark sign, U+2122 ISOnum
    "\x9A" => '&#353;',    # 154 -> latin small letter s with caron, U+0161 ISOlat2
    "\x9B" => '&#8250;',   # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
    "\x9C" => '&#339;',    # 156 -> latin small ligature oe, U+0153 ISOlat2
    "\x9E" => '&#382;',    # 158 -> U+017E
    "\x9F" => '&#376;',    # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
);
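
When such a table is combined with the usual HTML escaping, the order of the two steps matters. The sketch below is an assumption about how the pieces might be put together, not code from the answer: `escapeForLatinHtml` is a hypothetical helper, and the table is a three-entry excerpt in the style of the numeric table to keep the example short.

```php
<?php
// Three entries in the style of the numeric table; a real table would list
// all mapped Windows-1252 bytes.
$cp1252Excerpt = array(
    "\x80" => '&#8364;', // euro sign
    "\x92" => '&#8217;', // right single quotation mark
    "\x96" => '&#8211;', // en dash
);

// Hypothetical helper: escape markup first, then map the cp1252 bytes.
// Doing it the other way round would turn the "&" of every freshly inserted
// reference into "&amp;".
function escapeForLatinHtml($text, array $table)
{
    // With a single-byte charset, htmlspecialchars() leaves high bytes untouched.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'ISO-8859-1');
    return strtr($escaped, $table);
}

echo escapeForLatinHtml("it\x92s <b>5\x80</b>", $cp1252Excerpt), "\n";
// it&#8217;s &lt;b&gt;5&#8364;&lt;/b&gt;
```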

Hope this is more helpful now. See also the Wikipedia page linked above for some characters that exist in both windows-1252 and ISO 8859-15 but at different code points. You should probably consider using UTF-8 on your website.

#2


2  

A web page that has a text input field should be UTF-8 encoded, because this is the only way to ensure that all characters entered by the user will be correctly transmitted. How you deal with them server-side (e.g., rejecting characters outside some specific range) is a different issue.

If you use some other encoding and the user enters a character that has no representation in that encoding, this is an error condition that browsers may handle in any way they like. Modern browsers do something that is very odd in principle though useful in practice: they represent the characters as character references, like &#8217; for the right single quote (’). In this case, the data received is the same as if the user had actually typed the characters &#8217; (but this is so theoretical that browser vendors apparently ignore the problem).
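
On the server, such a substituted reference arrives as plain ASCII text, so what the visitor later sees depends on how the output is escaped. The sketch below illustrates this with the fourth parameter of htmlspecialchars (double_encode, available since PHP 5.2.3); the sample string is made up.

```php
<?php
// What the server receives when the browser replaced an unrepresentable
// right single quote with a character reference: ordinary ASCII characters.
$received = 'it&#8217;s';

// Default behaviour: the ampersand is escaped, so the page shows the raw
// reference text instead of the quote.
$doubleEncoded = htmlspecialchars($received, ENT_QUOTES, 'ISO-8859-1');

// With double_encode set to false, existing references are left alone.
$preserved = htmlspecialchars($received, ENT_QUOTES, 'ISO-8859-1', false);

echo $doubleEncoded, "\n"; // it&amp;#8217;s
echo $preserved, "\n";     // it&#8217;s
```

The second variant cannot tell a browser-generated reference from one the visitor typed by hand, which is exactly the ambiguity described above; it is a trade-off rather than a clean fix.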

What happens server-side in your case is unclear, but it may involve many types of processing. In any case, you cannot in general store ISO-8859-15 data in ISO-8859-1 encoding (ISO-8859-15 was designed to replace some characters in ISO-8859-1 with other characters). It is unclear what your software does with character references like &#8217;. It would be slightly odd, though surely possible, for software to replace them with character references like &#146; (which are based on using windows-1252 as the document character set, contrary to HTML rules; they are technically undefined, not illegal, in HTML, but so widely supported by browsers that HTML5 turns this into a rule).
