Which control characters are valid in HTML / XHTML forms?

I'm trying to create a form validation unit that, in addition to the "regular" tests, checks the encoding as well.

According to this article, http://www.w3.org/International/questions/qa-forms-utf-8, the only characters allowed in the range 0-31 are CR, LF and TAB, and DEL (127) is not allowed.

On the other hand, there are control characters in the range 0x80-0x9F (the C1 range). Some sources say they are allowed and others say they are not. I have also seen that the rules differ between XHTML, HTML and XML.

Some articles say that FF is allowed as well. Is that right?

Can someone provide a good answer, with sources, on what is allowed and what isn't?

EDIT: Even at http://www.w3.org/International/questions/qa-controls there is some ambiguity:

The C1 range is supported

But the table shows that they are illegal, while the UTF-8 validation shown earlier allows them?

8 answers

#1

The Unicode characters in these ranges are valid in HTML 4.01:

0x09..0x0A
0x0D
0x20..0x7E
0x00A0..0xD7FF
0xE000..0x10FFFF
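
As a rough illustration, the ranges listed above can be turned into a membership test. This is only a sketch: the names `VALID_RANGES`, `is_valid_code_point` and `invalid_positions` are made up for this example, and the table simply copies the list above rather than any official API.

```python
from typing import List

# Inclusive code point ranges, copied from the list above.
VALID_RANGES = [
    (0x09, 0x0A),         # TAB, LF
    (0x0D, 0x0D),         # CR
    (0x20, 0x7E),         # printable ASCII
    (0x00A0, 0xD7FF),     # from NBSP up to just before the surrogates
    (0xE000, 0x10FFFF),   # everything after the surrogates
]


def is_valid_code_point(cp: int) -> bool:
    """Return True if the code point falls in one of the listed ranges."""
    return any(lo <= cp <= hi for lo, hi in VALID_RANGES)


def invalid_positions(text: str) -> List[int]:
    """Return the indexes of characters that fall outside the listed ranges."""
    return [i for i, ch in enumerate(text) if not is_valid_code_point(ord(ch))]
```

With this table, NUL, DEL (0x7F) and the C1 controls (0x80-0x9F) are all reported, while TAB, LF and CR pass.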

In XHTML 1.0... it's unclear. See http://cmsmcq.com/2007/C1.xml#o127626258

#2

I think you're looking at this the wrong way around. The resources you link specify what encoded values are valid in (X)HTML, but it sounds like you want to validate the "response" from a web form, that is, the values of the various form controls as passed back to your server. In that case, you shouldn't be looking at what's valid in (X)HTML, but at what's valid in the application/x-www-form-urlencoded, and possibly also multipart/form-data, MIME types. The HTML 4.01 standard's section on <FORM> elements clearly states that for application/x-www-form-urlencoded, "Non-alphanumeric characters are replaced by '%HH'":

This is the default content type. Forms submitted with this content type must be encoded as follows:

Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').

The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
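
To see what the quoted rules look like on the wire, here is a small sketch using Python's standard `urllib.parse` module. The wrapper names `encode_form` and `decode_form` are invented for this example, and the output reflects Python's implementation, which may differ in details from what a particular browser sends.

```python
from typing import List, Tuple
from urllib.parse import parse_qsl, urlencode

Fields = List[Tuple[str, str]]


def encode_form(fields: Fields) -> str:
    """Encode name/value pairs as application/x-www-form-urlencoded.

    urlencode() turns spaces into '+' and other reserved bytes into %HH.
    """
    return urlencode(fields)


def decode_form(body: str) -> Fields:
    """Decode a urlencoded body back into name/value pairs, keeping order."""
    return parse_qsl(body, keep_blank_values=True)
```

For example, `encode_form([("name", "a b"), ("note", "line1\r\nline2")])` gives `name=a+b&note=line1%0D%0Aline2`, so the line break arrives as the `%0D%0A` pair the standard describes.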

As for which character encoding the data is in (i.e. whether %A0 is a non-breaking space or an error), that's negotiated by the accept-charset attribute on your <FORM> element and the Content-Type header of the response (well, really a GET or POST request).
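
The %A0 case can be made concrete with a short sketch. It assumes you already know which charset the form was submitted in; the helper name `decode_value` is hypothetical, and it deliberately ignores the '+'-for-space rule to keep the point about charsets visible.

```python
from urllib.parse import unquote_to_bytes


def decode_value(raw: str, charset: str) -> str:
    """Percent-decode a value to bytes, then decode strictly in the charset.

    Raises UnicodeDecodeError if the bytes are not valid in that charset.
    """
    return unquote_to_bytes(raw).decode(charset, errors="strict")
```

Under ISO-8859-1, `decode_value("%A0", "iso-8859-1")` yields U+00A0, the non-breaking space. Under UTF-8 the same input raises `UnicodeDecodeError`, because a lone 0xA0 byte is not a complete UTF-8 sequence; the UTF-8 form of the same character is `%C2%A0`.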

#3

Postel's Law: Be conservative in what you do; be liberal in what you accept from others.

If you're generating documents for others to read, you should avoid/escape all control characters, even if they're technically legal. And if you're parsing documents, you should endeavor to accept all control characters even if they're technically illegal.

#4

First of all, any octet is valid. The mentioned regular expression for UTF-8 sequences just omits some of them because users rarely enter them in practice. But that doesn’t mean that they are invalid. They are just not expected to occur.

#5

The first link you mention does not have anything to do with validating the allowed characters in XHTML... the example on that link simply shows a common, generic pattern for detecting whether raw data is UTF-8 encoded.
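
For illustration, a byte-level pattern of that kind might look like the sketch below. It is written from the general structure of UTF-8 and is an assumption, not a verbatim copy of the article's regular expression; the names `UTF8_TEXT` and `looks_like_utf8_text` are made up here.

```python
import re

# Matches a byte string made only of well-formed UTF-8 sequences, with the
# ASCII part limited to TAB, LF, CR and printable characters.
UTF8_TEXT = re.compile(rb"""
    (?:
        [\x09\x0A\x0D\x20-\x7E]             # ASCII: TAB, LF, CR, printable
      | [\xC2-\xDF][\x80-\xBF]              # 2-byte sequences, no overlongs
      | \xE0[\xA0-\xBF][\x80-\xBF]          # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]          # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
    )*
""", re.VERBOSE)


def looks_like_utf8_text(data: bytes) -> bool:
    """Return True if every byte of data is covered by the pattern."""
    return UTF8_TEXT.fullmatch(data) is not None
```

Note what a pattern like this does with control characters: C0 controls other than TAB, LF and CR, and DEL, are rejected by the first branch, but a C1 control such as U+0085 is encoded as the two bytes C2 85 and passes through the 2-byte branch. That is one plausible source of the apparent contradiction in the question.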

This is a quote from the second link:

HTML, XHTML and XML 1.0 do not support the C0 range, except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A, and CR (Carriage Return) U+000D. The C1 range is supported, i.e. you can encode the controls directly or represent them as NCRs (Numeric Character References).

The way I read this is:

Any control character in the C1 range is supported if you encode them (using base64, or Hex representations) or represent them as NCRs.

Only U+0009, U+000A, and U+000D are supported in the C0 range. No other control code in that range can be represented.
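
One way to check the NCR part of this reading for the XML case is to ask an XML 1.0 parser directly. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `ncr_is_accepted` is invented for the example, and it says nothing about how HTML parsers in browsers treat the same references.

```python
import xml.etree.ElementTree as ET


def ncr_is_accepted(code_point: int) -> bool:
    """Return True if the XML parser accepts &#xNN; for this code point."""
    doc = "<a>&#x{:X};</a>".format(code_point)
    try:
        ET.fromstring(doc)
    except ET.ParseError:
        return False
    return True
```

With this helper, references to U+0009, U+000A and U+000D are accepted, as is a C1 control such as U+0085, while references to other C0 controls such as U+0001 or U+000C are rejected as not well-formed.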

#6

If the document is known to be XHTML, then you should just load it and validate it against the schema.

#7

What programming language do you use? At least for Java there exist libraries to check the encoding of a string (or byte-array). I guess similar libraries would exist for other languages too.

#8

Do I understand your question correctly: you want to check whether the data submitted by a form is valid, and properly encoded?

If so, why do several things at once? It would be a lot easier to separate those checks, and perform them step by step, IMHO.

You want to check that the submitted form data is correctly encoded (in UTF-8, I gather). As Archchancellor Ridcully says, that's easy to check in most languages.

Then, if the encoding is correct, you can check whether it's valid form data.

Then, if the form data is valid, you can check whether the data contains what you expect.
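
The three steps above could be sketched roughly as follows. Everything here is an assumption made for illustration: the function name `check_field`, the control-character policy in step 2 (only TAB, LF and CR allowed below 0x20, DEL rejected) and the length limit in step 3 stand in for whatever rules your application actually needs.

```python
from typing import Optional

# Assumed policy: the only C0 controls accepted in a field value.
ALLOWED_C0 = {"\t", "\n", "\r"}


def check_field(raw: bytes, max_length: int = 1000) -> Optional[str]:
    """Return None if the field passes, otherwise a short reason string."""
    # Step 1: is the byte string correctly encoded UTF-8?
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "not valid UTF-8"

    # Step 2: is it valid form data under the assumed control-character policy?
    for ch in text:
        cp = ord(ch)
        if (cp < 0x20 and ch not in ALLOWED_C0) or cp == 0x7F:
            return "disallowed control character U+{:04X}".format(cp)

    # Step 3: does the data contain what the application expects?
    if len(text) > max_length:
        return "too long"
    return None
```

Keeping the steps in this order means each later check can assume the earlier ones passed, so step 2 works on decoded characters instead of raw bytes.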
