Posting XML with Chinese characters to the Microsoft Translator API causes a deserialization exception

Date: 2022-10-17 11:10:30

I'm trying to translate from Chinese (Simplified) to English using the Microsoft Translator API.

A couple of requirements

  • I must use the HTTP POST method rather than GET with a query string, because my queries exceed Microsoft's URI limit of 15,845 characters. (This can happen even when the text stays under the 10,000-character limit, as is the case with Chinese characters: the query string has to be URL-encoded, which dramatically increases its length, but Microsoft decodes it before the character count is determined.)

  • The only translate HTTP method that allows POSTs is the TranslateArrayMethod; the TranslateMethod, for example, only allows GETs. Unfortunately, the TranslateArrayMethod only accepts an XML document, so I must work with XML.
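
To illustrate the length inflation described in the first requirement, here is a minimal Node.js sketch. The sample text and the repeat count are made up for illustration; the 15,845 figure is the URI limit quoted above.

```javascript
// Hypothetical sample: 2,000 Chinese characters, well under the 10,000-character limit.
const text = '南'.repeat(2000);

// Each of these characters takes 3 bytes in UTF-8, and every byte becomes a
// 3-character %XX escape, so one character grows to 9 characters in a URL.
const encoded = encodeURIComponent(text);

const URI_LIMIT = 15845; // URI limit quoted in the question

console.log(text.length);                // 2000
console.log(encoded.length);             // 18000
console.log(encoded.length > URI_LIMIT); // true
```

So a text that is comfortably within the character limit can still produce a query string that is too long for a GET request.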

The following is an example of an XML document that I am sending:

<TranslateArrayRequest>
    <AppId/>
    <From>es</From>
    <Options>
        <ContentType xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2">text/plain</ContentType>
    </Options>
    <Texts>
        <string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
        <![CDATA[Hola]]>
        </string>
    </Texts>
    <To>en</To>
</TranslateArrayRequest>

This works fine; the result is:

<ArrayOfTranslateArrayResponse xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
    <TranslateArrayResponse>
        <From>es</From>
        <OriginalTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
            <a:int>4</a:int>
        </OriginalTextSentenceLengths>
        <TranslatedText>Hello</TranslatedText>
        <TranslatedTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
            <a:int>5</a:int>
        </TranslatedTextSentenceLengths>
    </TranslateArrayResponse>
</ArrayOfTranslateArrayResponse>

However, if I then add any Chinese character, like so:

<TranslateArrayRequest>
    <AppId/>
    <From>zh-CHS</From>
    <Options>
        <ContentType xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2">text/plain</ContentType>
    </Options>
    <Texts>
        <string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
        <![CDATA[南]]>
        </string>
    </Texts>
    <To>en</To>
</TranslateArrayRequest>

I get a weird response:

<html>
    <body/>
    <h1>System.Runtime.Serialization.SerializationException</h1>
    <p>Message: There was an error deserializing the object of type Microsoft.MT.MDistributor.V2.TranslateArrayRequest. Unexpected end of file. Following elements are not closed: TranslateArrayRequest. Line 1, position 298.</p>
</html>

Note that I also tried sending the text without CDATA escaping, but that doesn't help. Changing the From language has no effect either.

I'm working with Node.js (JavaScript), although since this is a generic HTTP API, I don't think that should matter.

2 Answers

#1


OK, I encountered exactly the same problem calling one of the Microsoft Translator POST APIs from Node.js. The API works fine and returns the translation as expected as long as there are no non-ASCII characters, but when I add a single accented 'é' character to the appropriate <string> section of the POST body, it responds with an error:

    <html><body/><h1>System.Runtime.Serialization.SerializationException</h1>
<p>Message: There was an error deserializing the object of type Microsoft.MT.MDistributor.V2.TranslateArrayRequest. Unexpected end of file. Following elements are not closed: TranslateArrayRequest. Line 1, position 782.</p>
</html>

I figured out that the problem is that the Content-Length header must give the length in bytes, but I had been sending the length in characters. Why does this happen? The typical way to measure the length of the body for a Node http request is to call

var length = body.length

and get the "length" of the string, i.e. the number of characters. That works when all of the characters are ASCII. In UTF-8, however, every non-ASCII character (including my accented 'é') takes more than one byte. So when the body contains non-ASCII characters, the byte length is greater than the character length, and the character length is the wrong value to send. In this case it caused the Microsoft server to stop reading the message prematurely, which produced the error message.
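
To make the truncation concrete, here is a small sketch that simulates a receiver trusting a Content-Length equal to the character count. The request body is a made-up fragment, not the real TranslateArrayRequest, and the "receiver" is only a local simulation of what the server presumably does.

```javascript
// Hypothetical request body containing one non-ASCII character.
const body = '<Texts><string>café</string></Texts>';

const charLength = body.length;                     // UTF-16 code units: 36
const byteLength = Buffer.byteLength(body, 'utf8'); // bytes on the wire: 37

// Simulated receiver: it reads only as many bytes as Content-Length announced,
// so the final byte (the closing '>') is never seen.
const received = Buffer.from(body, 'utf8')
  .subarray(0, charLength)
  .toString('utf8');

console.log(charLength, byteLength); // 36 37
console.log(received);               // <Texts><string>café</string></Texts
```

The truncated document has an unclosed root element, which matches the "Unexpected end of file" wording in the error.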

Instead, we need to measure the length in bytes with the call (in Node.js)

var length = Buffer.byteLength(body, 'utf8')

and send that length in the Content-Length header; with that change, the Microsoft Translator API works again.
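
Put together, the fix might look like the sketch below. The helper name buildPostHeaders and the Content-Type value are assumptions for illustration, not something the Translator documentation prescribes.

```javascript
// Hypothetical helper: build headers for a POST whose body is a UTF-8 string.
function buildPostHeaders(body) {
  return {
    'Content-Type': 'text/xml; charset=utf-8',         // assumed value
    'Content-Length': Buffer.byteLength(body, 'utf8'), // bytes, not body.length
  };
}

const body = '<string>南</string>';
const headers = buildPostHeaders(body);

console.log(body.length);               // 18
console.log(headers['Content-Length']); // 20
```

The resulting object can be passed as the headers option of an http.request call, with the body written using the same 'utf8' encoding.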

#2


Most probably, the problem is not the Chinese language but that MS Translator doesn't like newline characters. When I ran into this error message, I changed the following:

  1. In the content of every <string> node, replaced newline characters with an empty string. These characters have the Unicode code points 0xA, 0xB, 0xC, 0xD, 0x85, 0x2028, and 0x2029.

  2. In the content of every <string> node, replaced the XML reserved characters with their entity references:

    & → &amp;

    < → &lt;

    > → &gt;

    ' → &apos;

    " → &quot;

  3. Rearranged the entire XML onto a single line.

After that, all worked smoothly. Concerning your particular example, the symbol "南" was translated as "South". I didn't use CDATA escaping.
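
The first two steps above can be sketched in Node.js roughly as follows. The helper name sanitizeForStringNode is hypothetical; the character list and entity table are taken from the steps as written.

```javascript
// Line-break characters listed above: 0xA, 0xB, 0xC, 0xD, 0x85, 0x2028, 0x2029.
const LINE_BREAKS = /[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]/g;

// XML reserved characters and their entity references.
const XML_ENTITIES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  "'": '&apos;',
  '"': '&quot;',
};

// Hypothetical helper: prepare one text for use inside a <string> element.
function sanitizeForStringNode(text) {
  return text
    .replace(LINE_BREAKS, '')                       // step 1: drop line breaks
    .replace(/[&<>'"]/g, (ch) => XML_ENTITIES[ch]); // step 2: escape reserved characters
}

console.log(sanitizeForStringNode('Tom & "Jerry"\r\n<南>'));
// Tom &amp; &quot;Jerry&quot;&lt;南&gt;
```

Escaping happens in a single pass, so the ampersands introduced by the entity references are not escaped a second time.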
