Java application failing on special characters

Time: 2022-01-22 00:26:25

An application I am working on reads information from files to populate a database. Some of the characters in the files are non-English, for example accented French characters.

The application works fine on Windows, but on our Solaris machine it fails to recognise the special characters and throws an exception. For example, when it encounters the accented e in "Gérer" it reports:

      Encountered: "\u0161" (353), after : "\'G\u00c3\u00a9rer les mod\u00c3"

(an exception which is thrown from our application)

I suspect that in order to stop this from happening I need to change the file.encoding property of the JVM. I tried to do this via System.setProperty() but it has not stopped the error from occurring.

Are there any suggestions for what I could do? I was thinking about setting the default locale of the Solaris platform in /etc/default/init to a UTF-8 locale. Does anyone think this might help?

Any thoughts are much appreciated.

8 solutions

#1


4  

That looks like a file that was converted by native2ascii using the wrong parameters. To demonstrate, create a file with the contents

Gérer les modÚ

and save it as "a.txt" with the encoding UTF-8. Then run this command:

native2ascii -encoding windows-1252 a.txt b.txt

Open the new file and you should see this:

G\u00c3\u00a9rer les mod\u00c3\u0161

Now reverse the process, but specify ISO-8859-1 this time:

native2ascii -reverse -encoding ISO-8859-1 b.txt c.txt

Read the new file as UTF-8 and you should see this:

Gérer les modÀ\u0161

It recovers the "é" okay, but chokes on the "Ú", like your app did.

I don't know everything that is going wrong in your app, but I'm pretty sure incorrect use of native2ascii is part of it. And that was probably the result of letting the app use the system default encoding. You should always specify the encoding when you save text, whether it's to a file, a database, or anything else; never let it default. And if you don't have a good reason to choose something else, use UTF-8.
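
The first half of that demonstration can also be reproduced in plain Java, without native2ascii. This is only a sketch: the class and method names (MojibakeDemo, misdecode, escape) are made up for illustration, and the escape format merely imitates what native2ascii writes. It encodes the sample text as UTF-8 and then decodes those bytes as windows-1252, which is the mismatch described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes the text as UTF-8, then decodes the same bytes with the wrong charset.
    static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    // Writes every non-ASCII char as a backslash-u escape with four lowercase hex digits.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The sample text from the answer, written with escapes so that the
        // encoding of this source file does not matter.
        String original = "G\u00e9rer les mod\u00da";
        String garbled = misdecode(original, Charset.forName("windows-1252"));
        System.out.println(escape(garbled));
    }
}
```

Run on the sample text, this should print G\u00c3\u00a9rer les mod\u00c3\u0161, the same content as b.txt: each accented character has turned into two characters, and the second half of the "Ú" is the \u0161 that the exception complains about.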

#2


2  

Try to use

java -Dfile.encoding=UTF-8 ...

when starting the application on both systems.

Another way to solve the problem is to change the default encoding of both systems to UTF-8, but I prefer the first option (less intrusive on the system).

EDIT:

Check this answer as well; it might also help:

Changing the default encoding for String(byte[])
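
A likely reason the System.setProperty() attempt in the question had no effect is that the JVM fixes its default charset at startup, so changing file.encoding later is not picked up by Charset.defaultCharset(). The sketch below is one way to check this on a given machine; the class name DefaultCharsetCheck is hypothetical, and the behaviour described is what I would expect from the standard JDK rather than something the original posts confirm.

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // Returns the charset the JVM falls back to when no encoding is given explicitly.
    static Charset current() {
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        Charset before = current();

        // Try to switch the default after startup, picking a value that differs
        // from the current one.
        String other = "UTF-8".equals(before.name()) ? "ISO-8859-1" : "UTF-8";
        System.setProperty("file.encoding", other);

        Charset after = current();
        System.out.println("default charset at startup:        " + before);
        System.out.println("default charset after setProperty: " + after);
        System.out.println("unchanged: " + before.equals(after));
    }
}
```

Starting the same class with java -Dfile.encoding=UTF-8 DefaultCharsetCheck shows whether the command-line option, unlike the runtime call, actually takes effect.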

#3


1  

Instead of setting the system-wide character encoding, it might be easier and more robust to specify the character encoding when reading and writing specific text data. How is your application reading the files? The Java I/O readers and writers that convert between bytes and text, such as InputStreamReader and OutputStreamWriter, accept a character encoding name to use for that conversion. If you don't specify one, they use the platform default encoding, which is likely what you are experiencing.

Some databases are surprisingly limited in the text encodings they can accept. If your Java application reads the files as text, in the proper encoding, then it can output that text to the database in whatever form the database needs. If your database doesn't support any encoding whose character repertoire includes the non-ASCII characters you have, then you may need to encode your non-English text first, for example into UTF-8 bytes, and then Base64-encode those bytes as ASCII text.
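
The UTF-8 plus Base64 fallback could look roughly like this. It is a sketch only: AsciiSafeText and its two methods are names invented here, and it assumes Java 8 or later for java.util.Base64.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class AsciiSafeText {
    // Text -> UTF-8 bytes -> Base64, so the stored value contains only ASCII characters.
    static String encode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8);
    }

    // Base64 -> UTF-8 bytes -> text, reversing encode().
    static String decode(String ascii) {
        byte[] utf8 = Base64.getDecoder().decode(ascii);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "G\u00e9rer"; // the word from the question, with the accent escaped
        String stored = encode(original);
        System.out.println(stored);                          // R8OpcmVy
        System.out.println(original.equals(decode(stored))); // true
    }
}
```

The cost of this approach is that the database can no longer search or sort the text meaningfully, so it only makes sense when the database really cannot hold the characters directly.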

PS: Never use String.getBytes() with no character encoding argument for exactly the reasons you are seeing.
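
To illustrate why the argument matters, here is a small sketch (the class name ExplicitGetBytes is hypothetical) showing that the same string produces different bytes depending on the charset. With the no-argument getBytes(), which of these you get depends on the machine the code runs on.

```java
import java.nio.charset.StandardCharsets;

class ExplicitGetBytes {
    // Number of bytes the string occupies in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies in ISO-8859-1.
    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        String word = "G\u00e9rer"; // five characters, one of them accented
        System.out.println("UTF-8 bytes:      " + utf8Length(word));   // 6
        System.out.println("ISO-8859-1 bytes: " + latin1Length(word)); // 5
    }
}
```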

#4


1  

I managed to get past this error by running the command

export LC_ALL='en_GB.UTF-8'

This command sets the locale for the shell that I was in. Because LC_ALL overrides every LC_* locale category, they all now use the UTF-8 encoding.

Many thanks for all of your suggestions.

#5


0  

You can also set the encoding at the command line, like so: java -Dfile.encoding=utf-8

#6


0  

I think we'll need more information to be able to help you with your problem:

  1. What exception are you getting exactly, and which method are you calling when it occurs?
  2. What is the encoding of the input file? UTF-8? UTF-16/Unicode? ISO-8859-1?

It'll also be helpful if you could provide us with relevant code snippets.

Also, a few things I want to point out:

  1. The problem isn't occurring at the 'é' but later on.
  2. It sounds like the character encoding may be hard-coded somewhere in your application.

#7


0  

Also, you may want to verify that the operating system packages that support UTF-8 (SUNWeulux, SUNWeuluf, etc.) are installed.

#8


0  

By default, Java uses the operating system's default encoding when reading and writing files, and one should never rely on that. It is always good practice to specify the encoding explicitly.

In Java you can use the following for reading and writing:

Reading:

BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputPath), "UTF-8"));

Writing:

PrintWriter pw = new PrintWriter(new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));
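
For completeness, here is a self-contained round trip along the same lines. It is a sketch under a few assumptions: Java 7 or later (for java.nio.file.Files and try-with-resources), and the class name Utf8RoundTrip and its helper methods are invented for this example. It uses StandardCharsets.UTF_8 instead of the "UTF-8" string, which avoids the checked UnsupportedEncodingException, and closes the streams automatically.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8RoundTrip {
    // Writes one line of text to the file, always encoded as UTF-8.
    static void write(Path path, String line) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            writer.write(line);
            writer.newLine();
        }
    }

    // Reads the first line of the file, always decoding it as UTF-8.
    static String readFirstLine(Path path) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "G\u00e9rer les mod\u00e8les"; // accented sample text, escaped in source
        Path tmp = Files.createTempFile("encoding-demo", ".txt");
        try {
            write(tmp, text);
            // true regardless of the platform default encoding
            System.out.println(text.equals(readFirstLine(tmp)));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because both sides name the charset, the result should be the same on Windows and Solaris, whatever locale the JVM was started under.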
