How to find the default charset/encoding in Java?

Date: 2021-07-07 20:14:27

The obvious answer is to use Charset.defaultCharset(), but we recently found out that this might not be the right answer. I was told that on several occasions the result differs from the real default charset used by the java.io classes. It looks like Java keeps two sets of default charsets. Does anyone have any insight into this issue?

We were able to reproduce one failure case. It's a kind of user error, but it may still expose the root cause of all the other problems. Here is the code:

import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class CharSetTest {

    public static void main(String[] args) {
        System.out.println("Default Charset=" + Charset.defaultCharset());
        // Changing file.encoding at runtime is the user error being reproduced.
        System.setProperty("file.encoding", "Latin-1");
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("Default Charset in Use=" + getDefaultCharSet());
    }

    // Reports the encoding an OutputStreamWriter picks when none is specified.
    private static String getDefaultCharSet() {
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        String enc = writer.getEncoding();
        return enc;
    }
}

Our server requires a default charset of Latin-1 to deal with some mixed encodings (ANSI/Latin-1/UTF-8) in a legacy protocol, so all our servers run with this JVM parameter:

-Dfile.encoding=ISO-8859-1

Here is the result on Java 5:

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=UTF-8
Default Charset in Use=ISO8859_1

Someone tried to change the encoding at runtime by setting file.encoding in the code. We all know that doesn't work. However, it apparently throws off defaultCharset(), while it doesn't affect the real default charset used by OutputStreamWriter.

Is this a bug or a feature?

EDIT: The accepted answer shows the root cause of the issue. Basically, you can't trust defaultCharset() in Java 5, because it does not return the default encoding used by the I/O classes. It looks like Java 6 corrects this issue.
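
To check on a given JVM whether the two defaults agree, a small probe along these lines may help. This is only a sketch: the class and method names are made up for illustration, and it assumes that the (possibly historical) name returned by getEncoding(), such as ISO8859_1, is an alias that Charset.forName can resolve.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class DefaultCharsetCheck {

    /** Charset that a writer created without an explicit encoding actually uses. */
    static Charset charsetUsedByWriter() {
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        // getEncoding() may return a historical name such as "ISO8859_1";
        // Charset.forName resolves aliases to the canonical charset.
        return Charset.forName(writer.getEncoding());
    }

    /** True when Charset.defaultCharset() matches what the writer uses. */
    static boolean defaultsAgree() {
        return Charset.defaultCharset().equals(charsetUsedByWriter());
    }

    public static void main(String[] args) {
        System.out.println("Charset.defaultCharset() = " + Charset.defaultCharset());
        System.out.println("Writer charset           = " + charsetUsedByWriter());
        System.out.println("Agree                    = " + defaultsAgree());
    }
}
```

Comparing Charset objects rather than name strings avoids a false mismatch between ISO-8859-1 and ISO8859_1, which are two names for the same charset.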

6 answers

#1


59  

This is really strange... Once set, the default Charset is cached, and it isn't changed while the class is in memory. Setting the "file.encoding" property with System.setProperty("file.encoding", "Latin-1"); does nothing. Every time Charset.defaultCharset() is called, it returns the cached charset.

Here are my results:

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=ISO-8859-1
Default Charset in Use=ISO8859_1

I'm using JVM 1.6 though.

(update)

OK. I did reproduce your bug with JVM 1.5.

Looking at the source code of 1.5, the cached default charset is never set. I don't know whether this is a bug, but 1.6 changes this implementation and uses the cached charset:

JVM 1.5:

public static Charset defaultCharset() {
    synchronized (Charset.class) {
        if (defaultCharset == null) {
            java.security.PrivilegedAction pa =
                new GetPropertyAction("file.encoding");
            String csn = (String)AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                return cs;
            return forName("UTF-8");
        }
        return defaultCharset;
    }
}

JVM 1.6:

public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            java.security.PrivilegedAction pa =
                new GetPropertyAction("file.encoding");
            String csn = (String)AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}

When you set file.encoding=Latin-1 and then call Charset.defaultCharset() again, the cached default charset isn't set, so the method tries to find a charset named Latin-1. That name isn't found, because it isn't a valid charset name, so the method returns the fallback, UTF-8.

As for why I/O classes such as OutputStreamWriter return an unexpected result: the implementation of sun.nio.cs.StreamEncoder (which these I/O classes use) also differs between JVM 1.5 and JVM 1.6. When no encoding is passed to the I/O class, the JVM 1.6 implementation calls Charset.defaultCharset() to get the default encoding. The JVM 1.5 implementation calls a different method, Converters.getDefaultEncodingName(), which uses its own cache of the default charset, set at JVM initialization:

JVM 1.6:

   public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                     Object lock,
                                                     String charsetName)
       throws UnsupportedEncodingException
   {
       String csn = charsetName;
       if (csn == null)
           csn = Charset.defaultCharset().name();
       try {
           if (Charset.isSupported(csn))
               return new StreamEncoder(out, lock, Charset.forName(csn));
       } catch (IllegalCharsetNameException x) { }
       throw new UnsupportedEncodingException (csn);
   }

JVM 1.5:

public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                  Object lock,
                                                  String charsetName)
    throws UnsupportedEncodingException
{
    String csn = charsetName;
    if (csn == null)
        csn = Converters.getDefaultEncodingName();
    if (!Converters.isCached(Converters.CHAR_TO_BYTE, csn)) {
        try {
            if (Charset.isSupported(csn))
                return new CharsetSE(out, lock, Charset.forName(csn));
        } catch (IllegalCharsetNameException x) { }
    }
    return new ConverterSE(out, lock, csn);
}

But I agree with the comments. You shouldn't rely on this property. It's an implementation detail.

#2


23  

Is this a bug or feature?

Looks like undefined behaviour. I know that, in practice, you can change the default encoding using a command-line property, but I don't think what happens when you do this is defined.

Bug ID: 4153515 on problems setting this property:

This is not a bug. The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.

The preferred way to change the default encoding used by the VM and the runtime system is to change the locale of the underlying platform before starting your Java program.

I cringe when I see people setting the encoding on the command line - you don't know what code that is going to affect.

If you do not want to use the default encoding, set the encoding you do want explicitly via the appropriate method/constructor.
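
As a minimal sketch of that advice, the charset can be passed straight to the OutputStreamWriter constructor so the platform default never comes into play. The class and method names here are illustrative only, and the code assumes Java 8 or later (StandardCharsets, UncheckedIOException, try-with-resources).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitEncoding {

    /** Encodes text through a writer whose charset is passed explicitly. */
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The two-argument constructor ignores the platform default entirely.
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café"
        System.out.println(encode(text, StandardCharsets.ISO_8859_1).length); // 4
        System.out.println(encode(text, StandardCharsets.UTF_8).length);      // 5
    }
}
```

The same idea applies to String.getBytes(Charset), new String(byte[], Charset) and InputStreamReader, which all have overloads that take the charset explicitly.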

#3


4  

First, Latin-1 is the same as ISO-8859-1, so, the default was already OK for you. Right?

You successfully set the encoding to ISO-8859-1 with your command line parameter. You also set it programmatically to "Latin-1", but, that's not a recognized value of a file encoding for Java. See http://java.sun.com/javase/6/docs/technotes/guides/intl/encoding.doc.html
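
To see which spellings a particular JVM accepts, something like the following can be run. It is a rough sketch with made-up names; resolveOrFallback only imitates the lookup-then-fall-back-to-UTF-8 pattern discussed in this thread and is not the JDK's own code.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

public class CharsetNameLookup {

    /** Returns the charset for name, or fallback when name is unknown or malformed. */
    static Charset resolveOrFallback(String name, Charset fallback) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : fallback;
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return fallback;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {"ISO-8859-1", "latin1", "ISO8859_1", "Latin-1"};
        for (String name : candidates) {
            System.out.println(name + " -> "
                    + resolveOrFallback(name, StandardCharsets.UTF_8));
        }
    }
}
```

Any name that prints the fallback is one the JVM does not recognize as a canonical name or alias.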

When you do that, looks like Charset resets to UTF-8, from looking at the source. That at least explains most of the behavior.

I don't know why OutputStreamWriter shows ISO8859_1. It delegates to closed-source sun.misc.* classes. I'm guessing it isn't quite dealing with encoding via the same mechanism, which is weird.

But of course you should always be specifying what encoding you mean in this code. I'd never rely on the platform default.

#4


4  

The behaviour is not really that strange. Looking into the implementation of the classes, it is caused by:

  • Charset.defaultCharset() does not cache the determined character set in Java 5.
  • Setting the system property "file.encoding" and invoking Charset.defaultCharset() again causes a second evaluation of the system property. No character set with the name "Latin-1" is found, so Charset.defaultCharset() defaults to "UTF-8".
  • The OutputStreamWriter, however, caches the default character set and has probably already been used during VM initialization, so its default character set diverges from Charset.defaultCharset() if the system property "file.encoding" has been changed at runtime.

As already pointed out, it is not documented how the VM must behave in such a situation. The Charset.defaultCharset() API documentation is not very precise on how the default character set is determined, only mentioning that it is usually done on VM startup, based on factors like the OS default character set or default locale.
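
The caching difference listed above can be modelled in a few lines. This is a simplified illustration, not the JDK source: it reads a hypothetical property named demo.encoding (so that file.encoding is never touched), and both method names are invented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CachedVsUncachedDefault {

    // Hypothetical property name, so the demo never modifies file.encoding.
    static final String PROP = "demo.encoding";

    private static Charset cached; // filled on first use, then never refreshed

    /** Resolves the property value; unknown or malformed names fall back to UTF-8. */
    private static Charset lookup() {
        String name = System.getProperty(PROP);
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Malformed charset name: use the fallback below.
        }
        return StandardCharsets.UTF_8;
    }

    /** Java 5 style: the property is evaluated again on every call. */
    static Charset uncachedDefault() {
        return lookup();
    }

    /** Java 6 style: the first result is stored and returned from then on. */
    static synchronized Charset cachedDefault() {
        if (cached == null) {
            cached = lookup();
        }
        return cached;
    }

    public static void main(String[] args) {
        System.setProperty(PROP, "ISO-8859-1");
        System.out.println(uncachedDefault() + " / " + cachedDefault());
        System.setProperty(PROP, "no-such-charset"); // simulates the bad runtime change
        System.out.println(uncachedDefault() + " / " + cachedDefault());
    }
}
```

After the second setProperty call, the uncached variant falls back to UTF-8 while the cached one keeps reporting ISO-8859-1, which mirrors the split seen in the Java 5 output.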

#5


3  

I have set the VM argument in the WAS server to -Dfile.encoding=UTF-8 to change the server's default character set.

#6


0  

Check

System.getProperty("sun.jnu.encoding")

it seems to be the same encoding as the one used in your system's command line.
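
For a quick side-by-side view, the values mentioned in this thread can be dumped together. This is just a sketch with an invented class name; both system properties are implementation details and may be null on JVMs that do not define them.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class EncodingReport {

    /** Collects the encoding-related values discussed above, in a stable order. */
    static Map<String, String> report() {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("Charset.defaultCharset()", Charset.defaultCharset().name());
        // Implementation-specific properties: either may be absent (null).
        values.put("file.encoding", System.getProperty("file.encoding"));
        values.put("sun.jnu.encoding", System.getProperty("sun.jnu.encoding"));
        return values;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> entry : report().entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```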
